* [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios
@ 2024-08-14 6:28 Kanchana P Sridhar
2024-08-14 6:28 ` [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
ryan.roberts, ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This RFC patch series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
rebased to v6.10 in patch 3 of this series.
[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
Additionally, some of the functionality in zswap_store() has been
modularized, to make it more amenable to supporting mTHPs of any order.
For instance, the same-filled check now takes an index into the folio
to derive the page to be checked. Likewise, a helper
"zswap_store_entry()" is added to store a zswap_entry in the xarray.
For testing purposes, per-mTHP-size vmstat zswap_store event counters
are added, and incremented upon each successful zswap_store of an mTHP.
This patch series is a precursor to ZSWAP compress batching of mTHP
swap-out, and decompress batching of swap-in based on
swapin_readahead(), using Intel IAA hardware acceleration. We would
like to submit these in subsequent RFC patch series, along with
performance improvement data.
Performance Testing:
====================
Testing of this patch series was done with the v6.10 mainline, without
and with this RFC, on an Intel Sapphire Rapids server: dual-socket, 56
cores per socket, 4 IAA devices per socket.
The system has 503 GiB RAM and 176 GiB of swap/ZSWAP, with ZRAM as the
backing swap device. Core frequency was fixed at 2500 MHz.
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. Following a similar methodology as in Ryan Roberts'
"Swap-out mTHP without splitting" series [2], 70 usemem processes were
run, each allocating and writing 1G of memory:
usemem --init-time -w -O -n 70 1g
Other kernel configuration parameters:
ZSWAP Compressor : LZ4, DEFLATE-IAA
ZSWAP Allocator : ZSMALLOC
ZRAM Compressor : LZO-RLE
SWAP page-cluster : 2
In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. With verification, each IAA
compression is decompressed internally by the "iaa_crypto" driver; the
CRCs returned by the hardware are compared, and errors are reported on
mismatch. "deflate-iaa" thus helps ensure better data integrity than
the software compressors.
Throughput reported by usemem and perf sys time for running the test were
measured and averaged across 3 runs:
64KB mTHP:
==========
----------------------------------------------------------
| | | | |
|Kernel | mTHP SWAP-OUT | Throughput | Improvement|
| | | KB/s | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 111,180 | Baseline |
|zswap-mTHP-RFC | ZSWAP lz4 | 115,996 | 4% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 166,048 | 49% |
|----------------------------------------------------------|
| | | | |
|Kernel | mTHP SWAP-OUT | Sys time | Improvement|
| | | sec | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 1,049.69 | Baseline |
|zswap-mTHP RFC | ZSWAP lz4 | 1,178.20 | -12% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 626.12 | 40% |
----------------------------------------------------------
-------------------------------------------------------
| VMSTATS, mTHP ZSWAP stats, | v6.10 | zswap-mTHP |
| mTHP ZRAM stats: | mainline | RFC |
|-------------------------------------------------------|
| pswpin | 16 | 0 |
| pswpout | 7,823,984 | 0 |
| zswpin | 551 | 647 |
| zswpout | 1,410 | 15,175,113 |
|-------------------------------------------------------|
| thp_swpout | 0 | 0 |
| thp_swpout_fallback | 0 | 0 |
| pgmajfault | 2,189 | 2,241 |
|-------------------------------------------------------|
| zswpout_4kb_folio | | 1,497 |
| mthp_zswpout_64kb | | 948,351 |
|-------------------------------------------------------|
| hugepages-64kB/stats/swpout| 488,999 | 0 |
-------------------------------------------------------
2MB PMD-THP/2048K mTHP:
=======================
----------------------------------------------------------
| | | | |
|Kernel | mTHP SWAP-OUT | Throughput | Improvement|
| | | KB/s | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 136,617 | Baseline |
|zswap-mTHP-RFC | ZSWAP lz4 | 137,360 | 1% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 179,097 | 31% |
|----------------------------------------------------------|
| | | | |
|Kernel | mTHP SWAP-OUT | Sys time | Improvement|
| | | sec | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 1,044.40 | Baseline |
|zswap-mTHP RFC | ZSWAP lz4 | 1,035.79 | 1% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 571.31 | 45% |
----------------------------------------------------------
---------------------------------------------------------
| VMSTATS, mTHP ZSWAP stats, | v6.10 | zswap-mTHP |
| mTHP ZRAM stats: | mainline | RFC |
|---------------------------------------------------------|
| pswpin | 0 | 0 |
| pswpout | 8,630,272 | 0 |
| zswpin | 565 | 6,901 |
| zswpout | 1,388 | 15,379,163 |
|---------------------------------------------------------|
| thp_swpout | 16,856 | 0 |
| thp_swpout_fallback | 0 | 0 |
| pgmajfault | 2,184 | 8,532 |
|---------------------------------------------------------|
| zswpout_4kb_folio | | 5,851 |
| mthp_zswpout_2048kb | | 30,026 |
| zswpout_pmd_thp_folio | | 30,026 |
|---------------------------------------------------------|
| hugepages-2048kB/stats/swpout| 16,856 | 0 |
---------------------------------------------------------
As expected, the "Before" (v6.10 mainline) experiment shows relatively
fewer swapouts, because ZRAM utilization is not accounted to the
cgroup. With the introduction of mTHP zswap_store, the "After" data
reflects the higher swapout activity, and the consequent sys time
degradation.
Our goal is to improve ZSWAP mTHP store performance using batching. With
Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
additional RFC series), we are able to demonstrate significant
performance improvements with IAA as compared to software compressors.
For instance, with IAA-Canned compression [3] used with batching of
zswap_store and zswap_load operations, the usemem experiment's 3-run
average throughput improves to 170,461 KB/s (64KB mTHP) and
188,325 KB/s (2MB THP).
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
Kanchana P Sridhar (4):
mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
mm: vmstat: Per mTHP-size zswap_store vmstat event counters.
mm: zswap: zswap_store() extended to handle mTHP folios.
mm: page_io: Count successful mTHP zswap stores in vmstat.
include/linux/vm_event_item.h | 15 +++
mm/page_io.c | 44 +++++++
mm/vmstat.c | 15 +++
mm/zswap.c | 223 ++++++++++++++++++++++++----------
4 files changed, 233 insertions(+), 64 deletions(-)
--
2.27.0
^ permalink raw reply [flat|nested] 11+ messages in thread* [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters Kanchana P Sridhar ` (2 subsequent siblings) 3 siblings, 0 replies; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar This change is being made so that zswap_store can process mTHP folios. Modified zswap_is_folio_same_filled() to work for any-order folios, by accepting an additional "index" parameter to arrive at the page within the folio to run the same-filled page check. Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/zswap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index a50e2986cd2f..a6b0a7c636db 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1373,14 +1373,14 @@ static void shrink_worker(struct work_struct *w) /********************************* * same-filled functions **********************************/ -static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value) +static bool zswap_is_folio_same_filled(struct folio *folio, long index, unsigned long *value) { unsigned long *page; unsigned long val; unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1; bool ret = false; - page = kmap_local_folio(folio, 0); + page = kmap_local_folio(folio, index * PAGE_SIZE); val = page[0]; if (val != page[last_pos]) @@ -1450,7 +1450,7 @@ bool zswap_store(struct folio *folio) goto reject; } - if (zswap_is_folio_same_filled(folio, &value)) { + if (zswap_is_folio_same_filled(folio, 0, &value)) { entry->length = 0; 
entry->value = value; atomic_inc(&zswap_same_filled_pages); -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 7:48 ` Barry Song 2024-08-14 6:28 ` [RFC PATCH v1 3/4] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat Kanchana P Sridhar 3 siblings, 1 reply; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar Added vmstat event counters per mTHP-size that can be used to account for folios of different sizes being successfully stored in ZSWAP. For this RFC, it is not clear if these zswpout counters should instead be added as part of the existing mTHP stats in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. The following is also a viable option, should it make better sense, for instance, as: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. If so, we would be able to distinguish between mTHP zswap and non-zswap swapouts through: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout and /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout respectively. Comments would be appreciated as to which approach is preferable. 
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- include/linux/vm_event_item.h | 15 +++++++++++++++ mm/vmstat.c | 15 +++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 747943bc8cc2..2451bcfcf05c 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_ZERO_PAGE_ALLOC, THP_ZERO_PAGE_ALLOC_FAILED, THP_SWPOUT, +#ifdef CONFIG_ZSWAP + ZSWPOUT_PMD_THP_FOLIO, +#endif THP_SWPOUT_FALLBACK, #endif #ifdef CONFIG_MEMORY_BALLOON @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, ZSWPIN, ZSWPOUT, ZSWPWB, + ZSWPOUT_4KB_FOLIO, +#ifdef CONFIG_THP_SWAP + mTHP_ZSWPOUT_8kB, + mTHP_ZSWPOUT_16kB, + mTHP_ZSWPOUT_32kB, + mTHP_ZSWPOUT_64kB, + mTHP_ZSWPOUT_128kB, + mTHP_ZSWPOUT_256kB, + mTHP_ZSWPOUT_512kB, + mTHP_ZSWPOUT_1024kB, + mTHP_ZSWPOUT_2048kB, +#endif #endif #ifdef CONFIG_X86 DIRECT_MAP_LEVEL2_SPLIT, diff --git a/mm/vmstat.c b/mm/vmstat.c index 8507c497218b..0e66c8b0c486 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { "thp_zero_page_alloc", "thp_zero_page_alloc_failed", "thp_swpout", +#ifdef CONFIG_ZSWAP + "zswpout_pmd_thp_folio", +#endif "thp_swpout_fallback", #endif #ifdef CONFIG_MEMORY_BALLOON @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { "zswpin", "zswpout", "zswpwb", + "zswpout_4kb_folio", +#ifdef CONFIG_THP_SWAP + "mthp_zswpout_8kb", + "mthp_zswpout_16kb", + "mthp_zswpout_32kb", + "mthp_zswpout_64kb", + "mthp_zswpout_128kb", + "mthp_zswpout_256kb", + "mthp_zswpout_512kb", + "mthp_zswpout_1024kb", + "mthp_zswpout_2048kb", +#endif #endif #ifdef CONFIG_X86 "direct_map_level2_splits", -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 6:28 ` [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters Kanchana P Sridhar @ 2024-08-14 7:48 ` Barry Song 2024-08-14 17:40 ` Sridhar, Kanchana P 0 siblings, 1 reply; 11+ messages in thread From: Barry Song @ 2024-08-14 7:48 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added vmstat event counters per mTHP-size that can be used to account > for folios of different sizes being successfully stored in ZSWAP. > > For this RFC, it is not clear if these zswpout counters should instead > be added as part of the existing mTHP stats in > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > The following is also a viable option, should it make better sense, > for instance, as: > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. > > If so, we would be able to distinguish between mTHP zswap and > non-zswap swapouts through: > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > and > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > respectively. > > Comments would be appreciated as to which approach is preferable. Even though swapout might go through zswap, from the perspective of the mm core, it shouldn't be aware of that. Shouldn't zswpout be part of swpout? Why are they separate? no matter if a mTHP has been put in zswap, it has been swapped-out to mm-core? No? 
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > include/linux/vm_event_item.h | 15 +++++++++++++++ > mm/vmstat.c | 15 +++++++++++++++ > 2 files changed, 30 insertions(+) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 747943bc8cc2..2451bcfcf05c 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > THP_ZERO_PAGE_ALLOC, > THP_ZERO_PAGE_ALLOC_FAILED, > THP_SWPOUT, > +#ifdef CONFIG_ZSWAP > + ZSWPOUT_PMD_THP_FOLIO, > +#endif > THP_SWPOUT_FALLBACK, > #endif > #ifdef CONFIG_MEMORY_BALLOON > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > ZSWPIN, > ZSWPOUT, > ZSWPWB, > + ZSWPOUT_4KB_FOLIO, > +#ifdef CONFIG_THP_SWAP > + mTHP_ZSWPOUT_8kB, > + mTHP_ZSWPOUT_16kB, > + mTHP_ZSWPOUT_32kB, > + mTHP_ZSWPOUT_64kB, > + mTHP_ZSWPOUT_128kB, > + mTHP_ZSWPOUT_256kB, > + mTHP_ZSWPOUT_512kB, > + mTHP_ZSWPOUT_1024kB, > + mTHP_ZSWPOUT_2048kB, > +#endif This implementation hardcodes assumptions about the page size being 4KB, but page sizes can vary, and so can the THP orders? 
> #endif > #ifdef CONFIG_X86 > DIRECT_MAP_LEVEL2_SPLIT, > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 8507c497218b..0e66c8b0c486 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > "thp_zero_page_alloc", > "thp_zero_page_alloc_failed", > "thp_swpout", > +#ifdef CONFIG_ZSWAP > + "zswpout_pmd_thp_folio", > +#endif > "thp_swpout_fallback", > #endif > #ifdef CONFIG_MEMORY_BALLOON > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > "zswpin", > "zswpout", > "zswpwb", > + "zswpout_4kb_folio", > +#ifdef CONFIG_THP_SWAP > + "mthp_zswpout_8kb", > + "mthp_zswpout_16kb", > + "mthp_zswpout_32kb", > + "mthp_zswpout_64kb", > + "mthp_zswpout_128kb", > + "mthp_zswpout_256kb", > + "mthp_zswpout_512kb", > + "mthp_zswpout_1024kb", > + "mthp_zswpout_2048kb", > +#endif The issue here is that the number of THP orders can vary across different platforms. > #endif > #ifdef CONFIG_X86 > "direct_map_level2_splits", > -- > 2.27.0 > Thanks Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 7:48 ` Barry Song @ 2024-08-14 17:40 ` Sridhar, Kanchana P 2024-08-14 23:24 ` Barry Song 0 siblings, 1 reply; 11+ messages in thread From: Sridhar, Kanchana P @ 2024-08-14 17:40 UTC (permalink / raw) To: Barry Song Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P Hi Barry, > -----Original Message----- > From: Barry Song <21cnbao@gmail.com> > Sent: Wednesday, August 14, 2024 12:49 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > vmstat event counters. > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added vmstat event counters per mTHP-size that can be used to account > > for folios of different sizes being successfully stored in ZSWAP. > > > > For this RFC, it is not clear if these zswpout counters should instead > > be added as part of the existing mTHP stats in > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > > > The following is also a viable option, should it make better sense, > > for instance, as: > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. > > > > If so, we would be able to distinguish between mTHP zswap and > > non-zswap swapouts through: > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > and > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > > > respectively. 
> > > > Comments would be appreciated as to which approach is preferable. > > Even though swapout might go through zswap, from the perspective of > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part > of swpout? Why are they separate? no matter if a mTHP has been > put in zswap, it has been swapped-out to mm-core? No? Thanks for the code review comments. This is a good point. I was keeping in mind the convention used by existing vmstat event counters that distinguish zswpout/zswpin from pswpout/pswpin events. If we want to keep the distinction in mTHP swapouts, would adding a separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok? In any case, it looks like all that would be needed is a call to count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the general case. I will make this change in v2, depending on whether or not the separation of zswpout vs. non-zswap swpout is recommended for mTHP. > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > include/linux/vm_event_item.h | 15 +++++++++++++++ > > mm/vmstat.c | 15 +++++++++++++++ > > 2 files changed, 30 insertions(+) > > > > diff --git a/include/linux/vm_event_item.h > b/include/linux/vm_event_item.h > > index 747943bc8cc2..2451bcfcf05c 100644 > > --- a/include/linux/vm_event_item.h > > +++ b/include/linux/vm_event_item.h > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, > PSWPIN, PSWPOUT, > > THP_ZERO_PAGE_ALLOC, > > THP_ZERO_PAGE_ALLOC_FAILED, > > THP_SWPOUT, > > +#ifdef CONFIG_ZSWAP > > + ZSWPOUT_PMD_THP_FOLIO, > > +#endif > > THP_SWPOUT_FALLBACK, > > #endif > > #ifdef CONFIG_MEMORY_BALLOON > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, > PSWPIN, PSWPOUT, > > ZSWPIN, > > ZSWPOUT, > > ZSWPWB, > > + ZSWPOUT_4KB_FOLIO, > > +#ifdef CONFIG_THP_SWAP > > + mTHP_ZSWPOUT_8kB, > > + mTHP_ZSWPOUT_16kB, > > + mTHP_ZSWPOUT_32kB, > > + mTHP_ZSWPOUT_64kB, > > + mTHP_ZSWPOUT_128kB, > > + mTHP_ZSWPOUT_256kB, > > + mTHP_ZSWPOUT_512kB, > 
> + mTHP_ZSWPOUT_1024kB, > > + mTHP_ZSWPOUT_2048kB, > > +#endif > > This implementation hardcodes assumptions about the page size being 4KB, > but page sizes can vary, and so can the THP orders? Agreed, will address in v2. > > > #endif > > #ifdef CONFIG_X86 > > DIRECT_MAP_LEVEL2_SPLIT, > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index 8507c497218b..0e66c8b0c486 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > > "thp_zero_page_alloc", > > "thp_zero_page_alloc_failed", > > "thp_swpout", > > +#ifdef CONFIG_ZSWAP > > + "zswpout_pmd_thp_folio", > > +#endif > > "thp_swpout_fallback", > > #endif > > #ifdef CONFIG_MEMORY_BALLOON > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > > "zswpin", > > "zswpout", > > "zswpwb", > > + "zswpout_4kb_folio", > > +#ifdef CONFIG_THP_SWAP > > + "mthp_zswpout_8kb", > > + "mthp_zswpout_16kb", > > + "mthp_zswpout_32kb", > > + "mthp_zswpout_64kb", > > + "mthp_zswpout_128kb", > > + "mthp_zswpout_256kb", > > + "mthp_zswpout_512kb", > > + "mthp_zswpout_1024kb", > > + "mthp_zswpout_2048kb", > > +#endif > > The issue here is that the number of THP orders > can vary across different platforms. Agreed, will address in v2. Thanks, Kanchana > > > #endif > > #ifdef CONFIG_X86 > > "direct_map_level2_splits", > > -- > > 2.27.0 > > > > Thanks > Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 17:40 ` Sridhar, Kanchana P @ 2024-08-14 23:24 ` Barry Song 2024-08-15 1:37 ` Sridhar, Kanchana P 0 siblings, 1 reply; 11+ messages in thread From: Barry Song @ 2024-08-14 23:24 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Thu, Aug 15, 2024 at 5:40 AM Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> wrote: > > Hi Barry, > > > -----Original Message----- > > From: Barry Song <21cnbao@gmail.com> > > Sent: Wednesday, August 14, 2024 12:49 AM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > > vmstat event counters. > > > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > Added vmstat event counters per mTHP-size that can be used to account > > > for folios of different sizes being successfully stored in ZSWAP. > > > > > > For this RFC, it is not clear if these zswpout counters should instead > > > be added as part of the existing mTHP stats in > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > > > > > The following is also a viable option, should it make better sense, > > > for instance, as: > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. 
> > > > > > If so, we would be able to distinguish between mTHP zswap and > > > non-zswap swapouts through: > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > > > and > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > > > > > respectively. > > > > > > Comments would be appreciated as to which approach is preferable. > > > > Even though swapout might go through zswap, from the perspective of > > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part > > of swpout? Why are they separate? no matter if a mTHP has been > > put in zswap, it has been swapped-out to mm-core? No? > > Thanks for the code review comments. This is a good point. I was keeping in > mind the convention used by existing vmstat event counters that distinguish > zswpout/zswpin from pswpout/pswpin events. > > If we want to keep the distinction in mTHP swapouts, would adding a > separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok? > I'm not entirely sure how important the zswpout counter is. To me, it doesn't seem as critical as swpout and swpout_fallback, which are more useful for system profiling. zswapout feels more like an internal detail related to how the swap-out process is handled? If this is the case, we might not need this per-size counter. Otherwise, I believe sysfs is a better place to avoid all the chaos in vmstat to handle various orders and sizes. So the question is, per-size zswpout counter is really important or just for debugging purposes? > In any case, it looks like all that would be needed is a call to > count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the > general case. > > I will make this change in v2, depending on whether or not the > separation of zswpout vs. non-zswap swpout is recommended for > mTHP. 
> > > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > --- > > > include/linux/vm_event_item.h | 15 +++++++++++++++ > > > mm/vmstat.c | 15 +++++++++++++++ > > > 2 files changed, 30 insertions(+) > > > > > > diff --git a/include/linux/vm_event_item.h > > b/include/linux/vm_event_item.h > > > index 747943bc8cc2..2451bcfcf05c 100644 > > > --- a/include/linux/vm_event_item.h > > > +++ b/include/linux/vm_event_item.h > > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > PSWPIN, PSWPOUT, > > > THP_ZERO_PAGE_ALLOC, > > > THP_ZERO_PAGE_ALLOC_FAILED, > > > THP_SWPOUT, > > > +#ifdef CONFIG_ZSWAP > > > + ZSWPOUT_PMD_THP_FOLIO, > > > +#endif > > > THP_SWPOUT_FALLBACK, > > > #endif > > > #ifdef CONFIG_MEMORY_BALLOON > > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > PSWPIN, PSWPOUT, > > > ZSWPIN, > > > ZSWPOUT, > > > ZSWPWB, > > > + ZSWPOUT_4KB_FOLIO, > > > +#ifdef CONFIG_THP_SWAP > > > + mTHP_ZSWPOUT_8kB, > > > + mTHP_ZSWPOUT_16kB, > > > + mTHP_ZSWPOUT_32kB, > > > + mTHP_ZSWPOUT_64kB, > > > + mTHP_ZSWPOUT_128kB, > > > + mTHP_ZSWPOUT_256kB, > > > + mTHP_ZSWPOUT_512kB, > > > + mTHP_ZSWPOUT_1024kB, > > > + mTHP_ZSWPOUT_2048kB, > > > +#endif > > > > This implementation hardcodes assumptions about the page size being 4KB, > > but page sizes can vary, and so can the THP orders? > > Agreed, will address in v2. 
> > > > > > #endif > > > #ifdef CONFIG_X86 > > > DIRECT_MAP_LEVEL2_SPLIT, > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > index 8507c497218b..0e66c8b0c486 100644 > > > --- a/mm/vmstat.c > > > +++ b/mm/vmstat.c > > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > > > "thp_zero_page_alloc", > > > "thp_zero_page_alloc_failed", > > > "thp_swpout", > > > +#ifdef CONFIG_ZSWAP > > > + "zswpout_pmd_thp_folio", > > > +#endif > > > "thp_swpout_fallback", > > > #endif > > > #ifdef CONFIG_MEMORY_BALLOON > > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > > > "zswpin", > > > "zswpout", > > > "zswpwb", > > > + "zswpout_4kb_folio", > > > +#ifdef CONFIG_THP_SWAP > > > + "mthp_zswpout_8kb", > > > + "mthp_zswpout_16kb", > > > + "mthp_zswpout_32kb", > > > + "mthp_zswpout_64kb", > > > + "mthp_zswpout_128kb", > > > + "mthp_zswpout_256kb", > > > + "mthp_zswpout_512kb", > > > + "mthp_zswpout_1024kb", > > > + "mthp_zswpout_2048kb", > > > +#endif > > > > The issue here is that the number of THP orders > > can vary across different platforms. > > Agreed, will address in v2. > > Thanks, > Kanchana > > > > > > #endif > > > #ifdef CONFIG_X86 > > > "direct_map_level2_splits", > > > -- > > > 2.27.0 > > > Thanks Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 23:24 ` Barry Song @ 2024-08-15 1:37 ` Sridhar, Kanchana P 0 siblings, 0 replies; 11+ messages in thread From: Sridhar, Kanchana P @ 2024-08-15 1:37 UTC (permalink / raw) To: Barry Song Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P Hi Barry, > -----Original Message----- > From: Barry Song <21cnbao@gmail.com> > Sent: Wednesday, August 14, 2024 4:25 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > vmstat event counters. > > On Thu, Aug 15, 2024 at 5:40 AM Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> wrote: > > > > Hi Barry, > > > > > -----Original Message----- > > > From: Barry Song <21cnbao@gmail.com> > > > Sent: Wednesday, August 14, 2024 12:49 AM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > > > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; > akpm@linux- > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > > > vmstat event counters. 
> > > > > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > > > Added vmstat event counters per mTHP-size that can be used to account > > > > for folios of different sizes being successfully stored in ZSWAP. > > > > > > > > For this RFC, it is not clear if these zswpout counters should instead > > > > be added as part of the existing mTHP stats in > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > > > > > > > The following is also a viable option, should it make better sense, > > > > for instance, as: > > > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. > > > > > > > > If so, we would be able to distinguish between mTHP zswap and > > > > non-zswap swapouts through: > > > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > > > > > and > > > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > > > > > > > respectively. > > > > > > > > Comments would be appreciated as to which approach is preferable. > > > > > > Even though swapout might go through zswap, from the perspective of > > > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part > > > of swpout? Why are they separate? no matter if a mTHP has been > > > put in zswap, it has been swapped-out to mm-core? No? > > > > Thanks for the code review comments. This is a good point. I was keeping in > > mind the convention used by existing vmstat event counters that distinguish > > zswpout/zswpin from pswpout/pswpin events. > > > > If we want to keep the distinction in mTHP swapouts, would adding a > > separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok? > > > > I'm not entirely sure how important the zswpout counter is. To me, it doesn't > seem as critical as swpout and swpout_fallback, which are more useful for > system profiling. zswapout feels more like an internal detail related to > how the swap-out process is handled? 
If this is the case, we might not > need this per-size counter. > > Otherwise, I believe sysfs is a better place to avoid all the chaos in vmstat > to handle various orders and sizes. So the question is, per-size zswpout > counter is really important or just for debugging purposes? I agree, sysfs would be a cleaner mTHP stats accounting solution, given the existing mTHP swpout stats under the per-order /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout. I personally find distinct zswap vs. bdev/fs swapout accounting useful for debugging, and for overall reclaim path characterization. For instance, the impact of different zswap compressors' compress latency on zswpout activity for a given workload. Is a slowdown in compress latency causing active/hot memory to be reclaimed and immediately faulted in? Does better zswap compress efficiency correlate with more cold memory being reclaimed as mTHP? How does the reclaim path efficiency improvement from better zswap_store mTHP performance correlate with ZSWAP utilization and memory savings? I have found these counters useful in understanding some of these characteristics. I also believe it helps to account for the number of mTHP being stored in different compress tiers, for example, how many mTHP were stored in zswap vs. being rejected and stored in the backing swap device. This could help, say, in provisioning zswap memory, and in knowing the impact of zswap compress path latency on scaling. Another interesting characteristic that mTHP zswpout accounting could help understand is compressor incompressibility and/or zpool fragmentation, and being able to better correlate the zswap/reject_* sysfs counters with mTHP [z]swpout stats. I look forward to inputs from yourself and others on the direction and next steps. Thanks, Kanchana > > > In any case, it looks like all that would be needed is a call to > > count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the > > general case.
> > > > I will make this change in v2, depending on whether or not the > > separation of zswpout vs. non-zswap swpout is recommended for > > mTHP. > > > > > > > > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > > --- > > > > include/linux/vm_event_item.h | 15 +++++++++++++++ > > > > mm/vmstat.c | 15 +++++++++++++++ > > > > 2 files changed, 30 insertions(+) > > > > > > > > diff --git a/include/linux/vm_event_item.h > > > b/include/linux/vm_event_item.h > > > > index 747943bc8cc2..2451bcfcf05c 100644 > > > > --- a/include/linux/vm_event_item.h > > > > +++ b/include/linux/vm_event_item.h > > > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > > PSWPIN, PSWPOUT, > > > > THP_ZERO_PAGE_ALLOC, > > > > THP_ZERO_PAGE_ALLOC_FAILED, > > > > THP_SWPOUT, > > > > +#ifdef CONFIG_ZSWAP > > > > + ZSWPOUT_PMD_THP_FOLIO, > > > > +#endif > > > > THP_SWPOUT_FALLBACK, > > > > #endif > > > > #ifdef CONFIG_MEMORY_BALLOON > > > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > > PSWPIN, PSWPOUT, > > > > ZSWPIN, > > > > ZSWPOUT, > > > > ZSWPWB, > > > > + ZSWPOUT_4KB_FOLIO, > > > > +#ifdef CONFIG_THP_SWAP > > > > + mTHP_ZSWPOUT_8kB, > > > > + mTHP_ZSWPOUT_16kB, > > > > + mTHP_ZSWPOUT_32kB, > > > > + mTHP_ZSWPOUT_64kB, > > > > + mTHP_ZSWPOUT_128kB, > > > > + mTHP_ZSWPOUT_256kB, > > > > + mTHP_ZSWPOUT_512kB, > > > > + mTHP_ZSWPOUT_1024kB, > > > > + mTHP_ZSWPOUT_2048kB, > > > > +#endif > > > > > > This implementation hardcodes assumptions about the page size being > 4KB, > > > but page sizes can vary, and so can the THP orders? > > > > Agreed, will address in v2. 
> > > > > > > > > #endif > > > > #ifdef CONFIG_X86 > > > > DIRECT_MAP_LEVEL2_SPLIT, > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > > index 8507c497218b..0e66c8b0c486 100644 > > > > --- a/mm/vmstat.c > > > > +++ b/mm/vmstat.c > > > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > > > > "thp_zero_page_alloc", > > > > "thp_zero_page_alloc_failed", > > > > "thp_swpout", > > > > +#ifdef CONFIG_ZSWAP > > > > + "zswpout_pmd_thp_folio", > > > > +#endif > > > > "thp_swpout_fallback", > > > > #endif > > > > #ifdef CONFIG_MEMORY_BALLOON > > > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > > > > "zswpin", > > > > "zswpout", > > > > "zswpwb", > > > > + "zswpout_4kb_folio", > > > > +#ifdef CONFIG_THP_SWAP > > > > + "mthp_zswpout_8kb", > > > > + "mthp_zswpout_16kb", > > > > + "mthp_zswpout_32kb", > > > > + "mthp_zswpout_64kb", > > > > + "mthp_zswpout_128kb", > > > > + "mthp_zswpout_256kb", > > > > + "mthp_zswpout_512kb", > > > > + "mthp_zswpout_1024kb", > > > > + "mthp_zswpout_2048kb", > > > > +#endif > > > > > > The issue here is that the number of THP orders > > > can vary across different platforms. > > > > Agreed, will address in v2. > > > > Thanks, > > Kanchana > > > > > > > > > #endif > > > > #ifdef CONFIG_X86 > > > > "direct_map_level2_splits", > > > > -- > > > > 2.27.0 > > > > > > Thanks > Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH v1 3/4] mm: zswap: zswap_store() extended to handle mTHP folios. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat Kanchana P Sridhar 3 siblings, 0 replies; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar zswap_store() will now process and store mTHP and PMD-size THP folios. This change reuses and adapts the functionality in Ryan Roberts' RFC patch [1]: "[RFC,v1] mm: zswap: Store large folios without splitting" [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u This patch provides a sequential implementation of storing an mTHP in zswap_store() by iterating through each page in the folio to compress and store it in the zswap zpool. Towards this goal, zswap_compress() is modified to take a page instead of a folio as input. Each page's swap offset is stored as a separate zswap entry. If an error is encountered during the store of any page in the mTHP, all previous pages/entries stored will be invalidated. Thus, an mTHP is either entirely stored in ZSWAP, or entirely not stored in ZSWAP. This forms the basis for building batching of pages during zswap store of large folios, by compressing batches of up to say, 8 pages in an mTHP in parallel in hardware, with the Intel In-Memory Analytics Accelerator (Intel IAA). 
Co-developed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/zswap.c | 219 ++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 157 insertions(+), 62 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index a6b0a7c636db..98ff98b485f5 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -899,7 +899,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) return 0; } -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry) +static bool zswap_compress(struct page *page, struct zswap_entry *entry) { struct crypto_acomp_ctx *acomp_ctx; struct scatterlist input, output; @@ -917,7 +917,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry) dst = acomp_ctx->buffer; sg_init_table(&input, 1); - sg_set_page(&input, &folio->page, PAGE_SIZE, 0); + sg_set_page(&input, page, PAGE_SIZE, 0); /* * We need PAGE_SIZE * 2 here since there maybe over-compression case, @@ -1409,36 +1409,82 @@ static void zswap_fill_page(void *ptr, unsigned long value) /********************************* * main API **********************************/ -bool zswap_store(struct folio *folio) + +/* + * Returns true if the entry was successfully + * stored in the xarray, and false otherwise. 
+ */ +static bool zswap_store_entry(struct xarray *tree, + struct zswap_entry *entry) { - swp_entry_t swp = folio->swap; - pgoff_t offset = swp_offset(swp); - struct xarray *tree = swap_zswap_tree(swp); - struct zswap_entry *entry, *old; - struct obj_cgroup *objcg = NULL; - struct mem_cgroup *memcg = NULL; - unsigned long value; + struct zswap_entry *old; + pgoff_t offset = swp_offset(entry->swpentry); - VM_WARN_ON_ONCE(!folio_test_locked(folio)); - VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); + old = xa_store(tree, offset, entry, GFP_KERNEL); - /* Large folios aren't supported */ - if (folio_test_large(folio)) + if (xa_is_err(old)) { + int err = xa_err(old); + + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err); + zswap_reject_alloc_fail++; return false; + } - if (!zswap_enabled) - goto check_old; + /* + * We may have had an existing entry that became stale when + * the folio was redirtied and now the new version is being + * swapped out. Get rid of the old. + */ + if (old) + zswap_entry_free(old); - /* Check cgroup limits */ - objcg = get_obj_cgroup_from_folio(folio); - if (objcg && !obj_cgroup_may_zswap(objcg)) { - memcg = get_mem_cgroup_from_objcg(objcg); - if (shrink_memcg(memcg)) { - mem_cgroup_put(memcg); - goto reject; - } - mem_cgroup_put(memcg); + return true; +} + +/* + * If the zswap store fails or zswap is disabled, we must invalidate the + * possibly stale entry which was previously stored at this offset. + * Otherwise, writeback could overwrite the new data in the swapfile. + * + * This is called after the store of the i-th offset + * in a large folio, has failed. All entries from + * [i-1 .. 0] must be deleted. + * + * This is also called if zswap_store() is called, + * but zswap is not enabled. All offsets for the folio + * are deleted from zswap in this case. 
+ */ +static void zswap_delete_stored_offsets(struct xarray *tree, + pgoff_t offset, + long nr_pages) +{ + struct zswap_entry *entry; + long i; + + for (i = 0; i < nr_pages; ++i) { + entry = xa_erase(tree, offset + i); + if (entry) + zswap_entry_free(entry); } +} + +/* + * Stores the page at specified "index" in a folio. + */ +static bool zswap_store_page(struct folio *folio, long index, + struct obj_cgroup *objcg, + struct zswap_pool *pool) +{ + swp_entry_t swp = folio->swap; + int type = swp_type(swp); + pgoff_t offset = swp_offset(swp) + index; + struct page *page = folio_page(folio, index); + struct xarray *tree = swap_zswap_tree(swp); + struct zswap_entry *entry; + unsigned long value; + + if (objcg) + obj_cgroup_get(objcg); if (zswap_check_limits()) goto reject; @@ -1450,7 +1496,7 @@ bool zswap_store(struct folio *folio) goto reject; } - if (zswap_is_folio_same_filled(folio, 0, &value)) { + if (zswap_is_folio_same_filled(folio, index, &value)) { entry->length = 0; entry->value = value; atomic_inc(&zswap_same_filled_pages); @@ -1458,42 +1504,20 @@ bool zswap_store(struct folio *folio) } /* if entry is successfully added, it keeps the reference */ - entry->pool = zswap_pool_current_get(); - if (!entry->pool) + if (!zswap_pool_get(pool)) goto freepage; - if (objcg) { - memcg = get_mem_cgroup_from_objcg(objcg); - if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) { - mem_cgroup_put(memcg); - goto put_pool; - } - mem_cgroup_put(memcg); - } + entry->pool = pool; - if (!zswap_compress(folio, entry)) + if (!zswap_compress(page, entry)) goto put_pool; store_entry: - entry->swpentry = swp; + entry->swpentry = swp_entry(type, offset); entry->objcg = objcg; - old = xa_store(tree, offset, entry, GFP_KERNEL); - if (xa_is_err(old)) { - int err = xa_err(old); - - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err); - zswap_reject_alloc_fail++; + if (!zswap_store_entry(tree, entry)) goto store_failed; - } - - /* - * We may have had an existing entry 
that became stale when - * the folio was redirtied and now the new version is being - * swapped out. Get rid of the old. - */ - if (old) - zswap_entry_free(old); if (objcg) { obj_cgroup_charge_zswap(objcg, entry->length); @@ -1527,7 +1551,7 @@ bool zswap_store(struct folio *folio) else { zpool_free(zswap_find_zpool(entry), entry->handle); put_pool: - zswap_pool_put(entry->pool); + zswap_pool_put(pool); } freepage: zswap_entry_cache_free(entry); @@ -1535,16 +1559,87 @@ bool zswap_store(struct folio *folio) obj_cgroup_put(objcg); if (zswap_pool_reached_full) queue_work(shrink_wq, &zswap_shrink_work); -check_old: + + return false; +} + +/* + * Modified to store mTHP folios. + * Each page in the mTHP will be compressed + * and stored sequentially. + */ +bool zswap_store(struct folio *folio) +{ + long nr_pages = folio_nr_pages(folio); + swp_entry_t swp = folio->swap; + pgoff_t offset = swp_offset(swp); + struct xarray *tree = swap_zswap_tree(swp); + struct obj_cgroup *objcg = NULL; + struct mem_cgroup *memcg = NULL; + struct zswap_pool *pool; + bool ret = false; + long index; + + VM_WARN_ON_ONCE(!folio_test_locked(folio)); + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); + /* - * If the zswap store fails or zswap is disabled, we must invalidate the - * possibly stale entry which was previously stored at this offset. - * Otherwise, writeback could overwrite the new data in the swapfile. + * If zswap is disabled, we must invalidate the possibly stale entry + * which was previously stored at this offset. Otherwise, writeback + * could overwrite the new data in the swapfile. 
*/ - entry = xa_erase(tree, offset); - if (entry) - zswap_entry_free(entry); - return false; + if (!zswap_enabled) + goto reject; + + /* Check cgroup limits */ + objcg = get_obj_cgroup_from_folio(folio); + if (objcg && !obj_cgroup_may_zswap(objcg)) { + memcg = get_mem_cgroup_from_objcg(objcg); + if (shrink_memcg(memcg)) { + mem_cgroup_put(memcg); + goto put_objcg; + } + mem_cgroup_put(memcg); + } + + if (zswap_check_limits()) + goto put_objcg; + + pool = zswap_pool_current_get(); + if (!pool) + goto put_objcg; + + if (objcg) { + memcg = get_mem_cgroup_from_objcg(objcg); + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) { + mem_cgroup_put(memcg); + goto put_pool; + } + mem_cgroup_put(memcg); + } + + /* + * Store each page of the folio as a separate entry. If we fail to store + * a page, unwind by removing all the previous pages we stored. + */ + for (index = 0; index < nr_pages; ++index) { + if (!zswap_store_page(folio, index, objcg, pool)) + goto put_pool; + } + + ret = true; + +put_pool: + zswap_pool_put(pool); +put_objcg: + obj_cgroup_put(objcg); + if (zswap_pool_reached_full) + queue_work(shrink_wq, &zswap_shrink_work); +reject: + if (!ret) + zswap_delete_stored_offsets(tree, offset, nr_pages); + + return ret; } bool zswap_load(struct folio *folio) -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (2 preceding siblings ...) 2024-08-14 6:28 ` [RFC PATCH v1 3/4] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 7:53 ` Barry Song 3 siblings, 1 reply; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar Added count_zswap_thp_swpout_vm_event() that will increment the appropriate mTHP/PMD vmstat event counters if zswap_store succeeds for a large folio: zswap_store mTHP order [0, HPAGE_PMD_ORDER-1] will increment these vmstat event counters: ZSWPOUT_4KB_FOLIO mTHP_ZSWPOUT_8kB mTHP_ZSWPOUT_16kB mTHP_ZSWPOUT_32kB mTHP_ZSWPOUT_64kB mTHP_ZSWPOUT_128kB mTHP_ZSWPOUT_256kB mTHP_ZSWPOUT_512kB mTHP_ZSWPOUT_1024kB zswap_store of a PMD-size THP, i.e., mTHP order HPAGE_PMD_ORDER, will increment both these vmstat event counters: ZSWPOUT_PMD_THP_FOLIO mTHP_ZSWPOUT_2048kB Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/page_io.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/mm/page_io.c b/mm/page_io.c index 0a150c240bf4..ab54d2060cc4 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -172,6 +172,49 @@ int generic_swapfile_activate(struct swap_info_struct *sis, goto out; } +/* + * Count vmstats for ZSWAP store of large folios (mTHP and PMD-size THP). 
+ */ +static inline void count_zswap_thp_swpout_vm_event(struct folio *folio) +{ + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && folio_test_pmd_mappable(folio)) { + count_vm_event(ZSWPOUT_PMD_THP_FOLIO); + count_vm_event(mTHP_ZSWPOUT_2048kB); + } else if (folio_order(folio) == 0) { + count_vm_event(ZSWPOUT_4KB_FOLIO); + } else if (IS_ENABLED(CONFIG_THP_SWAP)) { + switch (folio_order(folio)) { + case 1: + count_vm_event(mTHP_ZSWPOUT_8kB); + break; + case 2: + count_vm_event(mTHP_ZSWPOUT_16kB); + break; + case 3: + count_vm_event(mTHP_ZSWPOUT_32kB); + break; + case 4: + count_vm_event(mTHP_ZSWPOUT_64kB); + break; + case 5: + count_vm_event(mTHP_ZSWPOUT_128kB); + break; + case 6: + count_vm_event(mTHP_ZSWPOUT_256kB); + break; + case 7: + count_vm_event(mTHP_ZSWPOUT_512kB); + break; + case 8: + count_vm_event(mTHP_ZSWPOUT_1024kB); + break; + case 9: + count_vm_event(mTHP_ZSWPOUT_2048kB); + break; + } + } +} + /* * We may have stale swap cache pages in memory: notice * them here and get rid of the unnecessary final write. @@ -196,6 +239,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return ret; } if (zswap_store(folio)) { + count_zswap_thp_swpout_vm_event(folio); folio_start_writeback(folio); folio_unlock(folio); folio_end_writeback(folio); -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat. 2024-08-14 6:28 ` [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat Kanchana P Sridhar @ 2024-08-14 7:53 ` Barry Song 2024-08-14 17:47 ` Sridhar, Kanchana P 0 siblings, 1 reply; 11+ messages in thread From: Barry Song @ 2024-08-14 7:53 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added count_zswap_thp_swpout_vm_event() that will increment the > appropriate mTHP/PMD vmstat event counters if zswap_store succeeds for > a large folio: > > zswap_store mTHP order [0, HPAGE_PMD_ORDER-1] will increment these > vmstat event counters: > > ZSWPOUT_4KB_FOLIO > mTHP_ZSWPOUT_8kB > mTHP_ZSWPOUT_16kB > mTHP_ZSWPOUT_32kB > mTHP_ZSWPOUT_64kB > mTHP_ZSWPOUT_128kB > mTHP_ZSWPOUT_256kB > mTHP_ZSWPOUT_512kB > mTHP_ZSWPOUT_1024kB > > zswap_store of a PMD-size THP, i.e., mTHP order HPAGE_PMD_ORDER, will > increment both these vmstat event counters: > > ZSWPOUT_PMD_THP_FOLIO > mTHP_ZSWPOUT_2048kB > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/page_io.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 44 insertions(+) > > diff --git a/mm/page_io.c b/mm/page_io.c > index 0a150c240bf4..ab54d2060cc4 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -172,6 +172,49 @@ int generic_swapfile_activate(struct swap_info_struct *sis, > goto out; > } > > +/* > + * Count vmstats for ZSWAP store of large folios (mTHP and PMD-size THP). 
> + */ > +static inline void count_zswap_thp_swpout_vm_event(struct folio *folio) > +{ > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && folio_test_pmd_mappable(folio)) { > + count_vm_event(ZSWPOUT_PMD_THP_FOLIO); > + count_vm_event(mTHP_ZSWPOUT_2048kB); > + } else if (folio_order(folio) == 0) { > + count_vm_event(ZSWPOUT_4KB_FOLIO); > + } else if (IS_ENABLED(CONFIG_THP_SWAP)) { > + switch (folio_order(folio)) { > + case 1: > + count_vm_event(mTHP_ZSWPOUT_8kB); > + break; > + case 2: > + count_vm_event(mTHP_ZSWPOUT_16kB); > + break; > + case 3: > + count_vm_event(mTHP_ZSWPOUT_32kB); > + break; > + case 4: > + count_vm_event(mTHP_ZSWPOUT_64kB); > + break; > + case 5: > + count_vm_event(mTHP_ZSWPOUT_128kB); > + break; > + case 6: > + count_vm_event(mTHP_ZSWPOUT_256kB); > + break; > + case 7: > + count_vm_event(mTHP_ZSWPOUT_512kB); > + break; > + case 8: > + count_vm_event(mTHP_ZSWPOUT_1024kB); > + break; > + case 9: > + count_vm_event(mTHP_ZSWPOUT_2048kB); > + break; > + } The number of orders is PMD_ORDER, also ilog2(MAX_PTRS_PER_PTE) . PMD_ORDER isn't necessarily 9. It seems we need some general way to handle this and avoid so many duplicated case 1, case 2.... case 9. > + } > +} > + > /* > * We may have stale swap cache pages in memory: notice > * them here and get rid of the unnecessary final write. > @@ -196,6 +239,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) > return ret; > } > if (zswap_store(folio)) { > + count_zswap_thp_swpout_vm_event(folio); > folio_start_writeback(folio); > folio_unlock(folio); > folio_end_writeback(folio); > -- > 2.27.0 > Thanks Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat. 2024-08-14 7:53 ` Barry Song @ 2024-08-14 17:47 ` Sridhar, Kanchana P 0 siblings, 0 replies; 11+ messages in thread From: Sridhar, Kanchana P @ 2024-08-14 17:47 UTC (permalink / raw) To: Barry Song Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P Hi Barry, > -----Original Message----- > From: Barry Song <21cnbao@gmail.com> > Sent: Wednesday, August 14, 2024 12:53 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap > stores in vmstat. 
> > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added count_zswap_thp_swpout_vm_event() that will increment the > > appropriate mTHP/PMD vmstat event counters if zswap_store succeeds for > > a large folio: > > > > zswap_store mTHP order [0, HPAGE_PMD_ORDER-1] will increment these > > vmstat event counters: > > > > ZSWPOUT_4KB_FOLIO > > mTHP_ZSWPOUT_8kB > > mTHP_ZSWPOUT_16kB > > mTHP_ZSWPOUT_32kB > > mTHP_ZSWPOUT_64kB > > mTHP_ZSWPOUT_128kB > > mTHP_ZSWPOUT_256kB > > mTHP_ZSWPOUT_512kB > > mTHP_ZSWPOUT_1024kB > > > > zswap_store of a PMD-size THP, i.e., mTHP order HPAGE_PMD_ORDER, will > > increment both these vmstat event counters: > > > > ZSWPOUT_PMD_THP_FOLIO > > mTHP_ZSWPOUT_2048kB > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/page_io.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 44 insertions(+) > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > index 0a150c240bf4..ab54d2060cc4 100644 > > --- a/mm/page_io.c > > +++ b/mm/page_io.c > > @@ -172,6 +172,49 @@ int generic_swapfile_activate(struct > swap_info_struct *sis, > > goto out; > > } > > > > +/* > > + * Count vmstats for ZSWAP store of large folios (mTHP and PMD-size THP). 
> > + */ > > +static inline void count_zswap_thp_swpout_vm_event(struct folio *folio) > > +{ > > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && > folio_test_pmd_mappable(folio)) { > > + count_vm_event(ZSWPOUT_PMD_THP_FOLIO); > > + count_vm_event(mTHP_ZSWPOUT_2048kB); > > + } else if (folio_order(folio) == 0) { > > + count_vm_event(ZSWPOUT_4KB_FOLIO); > > + } else if (IS_ENABLED(CONFIG_THP_SWAP)) { > > + switch (folio_order(folio)) { > > + case 1: > > + count_vm_event(mTHP_ZSWPOUT_8kB); > > + break; > > + case 2: > > + count_vm_event(mTHP_ZSWPOUT_16kB); > > + break; > > + case 3: > > + count_vm_event(mTHP_ZSWPOUT_32kB); > > + break; > > + case 4: > > + count_vm_event(mTHP_ZSWPOUT_64kB); > > + break; > > + case 5: > > + count_vm_event(mTHP_ZSWPOUT_128kB); > > + break; > > + case 6: > > + count_vm_event(mTHP_ZSWPOUT_256kB); > > + break; > > + case 7: > > + count_vm_event(mTHP_ZSWPOUT_512kB); > > + break; > > + case 8: > > + count_vm_event(mTHP_ZSWPOUT_1024kB); > > + break; > > + case 9: > > + count_vm_event(mTHP_ZSWPOUT_2048kB); > > + break; > > + } > > The number of orders is PMD_ORDER, also ilog2(MAX_PTRS_PER_PTE) . > PMD_ORDER isn't necessarily 9. It seems we need some general way > to handle this and avoid so many duplicated case 1, case 2.... case 9. Thanks for this suggestion. The general way to do this appears to be simply calling count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) potentially with the addition of a new "MTHP_STAT_ZSWPOUT" to "enum mthp_stat_item". I will make this change in v2 accordingly. Thanks, Kanchana > > > + } > > +} > > + > > /* > > * We may have stale swap cache pages in memory: notice > > * them here and get rid of the unnecessary final write. 
> > @@ -196,6 +239,7 @@ int swap_writepage(struct page *page, struct > writeback_control *wbc) > > return ret; > > } > > if (zswap_store(folio)) { > > + count_zswap_thp_swpout_vm_event(folio); > > folio_start_writeback(folio); > > folio_unlock(folio); > > folio_end_writeback(folio); > > -- > > 2.27.0 > > > > Thanks > Barry ^ permalink raw reply [flat|nested] 11+ messages in thread