* [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios
@ 2024-08-14 6:28 Kanchana P Sridhar
2024-08-14 6:28 ` [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
ryan.roberts, ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This RFC patch series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
rebased to v6.10 in patch 3 of this series.
[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
Additionally, some of the functionality in zswap_store() has been
modularized, to make it more amenable to supporting mTHPs of any order.
For instance, the same-filled check now takes an index into the folio
to derive the page to be checked. Likewise, a helper
"zswap_store_entry()" is added to store a zswap_entry in the xarray.
For testing purposes, per-mTHP-size vmstat zswap_store event counters
are added, and incremented upon each successful zswap_store of an mTHP.
This patch series is a precursor to ZSWAP compress batching of mTHP
swap-out, and decompress batching of swap-in based on
swapin_readahead(), using Intel IAA hardware acceleration. We would
like to submit these in subsequent RFC patch series, along with
performance improvement data.
Performance Testing:
====================
Testing of this patch series was done with the v6.10 mainline, without
and with this RFC, on an Intel Sapphire Rapids server: dual-socket, 56
cores per socket, 4 IAA devices per socket.
The system has 503 GiB RAM and 176 GiB of swap/ZSWAP, with ZRAM as the
backing swap device. Core frequency was fixed at 2500 MHz.
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. Following a similar methodology as in Ryan Roberts'
"Swap-out mTHP without splitting" series [2], 70 usemem processes were
run, each allocating and writing 1G of memory:
usemem --init-time -w -O -n 70 1g
Other kernel configuration parameters:
ZSWAP Compressor : LZ4, DEFLATE-IAA
ZSWAP Allocator : ZSMALLOC
ZRAM Compressor : LZO-RLE
SWAP page-cluster : 2
In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. With verification, each IAA
compression is decompressed internally by the "iaa_crypto" driver; the
CRCs returned by the hardware are compared, and errors are reported on
mismatch. "deflate-iaa" thus helps ensure better data integrity than
the software compressors.
Throughput reported by usemem and perf sys time for running the test were
measured and averaged across 3 runs:
64KB mTHP:
==========
----------------------------------------------------------
| | | | |
|Kernel | mTHP SWAP-OUT | Throughput | Improvement|
| | | KB/s | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 111,180 | Baseline |
|zswap-mTHP-RFC | ZSWAP lz4 | 115,996 | 4% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 166,048 | 49% |
|----------------------------------------------------------|
| | | | |
|Kernel | mTHP SWAP-OUT | Sys time | Improvement|
| | | sec | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 1,049.69 | Baseline |
|zswap-mTHP RFC | ZSWAP lz4 | 1,178.20 | -12% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 626.12 | 40% |
----------------------------------------------------------
-------------------------------------------------------
| VMSTATS, mTHP ZSWAP stats, | v6.10 | zswap-mTHP |
| mTHP ZRAM stats: | mainline | RFC |
|-------------------------------------------------------|
| pswpin | 16 | 0 |
| pswpout | 7,823,984 | 0 |
| zswpin | 551 | 647 |
| zswpout | 1,410 | 15,175,113 |
|-------------------------------------------------------|
| thp_swpout | 0 | 0 |
| thp_swpout_fallback | 0 | 0 |
| pgmajfault | 2,189 | 2,241 |
|-------------------------------------------------------|
| zswpout_4kb_folio | | 1,497 |
| mthp_zswpout_64kb | | 948,351 |
|-------------------------------------------------------|
| hugepages-64kB/stats/swpout| 488,999 | 0 |
-------------------------------------------------------
2MB PMD-THP/2048K mTHP:
=======================
----------------------------------------------------------
| | | | |
|Kernel | mTHP SWAP-OUT | Throughput | Improvement|
| | | KB/s | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 136,617 | Baseline |
|zswap-mTHP-RFC | ZSWAP lz4 | 137,360 | 1% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 179,097 | 31% |
|----------------------------------------------------------|
| | | | |
|Kernel | mTHP SWAP-OUT | Sys time | Improvement|
| | | sec | |
|----------------|---------------|------------|------------|
|v6.10 mainline | ZRAM lzo-rle | 1,044.40 | Baseline |
|zswap-mTHP RFC | ZSWAP lz4 | 1,035.79 | 1% |
|zswap-mTHP-RFC | ZSWAP | | |
| | deflate-iaa | 571.31 | 45% |
----------------------------------------------------------
---------------------------------------------------------
| VMSTATS, mTHP ZSWAP stats, | v6.10 | zswap-mTHP |
| mTHP ZRAM stats: | mainline | RFC |
|---------------------------------------------------------|
| pswpin | 0 | 0 |
| pswpout | 8,630,272 | 0 |
| zswpin | 565 | 6,901 |
| zswpout | 1,388 | 15,379,163 |
|---------------------------------------------------------|
| thp_swpout | 16,856 | 0 |
| thp_swpout_fallback | 0 | 0 |
| pgmajfault | 2,184 | 8,532 |
|---------------------------------------------------------|
| zswpout_4kb_folio | | 5,851 |
| mthp_zswpout_2048kb | | 30,026 |
| zswpout_pmd_thp_folio | | 30,026 |
|---------------------------------------------------------|
| hugepages-2048kB/stats/swpout| 16,856 | 0 |
---------------------------------------------------------
As expected, the "Before" (v6.10 mainline) experiment shows relatively
fewer swapouts, because ZRAM utilization is not accounted to the
cgroup. With the introduction of mTHP zswap_store, the "After" data
reflects the higher swapout activity, and the consequent sys time
degradation.
Our goal is to improve ZSWAP mTHP store performance using batching. With
Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
additional RFC series), we are able to demonstrate significant
performance improvements with IAA as compared to software compressors.
For instance, with IAA-Canned compression [3] used with batching of
zswap_store and zswap_load operations, the usemem experiment's 3-run
average throughput improves to 170,461 KB/s (64KB mTHP) and
188,325 KB/s (2MB THP).
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
Kanchana P Sridhar (4):
mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
mm: vmstat: Per mTHP-size zswap_store vmstat event counters.
mm: zswap: zswap_store() extended to handle mTHP folios.
mm: page_io: Count successful mTHP zswap stores in vmstat.
include/linux/vm_event_item.h | 15 +++
mm/page_io.c | 44 +++++++
mm/vmstat.c | 15 +++
mm/zswap.c | 223 ++++++++++++++++++++++++----------
4 files changed, 233 insertions(+), 64 deletions(-)
--
2.27.0
^ permalink raw reply [flat|nested] 11+ messages in thread* [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters Kanchana P Sridhar ` (2 subsequent siblings) 3 siblings, 0 replies; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar This change is being made so that zswap_store can process mTHP folios. Modified zswap_is_folio_same_filled() to work for any-order folios, by accepting an additional "index" parameter to arrive at the page within the folio to run the same-filled page check. Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/zswap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index a50e2986cd2f..a6b0a7c636db 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1373,14 +1373,14 @@ static void shrink_worker(struct work_struct *w) /********************************* * same-filled functions **********************************/ -static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value) +static bool zswap_is_folio_same_filled(struct folio *folio, long index, unsigned long *value) { unsigned long *page; unsigned long val; unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1; bool ret = false; - page = kmap_local_folio(folio, 0); + page = kmap_local_folio(folio, index * PAGE_SIZE); val = page[0]; if (val != page[last_pos]) @@ -1450,7 +1450,7 @@ bool zswap_store(struct folio *folio) goto reject; } - if (zswap_is_folio_same_filled(folio, &value)) { + if (zswap_is_folio_same_filled(folio, 0, &value)) { entry->length = 0; 
entry->value = value; atomic_inc(&zswap_same_filled_pages); -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 7:48 ` Barry Song 2024-08-14 6:28 ` [RFC PATCH v1 3/4] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat Kanchana P Sridhar 3 siblings, 1 reply; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar Added vmstat event counters per mTHP-size that can be used to account for folios of different sizes being successfully stored in ZSWAP. For this RFC, it is not clear if these zswpout counters should instead be added as part of the existing mTHP stats in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. The following is also a viable option, should it make better sense, for instance, as: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. If so, we would be able to distinguish between mTHP zswap and non-zswap swapouts through: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout and /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout respectively. Comments would be appreciated as to which approach is preferable. 
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- include/linux/vm_event_item.h | 15 +++++++++++++++ mm/vmstat.c | 15 +++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 747943bc8cc2..2451bcfcf05c 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_ZERO_PAGE_ALLOC, THP_ZERO_PAGE_ALLOC_FAILED, THP_SWPOUT, +#ifdef CONFIG_ZSWAP + ZSWPOUT_PMD_THP_FOLIO, +#endif THP_SWPOUT_FALLBACK, #endif #ifdef CONFIG_MEMORY_BALLOON @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, ZSWPIN, ZSWPOUT, ZSWPWB, + ZSWPOUT_4KB_FOLIO, +#ifdef CONFIG_THP_SWAP + mTHP_ZSWPOUT_8kB, + mTHP_ZSWPOUT_16kB, + mTHP_ZSWPOUT_32kB, + mTHP_ZSWPOUT_64kB, + mTHP_ZSWPOUT_128kB, + mTHP_ZSWPOUT_256kB, + mTHP_ZSWPOUT_512kB, + mTHP_ZSWPOUT_1024kB, + mTHP_ZSWPOUT_2048kB, +#endif #endif #ifdef CONFIG_X86 DIRECT_MAP_LEVEL2_SPLIT, diff --git a/mm/vmstat.c b/mm/vmstat.c index 8507c497218b..0e66c8b0c486 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { "thp_zero_page_alloc", "thp_zero_page_alloc_failed", "thp_swpout", +#ifdef CONFIG_ZSWAP + "zswpout_pmd_thp_folio", +#endif "thp_swpout_fallback", #endif #ifdef CONFIG_MEMORY_BALLOON @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { "zswpin", "zswpout", "zswpwb", + "zswpout_4kb_folio", +#ifdef CONFIG_THP_SWAP + "mthp_zswpout_8kb", + "mthp_zswpout_16kb", + "mthp_zswpout_32kb", + "mthp_zswpout_64kb", + "mthp_zswpout_128kb", + "mthp_zswpout_256kb", + "mthp_zswpout_512kb", + "mthp_zswpout_1024kb", + "mthp_zswpout_2048kb", +#endif #endif #ifdef CONFIG_X86 "direct_map_level2_splits", -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 6:28 ` [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters Kanchana P Sridhar @ 2024-08-14 7:48 ` Barry Song 2024-08-14 17:40 ` Sridhar, Kanchana P 0 siblings, 1 reply; 11+ messages in thread From: Barry Song @ 2024-08-14 7:48 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added vmstat event counters per mTHP-size that can be used to account > for folios of different sizes being successfully stored in ZSWAP. > > For this RFC, it is not clear if these zswpout counters should instead > be added as part of the existing mTHP stats in > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > The following is also a viable option, should it make better sense, > for instance, as: > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. > > If so, we would be able to distinguish between mTHP zswap and > non-zswap swapouts through: > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > and > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > respectively. > > Comments would be appreciated as to which approach is preferable. Even though swapout might go through zswap, from the perspective of the mm core, it shouldn't be aware of that. Shouldn't zswpout be part of swpout? Why are they separate? no matter if a mTHP has been put in zswap, it has been swapped-out to mm-core? No? 
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > include/linux/vm_event_item.h | 15 +++++++++++++++ > mm/vmstat.c | 15 +++++++++++++++ > 2 files changed, 30 insertions(+) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 747943bc8cc2..2451bcfcf05c 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > THP_ZERO_PAGE_ALLOC, > THP_ZERO_PAGE_ALLOC_FAILED, > THP_SWPOUT, > +#ifdef CONFIG_ZSWAP > + ZSWPOUT_PMD_THP_FOLIO, > +#endif > THP_SWPOUT_FALLBACK, > #endif > #ifdef CONFIG_MEMORY_BALLOON > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > ZSWPIN, > ZSWPOUT, > ZSWPWB, > + ZSWPOUT_4KB_FOLIO, > +#ifdef CONFIG_THP_SWAP > + mTHP_ZSWPOUT_8kB, > + mTHP_ZSWPOUT_16kB, > + mTHP_ZSWPOUT_32kB, > + mTHP_ZSWPOUT_64kB, > + mTHP_ZSWPOUT_128kB, > + mTHP_ZSWPOUT_256kB, > + mTHP_ZSWPOUT_512kB, > + mTHP_ZSWPOUT_1024kB, > + mTHP_ZSWPOUT_2048kB, > +#endif This implementation hardcodes assumptions about the page size being 4KB, but page sizes can vary, and so can the THP orders? 
> #endif > #ifdef CONFIG_X86 > DIRECT_MAP_LEVEL2_SPLIT, > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 8507c497218b..0e66c8b0c486 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > "thp_zero_page_alloc", > "thp_zero_page_alloc_failed", > "thp_swpout", > +#ifdef CONFIG_ZSWAP > + "zswpout_pmd_thp_folio", > +#endif > "thp_swpout_fallback", > #endif > #ifdef CONFIG_MEMORY_BALLOON > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > "zswpin", > "zswpout", > "zswpwb", > + "zswpout_4kb_folio", > +#ifdef CONFIG_THP_SWAP > + "mthp_zswpout_8kb", > + "mthp_zswpout_16kb", > + "mthp_zswpout_32kb", > + "mthp_zswpout_64kb", > + "mthp_zswpout_128kb", > + "mthp_zswpout_256kb", > + "mthp_zswpout_512kb", > + "mthp_zswpout_1024kb", > + "mthp_zswpout_2048kb", > +#endif The issue here is that the number of THP orders can vary across different platforms. > #endif > #ifdef CONFIG_X86 > "direct_map_level2_splits", > -- > 2.27.0 > Thanks Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 7:48 ` Barry Song @ 2024-08-14 17:40 ` Sridhar, Kanchana P 2024-08-14 23:24 ` Barry Song 0 siblings, 1 reply; 11+ messages in thread From: Sridhar, Kanchana P @ 2024-08-14 17:40 UTC (permalink / raw) To: Barry Song Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P Hi Barry, > -----Original Message----- > From: Barry Song <21cnbao@gmail.com> > Sent: Wednesday, August 14, 2024 12:49 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > vmstat event counters. > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added vmstat event counters per mTHP-size that can be used to account > > for folios of different sizes being successfully stored in ZSWAP. > > > > For this RFC, it is not clear if these zswpout counters should instead > > be added as part of the existing mTHP stats in > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > > > The following is also a viable option, should it make better sense, > > for instance, as: > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. > > > > If so, we would be able to distinguish between mTHP zswap and > > non-zswap swapouts through: > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > and > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > > > respectively. 
> > > > Comments would be appreciated as to which approach is preferable. > > Even though swapout might go through zswap, from the perspective of > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part > of swpout? Why are they separate? no matter if a mTHP has been > put in zswap, it has been swapped-out to mm-core? No? Thanks for the code review comments. This is a good point. I was keeping in mind the convention used by existing vmstat event counters that distinguish zswpout/zswpin from pswpout/pswpin events. If we want to keep the distinction in mTHP swapouts, would adding a separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok? In any case, it looks like all that would be needed is a call to count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the general case. I will make this change in v2, depending on whether or not the separation of zswpout vs. non-zswap swpout is recommended for mTHP. > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > include/linux/vm_event_item.h | 15 +++++++++++++++ > > mm/vmstat.c | 15 +++++++++++++++ > > 2 files changed, 30 insertions(+) > > > > diff --git a/include/linux/vm_event_item.h > b/include/linux/vm_event_item.h > > index 747943bc8cc2..2451bcfcf05c 100644 > > --- a/include/linux/vm_event_item.h > > +++ b/include/linux/vm_event_item.h > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, > PSWPIN, PSWPOUT, > > THP_ZERO_PAGE_ALLOC, > > THP_ZERO_PAGE_ALLOC_FAILED, > > THP_SWPOUT, > > +#ifdef CONFIG_ZSWAP > > + ZSWPOUT_PMD_THP_FOLIO, > > +#endif > > THP_SWPOUT_FALLBACK, > > #endif > > #ifdef CONFIG_MEMORY_BALLOON > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, > PSWPIN, PSWPOUT, > > ZSWPIN, > > ZSWPOUT, > > ZSWPWB, > > + ZSWPOUT_4KB_FOLIO, > > +#ifdef CONFIG_THP_SWAP > > + mTHP_ZSWPOUT_8kB, > > + mTHP_ZSWPOUT_16kB, > > + mTHP_ZSWPOUT_32kB, > > + mTHP_ZSWPOUT_64kB, > > + mTHP_ZSWPOUT_128kB, > > + mTHP_ZSWPOUT_256kB, > > + mTHP_ZSWPOUT_512kB, > 
> + mTHP_ZSWPOUT_1024kB, > > + mTHP_ZSWPOUT_2048kB, > > +#endif > > This implementation hardcodes assumptions about the page size being 4KB, > but page sizes can vary, and so can the THP orders? Agreed, will address in v2. > > > #endif > > #ifdef CONFIG_X86 > > DIRECT_MAP_LEVEL2_SPLIT, > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index 8507c497218b..0e66c8b0c486 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > > "thp_zero_page_alloc", > > "thp_zero_page_alloc_failed", > > "thp_swpout", > > +#ifdef CONFIG_ZSWAP > > + "zswpout_pmd_thp_folio", > > +#endif > > "thp_swpout_fallback", > > #endif > > #ifdef CONFIG_MEMORY_BALLOON > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > > "zswpin", > > "zswpout", > > "zswpwb", > > + "zswpout_4kb_folio", > > +#ifdef CONFIG_THP_SWAP > > + "mthp_zswpout_8kb", > > + "mthp_zswpout_16kb", > > + "mthp_zswpout_32kb", > > + "mthp_zswpout_64kb", > > + "mthp_zswpout_128kb", > > + "mthp_zswpout_256kb", > > + "mthp_zswpout_512kb", > > + "mthp_zswpout_1024kb", > > + "mthp_zswpout_2048kb", > > +#endif > > The issue here is that the number of THP orders > can vary across different platforms. Agreed, will address in v2. Thanks, Kanchana > > > #endif > > #ifdef CONFIG_X86 > > "direct_map_level2_splits", > > -- > > 2.27.0 > > > > Thanks > Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 17:40 ` Sridhar, Kanchana P @ 2024-08-14 23:24 ` Barry Song 2024-08-15 1:37 ` Sridhar, Kanchana P 0 siblings, 1 reply; 11+ messages in thread From: Barry Song @ 2024-08-14 23:24 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Thu, Aug 15, 2024 at 5:40 AM Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> wrote: > > Hi Barry, > > > -----Original Message----- > > From: Barry Song <21cnbao@gmail.com> > > Sent: Wednesday, August 14, 2024 12:49 AM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > > vmstat event counters. > > > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > Added vmstat event counters per mTHP-size that can be used to account > > > for folios of different sizes being successfully stored in ZSWAP. > > > > > > For this RFC, it is not clear if these zswpout counters should instead > > > be added as part of the existing mTHP stats in > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > > > > > The following is also a viable option, should it make better sense, > > > for instance, as: > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. 
> > > > > > If so, we would be able to distinguish between mTHP zswap and > > > non-zswap swapouts through: > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > > > and > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > > > > > respectively. > > > > > > Comments would be appreciated as to which approach is preferable. > > > > Even though swapout might go through zswap, from the perspective of > > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part > > of swpout? Why are they separate? no matter if a mTHP has been > > put in zswap, it has been swapped-out to mm-core? No? > > Thanks for the code review comments. This is a good point. I was keeping in > mind the convention used by existing vmstat event counters that distinguish > zswpout/zswpin from pswpout/pswpin events. > > If we want to keep the distinction in mTHP swapouts, would adding a > separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok? > I'm not entirely sure how important the zswpout counter is. To me, it doesn't seem as critical as swpout and swpout_fallback, which are more useful for system profiling. zswapout feels more like an internal detail related to how the swap-out process is handled? If this is the case, we might not need this per-size counter. Otherwise, I believe sysfs is a better place to avoid all the chaos in vmstat to handle various orders and sizes. So the question is, per-size zswpout counter is really important or just for debugging purposes? > In any case, it looks like all that would be needed is a call to > count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the > general case. > > I will make this change in v2, depending on whether or not the > separation of zswpout vs. non-zswap swpout is recommended for > mTHP. 
> > > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > --- > > > include/linux/vm_event_item.h | 15 +++++++++++++++ > > > mm/vmstat.c | 15 +++++++++++++++ > > > 2 files changed, 30 insertions(+) > > > > > > diff --git a/include/linux/vm_event_item.h > > b/include/linux/vm_event_item.h > > > index 747943bc8cc2..2451bcfcf05c 100644 > > > --- a/include/linux/vm_event_item.h > > > +++ b/include/linux/vm_event_item.h > > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > PSWPIN, PSWPOUT, > > > THP_ZERO_PAGE_ALLOC, > > > THP_ZERO_PAGE_ALLOC_FAILED, > > > THP_SWPOUT, > > > +#ifdef CONFIG_ZSWAP > > > + ZSWPOUT_PMD_THP_FOLIO, > > > +#endif > > > THP_SWPOUT_FALLBACK, > > > #endif > > > #ifdef CONFIG_MEMORY_BALLOON > > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > PSWPIN, PSWPOUT, > > > ZSWPIN, > > > ZSWPOUT, > > > ZSWPWB, > > > + ZSWPOUT_4KB_FOLIO, > > > +#ifdef CONFIG_THP_SWAP > > > + mTHP_ZSWPOUT_8kB, > > > + mTHP_ZSWPOUT_16kB, > > > + mTHP_ZSWPOUT_32kB, > > > + mTHP_ZSWPOUT_64kB, > > > + mTHP_ZSWPOUT_128kB, > > > + mTHP_ZSWPOUT_256kB, > > > + mTHP_ZSWPOUT_512kB, > > > + mTHP_ZSWPOUT_1024kB, > > > + mTHP_ZSWPOUT_2048kB, > > > +#endif > > > > This implementation hardcodes assumptions about the page size being 4KB, > > but page sizes can vary, and so can the THP orders? > > Agreed, will address in v2. 
> > > > > > #endif > > > #ifdef CONFIG_X86 > > > DIRECT_MAP_LEVEL2_SPLIT, > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > index 8507c497218b..0e66c8b0c486 100644 > > > --- a/mm/vmstat.c > > > +++ b/mm/vmstat.c > > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > > > "thp_zero_page_alloc", > > > "thp_zero_page_alloc_failed", > > > "thp_swpout", > > > +#ifdef CONFIG_ZSWAP > > > + "zswpout_pmd_thp_folio", > > > +#endif > > > "thp_swpout_fallback", > > > #endif > > > #ifdef CONFIG_MEMORY_BALLOON > > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > > > "zswpin", > > > "zswpout", > > > "zswpwb", > > > + "zswpout_4kb_folio", > > > +#ifdef CONFIG_THP_SWAP > > > + "mthp_zswpout_8kb", > > > + "mthp_zswpout_16kb", > > > + "mthp_zswpout_32kb", > > > + "mthp_zswpout_64kb", > > > + "mthp_zswpout_128kb", > > > + "mthp_zswpout_256kb", > > > + "mthp_zswpout_512kb", > > > + "mthp_zswpout_1024kb", > > > + "mthp_zswpout_2048kb", > > > +#endif > > > > The issue here is that the number of THP orders > > can vary across different platforms. > > Agreed, will address in v2. > > Thanks, > Kanchana > > > > > > #endif > > > #ifdef CONFIG_X86 > > > "direct_map_level2_splits", > > > -- > > > 2.27.0 > > > Thanks Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters. 2024-08-14 23:24 ` Barry Song @ 2024-08-15 1:37 ` Sridhar, Kanchana P 0 siblings, 0 replies; 11+ messages in thread From: Sridhar, Kanchana P @ 2024-08-15 1:37 UTC (permalink / raw) To: Barry Song Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P Hi Barry, > -----Original Message----- > From: Barry Song <21cnbao@gmail.com> > Sent: Wednesday, August 14, 2024 4:25 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > vmstat event counters. > > On Thu, Aug 15, 2024 at 5:40 AM Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> wrote: > > > > Hi Barry, > > > > > -----Original Message----- > > > From: Barry Song <21cnbao@gmail.com> > > > Sent: Wednesday, August 14, 2024 12:49 AM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > > > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; > akpm@linux- > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store > > > vmstat event counters. 
> > > > > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > > > Added vmstat event counters per mTHP-size that can be used to account > > > > for folios of different sizes being successfully stored in ZSWAP. > > > > > > > > For this RFC, it is not clear if these zswpout counters should instead > > > > be added as part of the existing mTHP stats in > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats. > > > > > > > > The following is also a viable option, should it make better sense, > > > > for instance, as: > > > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout. > > > > > > > > If so, we would be able to distinguish between mTHP zswap and > > > > non-zswap swapouts through: > > > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > > > > > and > > > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout > > > > > > > > respectively. > > > > > > > > Comments would be appreciated as to which approach is preferable. > > > > > > Even though swapout might go through zswap, from the perspective of > > > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part > > > of swpout? Why are they separate? no matter if a mTHP has been > > > put in zswap, it has been swapped-out to mm-core? No? > > > > Thanks for the code review comments. This is a good point. I was keeping in > > mind the convention used by existing vmstat event counters that distinguish > > zswpout/zswpin from pswpout/pswpin events. > > > > If we want to keep the distinction in mTHP swapouts, would adding a > > separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok? > > > > I'm not entirely sure how important the zswpout counter is. To me, it doesn't > seem as critical as swpout and swpout_fallback, which are more useful for > system profiling. zswapout feels more like an internal detail related to > how the swap-out process is handled? 
If this is the case, we might not > need this per-size counter. > > Otherwise, I believe sysfs is a better place to avoid all the chaos in vmstat > to handle various orders and sizes. So the question is, per-size zswpout > counter is really important or just for debugging purposes? I agree, sysfs would be a cleaner mTHP stats accounting solution, given the existing mTHP swpout stats under the per-order /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout. I personally find distinct zswap vs. bdev/fs swapout accounting useful for debugging, and for overall reclaim path characterization. For instance, the impact of different zswap compressors' compress latency on zswpout activity for a given workload. Is a slowdown in compress latency causing active/hot memory to be reclaimed and immediately faulted in? Does better zswap compress efficiency correlate with more cold memory being reclaimed as mTHP? How does the reclaim path efficiency improvement from better zswap_store mTHP performance correlate with ZSWAP utilization and memory savings? I have found these counters useful in understanding some of these characteristics. I also believe it helps to account for the number of mTHP being stored in different compress tiers, for example, how many mTHP were stored in zswap vs. being rejected and stored in the backing swap device. This could help, say, in provisioning zswap memory, and in knowing the impact of zswap compress path latency on scaling. Another interesting characteristic that mTHP zswpout accounting could help understand is compressor incompressibility and/or zpool fragmentation, and being able to better correlate the zswap/reject_* sysfs counters with mTHP [z]swpout stats. I look forward to inputs from yourself and others on the direction and next steps. Thanks, Kanchana > > > In any case, it looks like all that would be needed is a call to > > count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the > > general case.
> > > > I will make this change in v2, depending on whether or not the > > separation of zswpout vs. non-zswap swpout is recommended for > > mTHP. > > > > > > > > > > > > > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > > > --- > > > > include/linux/vm_event_item.h | 15 +++++++++++++++ > > > > mm/vmstat.c | 15 +++++++++++++++ > > > > 2 files changed, 30 insertions(+) > > > > > > > > diff --git a/include/linux/vm_event_item.h > > > b/include/linux/vm_event_item.h > > > > index 747943bc8cc2..2451bcfcf05c 100644 > > > > --- a/include/linux/vm_event_item.h > > > > +++ b/include/linux/vm_event_item.h > > > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > > PSWPIN, PSWPOUT, > > > > THP_ZERO_PAGE_ALLOC, > > > > THP_ZERO_PAGE_ALLOC_FAILED, > > > > THP_SWPOUT, > > > > +#ifdef CONFIG_ZSWAP > > > > + ZSWPOUT_PMD_THP_FOLIO, > > > > +#endif > > > > THP_SWPOUT_FALLBACK, > > > > #endif > > > > #ifdef CONFIG_MEMORY_BALLOON > > > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, > > > PSWPIN, PSWPOUT, > > > > ZSWPIN, > > > > ZSWPOUT, > > > > ZSWPWB, > > > > + ZSWPOUT_4KB_FOLIO, > > > > +#ifdef CONFIG_THP_SWAP > > > > + mTHP_ZSWPOUT_8kB, > > > > + mTHP_ZSWPOUT_16kB, > > > > + mTHP_ZSWPOUT_32kB, > > > > + mTHP_ZSWPOUT_64kB, > > > > + mTHP_ZSWPOUT_128kB, > > > > + mTHP_ZSWPOUT_256kB, > > > > + mTHP_ZSWPOUT_512kB, > > > > + mTHP_ZSWPOUT_1024kB, > > > > + mTHP_ZSWPOUT_2048kB, > > > > +#endif > > > > > > This implementation hardcodes assumptions about the page size being > 4KB, > > > but page sizes can vary, and so can the THP orders? > > > > Agreed, will address in v2. 
> > > > > > > > > #endif > > > > #ifdef CONFIG_X86 > > > > DIRECT_MAP_LEVEL2_SPLIT, > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > > index 8507c497218b..0e66c8b0c486 100644 > > > > --- a/mm/vmstat.c > > > > +++ b/mm/vmstat.c > > > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = { > > > > "thp_zero_page_alloc", > > > > "thp_zero_page_alloc_failed", > > > > "thp_swpout", > > > > +#ifdef CONFIG_ZSWAP > > > > + "zswpout_pmd_thp_folio", > > > > +#endif > > > > "thp_swpout_fallback", > > > > #endif > > > > #ifdef CONFIG_MEMORY_BALLOON > > > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = { > > > > "zswpin", > > > > "zswpout", > > > > "zswpwb", > > > > + "zswpout_4kb_folio", > > > > +#ifdef CONFIG_THP_SWAP > > > > + "mthp_zswpout_8kb", > > > > + "mthp_zswpout_16kb", > > > > + "mthp_zswpout_32kb", > > > > + "mthp_zswpout_64kb", > > > > + "mthp_zswpout_128kb", > > > > + "mthp_zswpout_256kb", > > > > + "mthp_zswpout_512kb", > > > > + "mthp_zswpout_1024kb", > > > > + "mthp_zswpout_2048kb", > > > > +#endif > > > > > > The issue here is that the number of THP orders > > > can vary across different platforms. > > > > Agreed, will address in v2. > > > > Thanks, > > Kanchana > > > > > > > > > #endif > > > > #ifdef CONFIG_X86 > > > > "direct_map_level2_splits", > > > > -- > > > > 2.27.0 > > > > > > Thanks > Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH v1 3/4] mm: zswap: zswap_store() extended to handle mTHP folios. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 6:28 ` [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat Kanchana P Sridhar 3 siblings, 0 replies; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar zswap_store() will now process and store mTHP and PMD-size THP folios. This change reuses and adapts the functionality in Ryan Roberts' RFC patch [1]: "[RFC,v1] mm: zswap: Store large folios without splitting" [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u This patch provides a sequential implementation of storing an mTHP in zswap_store() by iterating through each page in the folio to compress and store it in the zswap zpool. Towards this goal, zswap_compress() is modified to take a page instead of a folio as input. Each page's swap offset is stored as a separate zswap entry. If an error is encountered during the store of any page in the mTHP, all previous pages/entries stored will be invalidated. Thus, an mTHP is either entirely stored in ZSWAP, or entirely not stored in ZSWAP. This forms the basis for building batching of pages during zswap store of large folios, by compressing batches of up to say, 8 pages in an mTHP in parallel in hardware, with the Intel In-Memory Analytics Accelerator (Intel IAA). 
Co-developed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/zswap.c | 219 ++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 157 insertions(+), 62 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index a6b0a7c636db..98ff98b485f5 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -899,7 +899,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node) return 0; } -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry) +static bool zswap_compress(struct page *page, struct zswap_entry *entry) { struct crypto_acomp_ctx *acomp_ctx; struct scatterlist input, output; @@ -917,7 +917,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry) dst = acomp_ctx->buffer; sg_init_table(&input, 1); - sg_set_page(&input, &folio->page, PAGE_SIZE, 0); + sg_set_page(&input, page, PAGE_SIZE, 0); /* * We need PAGE_SIZE * 2 here since there maybe over-compression case, @@ -1409,36 +1409,82 @@ static void zswap_fill_page(void *ptr, unsigned long value) /********************************* * main API **********************************/ -bool zswap_store(struct folio *folio) + +/* + * Returns true if the entry was successfully + * stored in the xarray, and false otherwise. 
+ */ +static bool zswap_store_entry(struct xarray *tree, + struct zswap_entry *entry) { - swp_entry_t swp = folio->swap; - pgoff_t offset = swp_offset(swp); - struct xarray *tree = swap_zswap_tree(swp); - struct zswap_entry *entry, *old; - struct obj_cgroup *objcg = NULL; - struct mem_cgroup *memcg = NULL; - unsigned long value; + struct zswap_entry *old; + pgoff_t offset = swp_offset(entry->swpentry); - VM_WARN_ON_ONCE(!folio_test_locked(folio)); - VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); + old = xa_store(tree, offset, entry, GFP_KERNEL); - /* Large folios aren't supported */ - if (folio_test_large(folio)) + if (xa_is_err(old)) { + int err = xa_err(old); + + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err); + zswap_reject_alloc_fail++; return false; + } - if (!zswap_enabled) - goto check_old; + /* + * We may have had an existing entry that became stale when + * the folio was redirtied and now the new version is being + * swapped out. Get rid of the old. + */ + if (old) + zswap_entry_free(old); - /* Check cgroup limits */ - objcg = get_obj_cgroup_from_folio(folio); - if (objcg && !obj_cgroup_may_zswap(objcg)) { - memcg = get_mem_cgroup_from_objcg(objcg); - if (shrink_memcg(memcg)) { - mem_cgroup_put(memcg); - goto reject; - } - mem_cgroup_put(memcg); + return true; +} + +/* + * If the zswap store fails or zswap is disabled, we must invalidate the + * possibly stale entry which was previously stored at this offset. + * Otherwise, writeback could overwrite the new data in the swapfile. + * + * This is called after the store of the i-th offset + * in a large folio, has failed. All entries from + * [i-1 .. 0] must be deleted. + * + * This is also called if zswap_store() is called, + * but zswap is not enabled. All offsets for the folio + * are deleted from zswap in this case. 
+ */ +static void zswap_delete_stored_offsets(struct xarray *tree, + pgoff_t offset, + long nr_pages) +{ + struct zswap_entry *entry; + long i; + + for (i = 0; i < nr_pages; ++i) { + entry = xa_erase(tree, offset + i); + if (entry) + zswap_entry_free(entry); } +} + +/* + * Stores the page at specified "index" in a folio. + */ +static bool zswap_store_page(struct folio *folio, long index, + struct obj_cgroup *objcg, + struct zswap_pool *pool) +{ + swp_entry_t swp = folio->swap; + int type = swp_type(swp); + pgoff_t offset = swp_offset(swp) + index; + struct page *page = folio_page(folio, index); + struct xarray *tree = swap_zswap_tree(swp); + struct zswap_entry *entry; + unsigned long value; + + if (objcg) + obj_cgroup_get(objcg); if (zswap_check_limits()) goto reject; @@ -1450,7 +1496,7 @@ bool zswap_store(struct folio *folio) goto reject; } - if (zswap_is_folio_same_filled(folio, 0, &value)) { + if (zswap_is_folio_same_filled(folio, index, &value)) { entry->length = 0; entry->value = value; atomic_inc(&zswap_same_filled_pages); @@ -1458,42 +1504,20 @@ bool zswap_store(struct folio *folio) } /* if entry is successfully added, it keeps the reference */ - entry->pool = zswap_pool_current_get(); - if (!entry->pool) + if (!zswap_pool_get(pool)) goto freepage; - if (objcg) { - memcg = get_mem_cgroup_from_objcg(objcg); - if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) { - mem_cgroup_put(memcg); - goto put_pool; - } - mem_cgroup_put(memcg); - } + entry->pool = pool; - if (!zswap_compress(folio, entry)) + if (!zswap_compress(page, entry)) goto put_pool; store_entry: - entry->swpentry = swp; + entry->swpentry = swp_entry(type, offset); entry->objcg = objcg; - old = xa_store(tree, offset, entry, GFP_KERNEL); - if (xa_is_err(old)) { - int err = xa_err(old); - - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err); - zswap_reject_alloc_fail++; + if (!zswap_store_entry(tree, entry)) goto store_failed; - } - - /* - * We may have had an existing entry 
that became stale when - * the folio was redirtied and now the new version is being - * swapped out. Get rid of the old. - */ - if (old) - zswap_entry_free(old); if (objcg) { obj_cgroup_charge_zswap(objcg, entry->length); @@ -1527,7 +1551,7 @@ bool zswap_store(struct folio *folio) else { zpool_free(zswap_find_zpool(entry), entry->handle); put_pool: - zswap_pool_put(entry->pool); + zswap_pool_put(pool); } freepage: zswap_entry_cache_free(entry); @@ -1535,16 +1559,87 @@ bool zswap_store(struct folio *folio) obj_cgroup_put(objcg); if (zswap_pool_reached_full) queue_work(shrink_wq, &zswap_shrink_work); -check_old: + + return false; +} + +/* + * Modified to store mTHP folios. + * Each page in the mTHP will be compressed + * and stored sequentially. + */ +bool zswap_store(struct folio *folio) +{ + long nr_pages = folio_nr_pages(folio); + swp_entry_t swp = folio->swap; + pgoff_t offset = swp_offset(swp); + struct xarray *tree = swap_zswap_tree(swp); + struct obj_cgroup *objcg = NULL; + struct mem_cgroup *memcg = NULL; + struct zswap_pool *pool; + bool ret = false; + long index; + + VM_WARN_ON_ONCE(!folio_test_locked(folio)); + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); + /* - * If the zswap store fails or zswap is disabled, we must invalidate the - * possibly stale entry which was previously stored at this offset. - * Otherwise, writeback could overwrite the new data in the swapfile. + * If zswap is disabled, we must invalidate the possibly stale entry + * which was previously stored at this offset. Otherwise, writeback + * could overwrite the new data in the swapfile. 
*/ - entry = xa_erase(tree, offset); - if (entry) - zswap_entry_free(entry); - return false; + if (!zswap_enabled) + goto reject; + + /* Check cgroup limits */ + objcg = get_obj_cgroup_from_folio(folio); + if (objcg && !obj_cgroup_may_zswap(objcg)) { + memcg = get_mem_cgroup_from_objcg(objcg); + if (shrink_memcg(memcg)) { + mem_cgroup_put(memcg); + goto put_objcg; + } + mem_cgroup_put(memcg); + } + + if (zswap_check_limits()) + goto put_objcg; + + pool = zswap_pool_current_get(); + if (!pool) + goto put_objcg; + + if (objcg) { + memcg = get_mem_cgroup_from_objcg(objcg); + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) { + mem_cgroup_put(memcg); + goto put_pool; + } + mem_cgroup_put(memcg); + } + + /* + * Store each page of the folio as a separate entry. If we fail to store + * a page, unwind by removing all the previous pages we stored. + */ + for (index = 0; index < nr_pages; ++index) { + if (!zswap_store_page(folio, index, objcg, pool)) + goto put_pool; + } + + ret = true; + +put_pool: + zswap_pool_put(pool); +put_objcg: + obj_cgroup_put(objcg); + if (zswap_pool_reached_full) + queue_work(shrink_wq, &zswap_shrink_work); +reject: + if (!ret) + zswap_delete_stored_offsets(tree, offset, nr_pages); + + return ret; } bool zswap_load(struct folio *folio) -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat. 2024-08-14 6:28 [RFC PATCH v1 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (2 preceding siblings ...) 2024-08-14 6:28 ` [RFC PATCH v1 3/4] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar @ 2024-08-14 6:28 ` Kanchana P Sridhar 2024-08-14 7:53 ` Barry Song 3 siblings, 1 reply; 11+ messages in thread From: Kanchana P Sridhar @ 2024-08-14 6:28 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar Added count_zswap_thp_swpout_vm_event() that will increment the appropriate mTHP/PMD vmstat event counters if zswap_store succeeds for a large folio: zswap_store mTHP order [0, HPAGE_PMD_ORDER-1] will increment these vmstat event counters: ZSWPOUT_4KB_FOLIO mTHP_ZSWPOUT_8kB mTHP_ZSWPOUT_16kB mTHP_ZSWPOUT_32kB mTHP_ZSWPOUT_64kB mTHP_ZSWPOUT_128kB mTHP_ZSWPOUT_256kB mTHP_ZSWPOUT_512kB mTHP_ZSWPOUT_1024kB zswap_store of a PMD-size THP, i.e., mTHP order HPAGE_PMD_ORDER, will increment both these vmstat event counters: ZSWPOUT_PMD_THP_FOLIO mTHP_ZSWPOUT_2048kB Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/page_io.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/mm/page_io.c b/mm/page_io.c index 0a150c240bf4..ab54d2060cc4 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -172,6 +172,49 @@ int generic_swapfile_activate(struct swap_info_struct *sis, goto out; } +/* + * Count vmstats for ZSWAP store of large folios (mTHP and PMD-size THP). 
+ */ +static inline void count_zswap_thp_swpout_vm_event(struct folio *folio) +{ + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && folio_test_pmd_mappable(folio)) { + count_vm_event(ZSWPOUT_PMD_THP_FOLIO); + count_vm_event(mTHP_ZSWPOUT_2048kB); + } else if (folio_order(folio) == 0) { + count_vm_event(ZSWPOUT_4KB_FOLIO); + } else if (IS_ENABLED(CONFIG_THP_SWAP)) { + switch (folio_order(folio)) { + case 1: + count_vm_event(mTHP_ZSWPOUT_8kB); + break; + case 2: + count_vm_event(mTHP_ZSWPOUT_16kB); + break; + case 3: + count_vm_event(mTHP_ZSWPOUT_32kB); + break; + case 4: + count_vm_event(mTHP_ZSWPOUT_64kB); + break; + case 5: + count_vm_event(mTHP_ZSWPOUT_128kB); + break; + case 6: + count_vm_event(mTHP_ZSWPOUT_256kB); + break; + case 7: + count_vm_event(mTHP_ZSWPOUT_512kB); + break; + case 8: + count_vm_event(mTHP_ZSWPOUT_1024kB); + break; + case 9: + count_vm_event(mTHP_ZSWPOUT_2048kB); + break; + } + } +} + /* * We may have stale swap cache pages in memory: notice * them here and get rid of the unnecessary final write. @@ -196,6 +239,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return ret; } if (zswap_store(folio)) { + count_zswap_thp_swpout_vm_event(folio); folio_start_writeback(folio); folio_unlock(folio); folio_end_writeback(folio); -- 2.27.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat. 2024-08-14 6:28 ` [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat Kanchana P Sridhar @ 2024-08-14 7:53 ` Barry Song 2024-08-14 17:47 ` Sridhar, Kanchana P 0 siblings, 1 reply; 11+ messages in thread From: Barry Song @ 2024-08-14 7:53 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, ying.huang, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added count_zswap_thp_swpout_vm_event() that will increment the > appropriate mTHP/PMD vmstat event counters if zswap_store succeeds for > a large folio: > > zswap_store mTHP order [0, HPAGE_PMD_ORDER-1] will increment these > vmstat event counters: > > ZSWPOUT_4KB_FOLIO > mTHP_ZSWPOUT_8kB > mTHP_ZSWPOUT_16kB > mTHP_ZSWPOUT_32kB > mTHP_ZSWPOUT_64kB > mTHP_ZSWPOUT_128kB > mTHP_ZSWPOUT_256kB > mTHP_ZSWPOUT_512kB > mTHP_ZSWPOUT_1024kB > > zswap_store of a PMD-size THP, i.e., mTHP order HPAGE_PMD_ORDER, will > increment both these vmstat event counters: > > ZSWPOUT_PMD_THP_FOLIO > mTHP_ZSWPOUT_2048kB > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/page_io.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 44 insertions(+) > > diff --git a/mm/page_io.c b/mm/page_io.c > index 0a150c240bf4..ab54d2060cc4 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -172,6 +172,49 @@ int generic_swapfile_activate(struct swap_info_struct *sis, > goto out; > } > > +/* > + * Count vmstats for ZSWAP store of large folios (mTHP and PMD-size THP). 
> + */ > +static inline void count_zswap_thp_swpout_vm_event(struct folio *folio) > +{ > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && folio_test_pmd_mappable(folio)) { > + count_vm_event(ZSWPOUT_PMD_THP_FOLIO); > + count_vm_event(mTHP_ZSWPOUT_2048kB); > + } else if (folio_order(folio) == 0) { > + count_vm_event(ZSWPOUT_4KB_FOLIO); > + } else if (IS_ENABLED(CONFIG_THP_SWAP)) { > + switch (folio_order(folio)) { > + case 1: > + count_vm_event(mTHP_ZSWPOUT_8kB); > + break; > + case 2: > + count_vm_event(mTHP_ZSWPOUT_16kB); > + break; > + case 3: > + count_vm_event(mTHP_ZSWPOUT_32kB); > + break; > + case 4: > + count_vm_event(mTHP_ZSWPOUT_64kB); > + break; > + case 5: > + count_vm_event(mTHP_ZSWPOUT_128kB); > + break; > + case 6: > + count_vm_event(mTHP_ZSWPOUT_256kB); > + break; > + case 7: > + count_vm_event(mTHP_ZSWPOUT_512kB); > + break; > + case 8: > + count_vm_event(mTHP_ZSWPOUT_1024kB); > + break; > + case 9: > + count_vm_event(mTHP_ZSWPOUT_2048kB); > + break; > + } The number of orders is PMD_ORDER, also ilog2(MAX_PTRS_PER_PTE) . PMD_ORDER isn't necessarily 9. It seems we need some general way to handle this and avoid so many duplicated case 1, case 2.... case 9. > + } > +} > + > /* > * We may have stale swap cache pages in memory: notice > * them here and get rid of the unnecessary final write. > @@ -196,6 +239,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) > return ret; > } > if (zswap_store(folio)) { > + count_zswap_thp_swpout_vm_event(folio); > folio_start_writeback(folio); > folio_unlock(folio); > folio_end_writeback(folio); > -- > 2.27.0 > Thanks Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap stores in vmstat. 2024-08-14 7:53 ` Barry Song @ 2024-08-14 17:47 ` Sridhar, Kanchana P 0 siblings, 0 replies; 11+ messages in thread From: Sridhar, Kanchana P @ 2024-08-14 17:47 UTC (permalink / raw) To: Barry Song Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, ryan.roberts, Huang, Ying, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P Hi Barry, > -----Original Message----- > From: Barry Song <21cnbao@gmail.com> > Sent: Wednesday, August 14, 2024 12:53 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [RFC PATCH v1 4/4] mm: page_io: Count successful mTHP zswap > stores in vmstat. 
> > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added count_zswap_thp_swpout_vm_event() that will increment the > > appropriate mTHP/PMD vmstat event counters if zswap_store succeeds for > > a large folio: > > > > zswap_store mTHP order [0, HPAGE_PMD_ORDER-1] will increment these > > vmstat event counters: > > > > ZSWPOUT_4KB_FOLIO > > mTHP_ZSWPOUT_8kB > > mTHP_ZSWPOUT_16kB > > mTHP_ZSWPOUT_32kB > > mTHP_ZSWPOUT_64kB > > mTHP_ZSWPOUT_128kB > > mTHP_ZSWPOUT_256kB > > mTHP_ZSWPOUT_512kB > > mTHP_ZSWPOUT_1024kB > > > > zswap_store of a PMD-size THP, i.e., mTHP order HPAGE_PMD_ORDER, will > > increment both these vmstat event counters: > > > > ZSWPOUT_PMD_THP_FOLIO > > mTHP_ZSWPOUT_2048kB > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/page_io.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 44 insertions(+) > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > index 0a150c240bf4..ab54d2060cc4 100644 > > --- a/mm/page_io.c > > +++ b/mm/page_io.c > > @@ -172,6 +172,49 @@ int generic_swapfile_activate(struct > swap_info_struct *sis, > > goto out; > > } > > > > +/* > > + * Count vmstats for ZSWAP store of large folios (mTHP and PMD-size THP). 
> > + */ > > +static inline void count_zswap_thp_swpout_vm_event(struct folio *folio) > > +{ > > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && > folio_test_pmd_mappable(folio)) { > > + count_vm_event(ZSWPOUT_PMD_THP_FOLIO); > > + count_vm_event(mTHP_ZSWPOUT_2048kB); > > + } else if (folio_order(folio) == 0) { > > + count_vm_event(ZSWPOUT_4KB_FOLIO); > > + } else if (IS_ENABLED(CONFIG_THP_SWAP)) { > > + switch (folio_order(folio)) { > > + case 1: > > + count_vm_event(mTHP_ZSWPOUT_8kB); > > + break; > > + case 2: > > + count_vm_event(mTHP_ZSWPOUT_16kB); > > + break; > > + case 3: > > + count_vm_event(mTHP_ZSWPOUT_32kB); > > + break; > > + case 4: > > + count_vm_event(mTHP_ZSWPOUT_64kB); > > + break; > > + case 5: > > + count_vm_event(mTHP_ZSWPOUT_128kB); > > + break; > > + case 6: > > + count_vm_event(mTHP_ZSWPOUT_256kB); > > + break; > > + case 7: > > + count_vm_event(mTHP_ZSWPOUT_512kB); > > + break; > > + case 8: > > + count_vm_event(mTHP_ZSWPOUT_1024kB); > > + break; > > + case 9: > > + count_vm_event(mTHP_ZSWPOUT_2048kB); > > + break; > > + } > > The number of orders is PMD_ORDER, also ilog2(MAX_PTRS_PER_PTE) . > PMD_ORDER isn't necessarily 9. It seems we need some general way > to handle this and avoid so many duplicated case 1, case 2.... case 9. Thanks for this suggestion. The general way to do this appears to be simply calling count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) potentially with the addition of a new "MTHP_STAT_ZSWPOUT" to "enum mthp_stat_item". I will make this change in v2 accordingly. Thanks, Kanchana > > > + } > > +} > > + > > /* > > * We may have stale swap cache pages in memory: notice > > * them here and get rid of the unnecessary final write. 
> > @@ -196,6 +239,7 @@ int swap_writepage(struct page *page, struct > writeback_control *wbc) > > return ret; > > } > > if (zswap_store(folio)) { > > + count_zswap_thp_swpout_vm_event(folio); > > folio_start_writeback(folio); > > folio_unlock(folio); > > folio_end_writeback(folio); > > -- > > 2.27.0 > > > > Thanks > Barry ^ permalink raw reply [flat|nested] 11+ messages in thread