* [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
@ 2024-11-21 22:25 Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: Barry Song @ 2024-11-21 22:25 UTC (permalink / raw)
To: akpm, linux-mm
Cc: axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
From: Barry Song <v-songbaohua@oppo.com>
When large folios are compressed at a larger granularity, we observe
a notable reduction in CPU usage and a significant improvement in
compression ratios.
mTHP's ability to be swapped out without splitting and swapped back in
as a whole allows compression and decompression at larger granularities.
This patchset enhances zsmalloc and zram by adding support for dividing
large folios into multi-page blocks, typically configured with an
order-2 granularity. Without this patchset, a large folio is always
divided into `nr_pages` 4KiB blocks.
The granularity is controlled by the `ZSMALLOC_MULTI_PAGES_ORDER`
option, whose default of 2 allows all anonymous mTHP sizes to benefit.
Examples include:
* A 16KiB large folio will be compressed and stored as a single 16KiB
block.
* A 64KiB large folio will be compressed and stored as four 16KiB
blocks.
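As a rough illustration of the arithmetic, here is a stand-alone sketch
(not part of the patchset; the 4KiB PAGE_SIZE and the order value are
assumptions mirroring the defaults above) of how many compression blocks
a folio of a given order maps to:

	/* Illustrative sketch only -- not part of the patchset. */
	#include <stdio.h>

	#define PAGE_SIZE		4096UL
	#define MULTI_PAGES_ORDER	2UL	/* ZSMALLOC_MULTI_PAGES_ORDER */
	#define MULTI_PAGES_SIZE	(PAGE_SIZE << MULTI_PAGES_ORDER)	/* 16KiB */

	/* How many compression blocks a folio of the given order is split into. */
	static unsigned long nr_compress_blocks(unsigned int folio_order)
	{
		unsigned long folio_size = PAGE_SIZE << folio_order;

		/* Folios smaller than one block keep the old per-page granularity. */
		if (folio_size < MULTI_PAGES_SIZE)
			return folio_size / PAGE_SIZE;

		return folio_size / MULTI_PAGES_SIZE;
	}

	int main(void)
	{
		/* 16KiB (order-2) folio -> 1 block, 64KiB (order-4) folio -> 4 blocks */
		printf("%lu %lu\n", nr_compress_blocks(2), nr_compress_blocks(4));
		return 0;
	}

With the default order of 2, any mTHP of 16KiB or larger is compressed
in 16KiB blocks, while smaller folios keep the existing per-page
behaviour.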
For example, swapping out and swapping in 100MiB of typical anonymous
data 100 times (with 16KiB mTHP enabled) using zstd yields the following
results:

                       w/o patches    w/ patches
swap-out time(ms)      68711          49908
swap-in time(ms)       30687          20685
compression ratio      20.49%         16.9%
I deliberately created a test case with intense swap thrashing. On my
Intel i9 10-core, 20-thread PC, I imposed a 1GB memory limit on a memcg
to compile the Linux kernel, intending to amplify swap activity and
analyze its impact on system time. Using the ZSTD algorithm, my test
script, which builds the kernel for five rounds, is as follows:
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"
read_values() {
pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout"
}
for ((i=1; i<=5; i++))
do
echo
echo "*** Executing round $i ***"
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
echo 3 > /proc/sys/vm/drop_caches
#kernel build
initial_values=($(read_values))
time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j20 1>/dev/null 2>/dev/null
final_values=($(read_values))
echo "pswpin: $((final_values[0] - initial_values[0]))"
echo "pswpout: $((final_values[1] - initial_values[1]))"
echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
echo "pgpgin: $((final_values[8] - initial_values[8]))"
echo "pgpgout: $((final_values[9] - initial_values[9]))"
done
****************** Test results
******* Without the patchset:
*** Executing round 1 ***
real 7m56.173s
user 81m29.401s
sys 42m57.470s
pswpin: 29815871
pswpout: 50548760
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11206086
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6596517
pgpgin: 146093656
pgpgout: 211024708
*** Executing round 2 ***
real 7m48.227s
user 81m20.558s
sys 43m0.940s
pswpin: 29798189
pswpout: 50882005
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11286587
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6596103
pgpgin: 146841468
pgpgout: 212374760
*** Executing round 3 ***
real 7m56.664s
user 81m10.936s
sys 43m5.991s
pswpin: 29760702
pswpout: 51230330
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11363346
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6586263
pgpgin: 145374744
pgpgout: 213355600
*** Executing round 4 ***
real 8m29.115s
user 81m18.955s
sys 42m49.050s
pswpin: 29651724
pswpout: 50631678
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11249036
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6583515
pgpgin: 145819060
pgpgout: 211373768
*** Executing round 5 ***
real 7m46.124s
user 80m29.780s
sys 41m37.005s
pswpin: 28805646
pswpout: 49570858
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11010873
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6391598
pgpgin: 142354376
pgpgout: 20713566
******* With the patchset:
*** Executing round 1 ***
real 7m43.760s
user 80m35.185s
sys 35m50.685s
pswpin: 29870407
pswpout: 50101263
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11140509
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6838090
pgpgin: 146500224
pgpgout: 209218896
*** Executing round 2 ***
real 7m31.820s
user 81m39.787s
sys 37m24.341s
pswpin: 31100304
pswpout: 51666202
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11471841
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 7106112
pgpgin: 151763112
pgpgout: 215526464
*** Executing round 3 ***
real 7m35.732s
user 79m36.028s
sys 34m4.190s
pswpin: 28357528
pswpout: 47716236
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10619547
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6500899
pgpgin: 139903688
pgpgout: 199715908
*** Executing round 4 ***
real 7m38.242s
user 80m50.768s
sys 35m54.201s
pswpin: 29752937
pswpout: 49977585
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11117552
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6815571
pgpgin: 146293900
pgpgout: 208755500
*** Executing round 5 ***
real 8m2.692s
user 81m40.159s
sys 37m11.361s
pswpin: 30813683
pswpout: 51687672
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11481684
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 7044988
pgpgin: 150231840
pgpgout: 215616760
Although the real time fluctuated significantly on my PC, the sys time
clearly decreased across all five rounds, from over 40 minutes without
the patchset to roughly 34-37 minutes with it.
-v3:
* Added a patch to fall back to four smaller folios to avoid partial reads.
  Discussed this option with Usama, Ying, and Nhat in v2. Not entirely sure
  it will be well-received, but I've done my best to minimize the complexity
  added to do_swap_page().
* Added a patch to adjust the zstd backend's estimated_src_size;
* Addressed one VM_WARN_ON in patch 1 for PageMovable();
-v2:
https://lore.kernel.org/linux-mm/20241107101005.69121-1-21cnbao@gmail.com/
While it is not mature yet, I know some people are waiting for
an update :-)
* Fixed some stability issues.
* Rebased against the latest mm-unstable.
* Set the default order to 2, which benefits all anonymous mTHP.
* Multi-page ZsPageMovable is not supported yet.
Barry Song (2):
zram: backend_zstd: Adjust estimated_src_size to accommodate
multi-page compression
mm: fall back to four small folios if mTHP allocation fails
Tangquan Zheng (2):
mm: zsmalloc: support objects compressed based on multiple pages
zram: support compression at the granularity of multi-pages
drivers/block/zram/Kconfig | 9 +
drivers/block/zram/backend_zstd.c | 6 +-
drivers/block/zram/zcomp.c | 17 +-
drivers/block/zram/zcomp.h | 12 +-
drivers/block/zram/zram_drv.c | 450 ++++++++++++++++++++++++++++--
drivers/block/zram/zram_drv.h | 45 +++
include/linux/zsmalloc.h | 10 +-
mm/Kconfig | 18 ++
mm/memory.c | 203 +++++++++++++-
mm/zsmalloc.c | 235 ++++++++++++----
10 files changed, 896 insertions(+), 109 deletions(-)
--
2.39.3 (Apple Git-146)
* [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages
2024-11-21 22:25 [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
@ 2024-11-21 22:25 ` Barry Song
2024-11-26 5:37 ` Sergey Senozhatsky
2024-11-21 22:25 ` [PATCH RFC v3 2/4] zram: support compression at the granularity of multi-pages Barry Song
` (3 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-11-21 22:25 UTC (permalink / raw)
To: akpm, linux-mm
Cc: axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
From: Tangquan Zheng <zhengtangquan@oppo.com>
This patch adds support for zsmalloc to store objects compressed at
multi-page granularity. Previously, a large folio with nr_pages
subpages was compressed one subpage at a time, each at PAGE_SIZE
granularity. Compressing at a larger granularity conserves both memory
and CPU resources.
We define the granularity with a configuration option called
ZSMALLOC_MULTI_PAGES_ORDER, which defaults to 2 to match the minimum
order of anonymous mTHP. As a result, a large folio with 8 subpages is
now split into 2 parts instead of 8.
Supporting these multi-page objects requires introducing new size
classes to accommodate them.
Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
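A simplified userspace model of the resulting two-band class lookup may
help picture the layout (illustration only, not the kernel code;
ZS_HANDLE_SIZE = 8 and the 4KiB PAGE_SIZE are assumptions, while
CLASS_BITS and the class deltas mirror the diff below):

	/* Simplified model of the new size-class lookup, for illustration only. */
	#include <stdio.h>

	#define PAGE_SIZE		4096UL
	#define ZS_MIN_ALLOC_SIZE	32UL
	#define ZS_HANDLE_SIZE		8UL	/* assumption for the sketch */
	#define CLASS_BITS		9
	#define MULTI_PAGES_SIZE	(PAGE_SIZE << 2)

	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
	#define PAGE_CLASS_DELTA	(PAGE_SIZE >> (CLASS_BITS - 1))
	#define PAGE_CLASSES		(DIV_ROUND_UP(PAGE_SIZE - ZS_MIN_ALLOC_SIZE, \
						      PAGE_CLASS_DELTA) + 1)
	#define MULTI_CLASS_DELTA	(MULTI_PAGES_SIZE >> (CLASS_BITS - 1))

	/* Base-page sizes land in the first band, multi-page sizes in the second. */
	static unsigned long size_to_class(unsigned long size)
	{
		if (size > PAGE_SIZE + ZS_HANDLE_SIZE)
			return PAGE_CLASSES +
			       DIV_ROUND_UP(size - PAGE_SIZE, MULTI_CLASS_DELTA);

		if (size <= ZS_MIN_ALLOC_SIZE)
			return 0;

		return DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE, PAGE_CLASS_DELTA);
	}

	int main(void)
	{
		printf("4KiB  -> class %lu\n", size_to_class(4096));	/* last base-page class */
		printf("10KiB -> class %lu\n", size_to_class(10240));	/* a multi-page class */
		printf("16KiB -> class %lu\n", size_to_class(16384));	/* last multi-page class */
		return 0;
	}

Sizes up to PAGE_SIZE fall into the existing base-page classes; anything
larger is mapped into the new multi-page band that follows them.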
drivers/block/zram/zram_drv.c | 3 +-
include/linux/zsmalloc.h | 10 +-
mm/Kconfig | 18 +++
mm/zsmalloc.c | 235 ++++++++++++++++++++++++++--------
4 files changed, 207 insertions(+), 59 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 3dee026988dc..6cb7d1e57362 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1461,8 +1461,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
return false;
}
- if (!huge_class_size)
- huge_class_size = zs_huge_class_size(zram->mem_pool);
+ huge_class_size = zs_huge_class_size(zram->mem_pool, 0);
for (index = 0; index < num_pages; index++)
spin_lock_init(&zram->table[index].lock);
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..9fa3e7669557 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -33,6 +33,14 @@ enum zs_mapmode {
*/
};
+enum zsmalloc_type {
+ ZSMALLOC_TYPE_BASEPAGE,
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+ ZSMALLOC_TYPE_MULTI_PAGES,
+#endif
+ ZSMALLOC_TYPE_MAX,
+};
+
struct zs_pool_stats {
/* How many pages were migrated (freed) */
atomic_long_t pages_compacted;
@@ -46,7 +54,7 @@ void zs_destroy_pool(struct zs_pool *pool);
unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
void zs_free(struct zs_pool *pool, unsigned long obj);
-size_t zs_huge_class_size(struct zs_pool *pool);
+size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type);
void *zs_map_object(struct zs_pool *pool, unsigned long handle,
enum zs_mapmode mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 33fa51d608dc..6b302b66fc0a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -237,6 +237,24 @@ config ZSMALLOC_CHAIN_SIZE
For more information, see zsmalloc documentation.
+config ZSMALLOC_MULTI_PAGES
+ bool "support zsmalloc multiple pages"
+ depends on ZSMALLOC && !HIGHMEM
+ help
+ This option configures zsmalloc to support allocations larger than
+ PAGE_SIZE, enabling compression across multiple pages. The size of
+ these multiple pages is determined by the configured
+ ZSMALLOC_MULTI_PAGES_ORDER.
+
+config ZSMALLOC_MULTI_PAGES_ORDER
+ int "zsmalloc multiple pages order"
+ default 2
+ range 1 9
+ depends on ZSMALLOC_MULTI_PAGES
+ help
+ This option sets the order of the multi-page blocks used by zsmalloc,
+ i.e. objects of up to (PAGE_SIZE << order) bytes can be stored in one block.
+
menu "Slab allocator options"
config SLUB
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 64b66a4d3e6e..ab57266b43f6 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -70,6 +70,12 @@
#define ZSPAGE_MAGIC 0x58
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZSMALLOC_MULTI_PAGES_ORDER (_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
+#define ZSMALLOC_MULTI_PAGES_NR (1 << ZSMALLOC_MULTI_PAGES_ORDER)
+#define ZSMALLOC_MULTI_PAGES_SIZE (PAGE_SIZE * ZSMALLOC_MULTI_PAGES_NR)
+#endif
+
/*
* This must be power of 2 and greater than or equal to sizeof(link_free).
* These two conditions ensure that any 'struct link_free' itself doesn't
@@ -120,7 +126,8 @@
#define HUGE_BITS 1
#define FULLNESS_BITS 4
-#define CLASS_BITS 8
+#define CLASS_BITS 9
+#define ISOLATED_BITS 5
#define MAGIC_VAL_BITS 8
#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(CONFIG_ZSMALLOC_CHAIN_SIZE, UL))
@@ -129,7 +136,11 @@
#define ZS_MIN_ALLOC_SIZE \
MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
/* each chunk includes extra space to keep handle */
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_MAX_ALLOC_SIZE (ZSMALLOC_MULTI_PAGES_SIZE)
+#else
#define ZS_MAX_ALLOC_SIZE PAGE_SIZE
+#endif
/*
* On systems with 4K page size, this gives 255 size classes! There is a
@@ -144,9 +155,22 @@
* ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
* (reason above)
*/
-#define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> CLASS_BITS)
-#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
- ZS_SIZE_CLASS_DELTA) + 1)
+
+#define ZS_PAGE_SIZE_CLASS_DELTA (PAGE_SIZE >> (CLASS_BITS - 1))
+#define ZS_PAGE_SIZE_CLASSES (DIV_ROUND_UP(PAGE_SIZE - ZS_MIN_ALLOC_SIZE, \
+ ZS_PAGE_SIZE_CLASS_DELTA) + 1)
+
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_MULTI_PAGES_SIZE_CLASS_DELTA (ZSMALLOC_MULTI_PAGES_SIZE >> (CLASS_BITS - 1))
+#define ZS_MULTI_PAGES_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - PAGE_SIZE, \
+ ZS_MULTI_PAGES_SIZE_CLASS_DELTA) + 1)
+#endif
+
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_SIZE_CLASSES (ZS_PAGE_SIZE_CLASSES + ZS_MULTI_PAGES_SIZE_CLASSES)
+#else
+#define ZS_SIZE_CLASSES (ZS_PAGE_SIZE_CLASSES)
+#endif
/*
* Pages are distinguished by the ratio of used memory (that is the ratio
@@ -182,7 +206,8 @@ struct zs_size_stat {
static struct dentry *zs_stat_root;
#endif
-static size_t huge_class_size;
+/* huge_class_size[0] for page, huge_class_size[1] for multiple pages. */
+static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
struct size_class {
spinlock_t lock;
@@ -260,6 +285,29 @@ struct zspage {
rwlock_t lock;
};
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+static inline unsigned int class_size_to_zs_order(unsigned long size)
+{
+ unsigned int order = 0;
+
+ /* use a larger order to allocate zspage pages when class_size > PAGE_SIZE */
+ if (size > PAGE_SIZE)
+ return ZSMALLOC_MULTI_PAGES_ORDER;
+
+ return order;
+}
+#else
+static inline unsigned int class_size_to_zs_order(unsigned long size)
+{
+ return 0;
+}
+#endif
+
+static inline unsigned long class_size_to_zs_size(unsigned long size)
+{
+ return PAGE_SIZE * (1 << class_size_to_zs_order(size));
+}
+
struct mapping_area {
local_lock_t lock;
char *vm_buf; /* copy buffer for objects that span pages */
@@ -510,11 +558,22 @@ static int get_size_class_index(int size)
{
int idx = 0;
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+ if (size > PAGE_SIZE + ZS_HANDLE_SIZE) {
+ idx = ZS_PAGE_SIZE_CLASSES;
+ idx += DIV_ROUND_UP(size - PAGE_SIZE,
+ ZS_MULTI_PAGES_SIZE_CLASS_DELTA);
+
+ return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+ }
+#endif
+
+ idx = 0;
if (likely(size > ZS_MIN_ALLOC_SIZE))
- idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
- ZS_SIZE_CLASS_DELTA);
+ idx += DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
+ ZS_PAGE_SIZE_CLASS_DELTA);
- return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+ return min_t(int, ZS_PAGE_SIZE_CLASSES - 1, idx);
}
static inline void class_stat_add(struct size_class *class, int type,
@@ -564,11 +623,11 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
unsigned long total_freeable = 0;
unsigned long inuse_totals[NR_FULLNESS_GROUPS] = {0, };
- seq_printf(s, " %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %8s\n",
- "class", "size", "10%", "20%", "30%", "40%",
+ seq_printf(s, " %5s %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %16s %8s\n",
+ "class", "size", "order", "10%", "20%", "30%", "40%",
"50%", "60%", "70%", "80%", "90%", "99%", "100%",
"obj_allocated", "obj_used", "pages_used",
- "pages_per_zspage", "freeable");
+ "pages_per_zspage", "objs_per_zspage", "freeable");
for (i = 0; i < ZS_SIZE_CLASSES; i++) {
@@ -579,7 +638,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
spin_lock(&class->lock);
- seq_printf(s, " %5u %5u ", i, class->size);
+ seq_printf(s, " %5u %5u %5u", i, class->size, class_size_to_zs_order(class->size));
for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
inuse_totals[fg] += class_stat_read(class, fg);
seq_printf(s, "%9lu ", class_stat_read(class, fg));
@@ -594,9 +653,9 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
pages_used = obj_allocated / objs_per_zspage *
class->pages_per_zspage;
- seq_printf(s, "%13lu %10lu %10lu %16d %8lu\n",
+ seq_printf(s, "%13lu %10lu %10lu %16d %16d %8lu\n",
obj_allocated, obj_used, pages_used,
- class->pages_per_zspage, freeable);
+ class->pages_per_zspage, objs_per_zspage, freeable);
total_objs += obj_allocated;
total_used_objs += obj_used;
@@ -811,7 +870,8 @@ static inline bool obj_allocated(struct page *page, void *obj,
static void reset_page(struct page *page)
{
- __ClearPageMovable(page);
+ if (PageMovable(page))
+ __ClearPageMovable(page);
ClearPagePrivate(page);
set_page_private(page, 0);
page->index = 0;
@@ -863,7 +923,8 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
cache_free_zspage(pool, zspage);
class_stat_sub(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
- atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
+ atomic_long_sub(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
+ &pool->pages_allocated);
}
static void free_zspage(struct zs_pool *pool, struct size_class *class,
@@ -892,6 +953,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
unsigned int freeobj = 1;
unsigned long off = 0;
struct page *page = get_first_page(zspage);
+ unsigned long page_size = class_size_to_zs_size(class->size);
while (page) {
struct page *next_page;
@@ -903,7 +965,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
vaddr = kmap_local_page(page);
link = (struct link_free *)vaddr + off / sizeof(*link);
- while ((off += class->size) < PAGE_SIZE) {
+ while ((off += class->size) < page_size) {
link->next = freeobj++ << OBJ_TAG_BITS;
link += class->size / sizeof(*link);
}
@@ -925,7 +987,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
}
kunmap_local(vaddr);
page = next_page;
- off %= PAGE_SIZE;
+ off %= page_size;
}
set_freeobj(zspage, 0);
@@ -975,6 +1037,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
struct zspage *zspage = cache_alloc_zspage(pool, gfp);
+ unsigned int order = class_size_to_zs_order(class->size);
+
if (!zspage)
return NULL;
@@ -984,12 +1048,14 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
for (i = 0; i < class->pages_per_zspage; i++) {
struct page *page;
- page = alloc_page(gfp);
+ if (order > 0)
+ gfp &= ~__GFP_MOVABLE;
+ page = alloc_pages(gfp | __GFP_COMP, order);
if (!page) {
while (--i >= 0) {
dec_zone_page_state(pages[i], NR_ZSPAGES);
__ClearPageZsmalloc(pages[i]);
- __free_page(pages[i]);
+ __free_pages(pages[i], order);
}
cache_free_zspage(pool, zspage);
return NULL;
@@ -1047,7 +1113,9 @@ static void *__zs_map_object(struct mapping_area *area,
struct page *pages[2], int off, int size)
{
size_t sizes[2];
+ void *addr;
char *buf = area->vm_buf;
+ unsigned long page_size = class_size_to_zs_size(size);
/* disable page faults to match kmap_local_page() return conditions */
pagefault_disable();
@@ -1056,12 +1124,16 @@ static void *__zs_map_object(struct mapping_area *area,
if (area->vm_mm == ZS_MM_WO)
goto out;
- sizes[0] = PAGE_SIZE - off;
+ sizes[0] = page_size - off;
sizes[1] = size - sizes[0];
/* copy object to per-cpu buffer */
- memcpy_from_page(buf, pages[0], off, sizes[0]);
- memcpy_from_page(buf + sizes[0], pages[1], 0, sizes[1]);
+ addr = kmap_local_page(pages[0]);
+ memcpy(buf, addr + off, sizes[0]);
+ kunmap_local(addr);
+ addr = kmap_local_page(pages[1]);
+ memcpy(buf + sizes[0], addr, sizes[1]);
+ kunmap_local(addr);
out:
return area->vm_buf;
}
@@ -1070,7 +1142,9 @@ static void __zs_unmap_object(struct mapping_area *area,
struct page *pages[2], int off, int size)
{
size_t sizes[2];
+ void *addr;
char *buf;
+ unsigned long page_size = class_size_to_zs_size(size);
/* no write fastpath */
if (area->vm_mm == ZS_MM_RO)
@@ -1081,12 +1155,16 @@ static void __zs_unmap_object(struct mapping_area *area,
size -= ZS_HANDLE_SIZE;
off += ZS_HANDLE_SIZE;
- sizes[0] = PAGE_SIZE - off;
+ sizes[0] = page_size - off;
sizes[1] = size - sizes[0];
/* copy per-cpu buffer to object */
- memcpy_to_page(pages[0], off, buf, sizes[0]);
- memcpy_to_page(pages[1], 0, buf + sizes[0], sizes[1]);
+ addr = kmap_local_page(pages[0]);
+ memcpy(addr + off, buf, sizes[0]);
+ kunmap_local(addr);
+ addr = kmap_local_page(pages[1]);
+ memcpy(addr, buf + sizes[0], sizes[1]);
+ kunmap_local(addr);
out:
/* enable page faults to match kunmap_local() return conditions */
@@ -1184,6 +1262,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
struct mapping_area *area;
struct page *pages[2];
void *ret;
+ unsigned long page_size;
+ unsigned long page_mask;
/*
* Because we use per-cpu mapping areas shared among the
@@ -1208,12 +1288,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
read_unlock(&pool->migrate_lock);
class = zspage_class(pool, zspage);
- off = offset_in_page(class->size * obj_idx);
+ page_size = class_size_to_zs_size(class->size);
+ page_mask = ~(page_size - 1);
+ off = (class->size * obj_idx) & ~page_mask;
local_lock(&zs_map_area.lock);
area = this_cpu_ptr(&zs_map_area);
area->vm_mm = mm;
- if (off + class->size <= PAGE_SIZE) {
+ if (off + class->size <= page_size) {
/* this object is contained entirely within a page */
area->vm_addr = kmap_local_page(page);
ret = area->vm_addr + off;
@@ -1243,15 +1325,20 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
struct size_class *class;
struct mapping_area *area;
+ unsigned long page_size;
+ unsigned long page_mask;
obj = handle_to_obj(handle);
obj_to_location(obj, &page, &obj_idx);
zspage = get_zspage(page);
class = zspage_class(pool, zspage);
- off = offset_in_page(class->size * obj_idx);
+
+ page_size = class_size_to_zs_size(class->size);
+ page_mask = ~(page_size - 1);
+ off = (class->size * obj_idx) & ~page_mask;
area = this_cpu_ptr(&zs_map_area);
- if (off + class->size <= PAGE_SIZE)
+ if (off + class->size <= page_size)
kunmap_local(area->vm_addr);
else {
struct page *pages[2];
@@ -1281,9 +1368,9 @@ EXPORT_SYMBOL_GPL(zs_unmap_object);
*
* Return: the size (in bytes) of the first huge zsmalloc &size_class.
*/
-size_t zs_huge_class_size(struct zs_pool *pool)
+size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type)
{
- return huge_class_size;
+ return huge_class_size[type];
}
EXPORT_SYMBOL_GPL(zs_huge_class_size);
@@ -1298,13 +1385,21 @@ static unsigned long obj_malloc(struct zs_pool *pool,
struct page *m_page;
unsigned long m_offset;
void *vaddr;
+ unsigned long page_size;
+ unsigned long page_mask;
+ unsigned long page_shift;
class = pool->size_class[zspage->class];
obj = get_freeobj(zspage);
offset = obj * class->size;
- nr_page = offset >> PAGE_SHIFT;
- m_offset = offset_in_page(offset);
+ page_size = class_size_to_zs_size(class->size);
+ page_shift = PAGE_SHIFT + class_size_to_zs_order(class->size);
+ page_mask = ~(page_size - 1);
+
+ nr_page = offset >> page_shift;
+ m_offset = offset & ~page_mask;
+
m_page = get_first_page(zspage);
for (i = 0; i < nr_page; i++)
@@ -1385,12 +1480,14 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
obj_malloc(pool, zspage, handle);
newfg = get_fullness_group(class, zspage);
insert_zspage(class, zspage, newfg);
- atomic_long_add(class->pages_per_zspage, &pool->pages_allocated);
+ atomic_long_add(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
+ &pool->pages_allocated);
class_stat_add(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
class_stat_add(class, ZS_OBJS_INUSE, 1);
/* We completely set up zspage so mark them as movable */
- SetZsPageMovable(pool, zspage);
+ if (class_size_to_zs_order(class->size) == 0)
+ SetZsPageMovable(pool, zspage);
out:
spin_unlock(&class->lock);
@@ -1406,9 +1503,14 @@ static void obj_free(int class_size, unsigned long obj)
unsigned long f_offset;
unsigned int f_objidx;
void *vaddr;
+ unsigned long page_size;
+ unsigned long page_mask;
obj_to_location(obj, &f_page, &f_objidx);
- f_offset = offset_in_page(class_size * f_objidx);
+ page_size = class_size_to_zs_size(class_size);
+ page_mask = ~(page_size - 1);
+
+ f_offset = (class_size * f_objidx) & ~page_mask;
zspage = get_zspage(f_page);
vaddr = kmap_local_page(f_page);
@@ -1469,20 +1571,22 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
void *s_addr, *d_addr;
int s_size, d_size, size;
int written = 0;
+ unsigned long page_size = class_size_to_zs_size(class->size);
+ unsigned long page_mask = ~(page_size - 1);
s_size = d_size = class->size;
obj_to_location(src, &s_page, &s_objidx);
obj_to_location(dst, &d_page, &d_objidx);
- s_off = offset_in_page(class->size * s_objidx);
- d_off = offset_in_page(class->size * d_objidx);
+ s_off = (class->size * s_objidx) & ~page_mask;
+ d_off = (class->size * d_objidx) & ~page_mask;
- if (s_off + class->size > PAGE_SIZE)
- s_size = PAGE_SIZE - s_off;
+ if (s_off + class->size > page_size)
+ s_size = page_size - s_off;
- if (d_off + class->size > PAGE_SIZE)
- d_size = PAGE_SIZE - d_off;
+ if (d_off + class->size > page_size)
+ d_size = page_size - d_off;
s_addr = kmap_local_page(s_page);
d_addr = kmap_local_page(d_page);
@@ -1507,7 +1611,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
* kunmap_local(d_addr). For more details see
* Documentation/mm/highmem.rst.
*/
- if (s_off >= PAGE_SIZE) {
+ if (s_off >= page_size) {
kunmap_local(d_addr);
kunmap_local(s_addr);
s_page = get_next_page(s_page);
@@ -1517,7 +1621,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
s_off = 0;
}
- if (d_off >= PAGE_SIZE) {
+ if (d_off >= page_size) {
kunmap_local(d_addr);
d_page = get_next_page(d_page);
d_addr = kmap_local_page(d_page);
@@ -1541,11 +1645,12 @@ static unsigned long find_alloced_obj(struct size_class *class,
int index = *obj_idx;
unsigned long handle = 0;
void *addr = kmap_local_page(page);
+ unsigned long page_size = class_size_to_zs_size(class->size);
offset = get_first_obj_offset(page);
offset += class->size * index;
- while (offset < PAGE_SIZE) {
+ while (offset < page_size) {
if (obj_allocated(page, addr + offset, &handle))
break;
@@ -1765,6 +1870,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
unsigned long handle;
unsigned long old_obj, new_obj;
unsigned int obj_idx;
+ unsigned int page_size = PAGE_SIZE;
VM_BUG_ON_PAGE(!PageIsolated(page), page);
@@ -1781,6 +1887,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
*/
write_lock(&pool->migrate_lock);
class = zspage_class(pool, zspage);
+ page_size = class_size_to_zs_size(class->size);
/*
* the class lock protects zpage alloc/free in the zspage.
@@ -1796,10 +1903,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* Here, any user cannot access all objects in the zspage so let's move.
*/
d_addr = kmap_local_page(newpage);
- copy_page(d_addr, s_addr);
+ memcpy(d_addr, s_addr, page_size);
kunmap_local(d_addr);
- for (addr = s_addr + offset; addr < s_addr + PAGE_SIZE;
+ for (addr = s_addr + offset; addr < s_addr + page_size;
addr += class->size) {
if (obj_allocated(page, addr, &handle)) {
@@ -2085,6 +2192,7 @@ static int calculate_zspage_chain_size(int class_size)
{
int i, min_waste = INT_MAX;
int chain_size = 1;
+ unsigned long page_size = class_size_to_zs_size(class_size);
if (is_power_of_2(class_size))
return chain_size;
@@ -2092,7 +2200,7 @@ static int calculate_zspage_chain_size(int class_size)
for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
int waste;
- waste = (i * PAGE_SIZE) % class_size;
+ waste = (i * page_size) % class_size;
if (waste < min_waste) {
min_waste = waste;
chain_size = i;
@@ -2138,18 +2246,33 @@ struct zs_pool *zs_create_pool(const char *name)
* for merging should be larger or equal to current size.
*/
for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
- int size;
+ unsigned int size = 0;
int pages_per_zspage;
int objs_per_zspage;
struct size_class *class;
int fullness;
+ int order = 0;
+ int idx = ZSMALLOC_TYPE_BASEPAGE;
+
+ if (i < ZS_PAGE_SIZE_CLASSES)
+ size = ZS_MIN_ALLOC_SIZE + i * ZS_PAGE_SIZE_CLASS_DELTA;
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+ if (i >= ZS_PAGE_SIZE_CLASSES)
+ size = PAGE_SIZE + (i - ZS_PAGE_SIZE_CLASSES) *
+ ZS_MULTI_PAGES_SIZE_CLASS_DELTA;
+#endif
- size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
if (size > ZS_MAX_ALLOC_SIZE)
size = ZS_MAX_ALLOC_SIZE;
- pages_per_zspage = calculate_zspage_chain_size(size);
- objs_per_zspage = pages_per_zspage * PAGE_SIZE / size;
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+ order = class_size_to_zs_order(size);
+ if (order == ZSMALLOC_MULTI_PAGES_ORDER)
+ idx = ZSMALLOC_TYPE_MULTI_PAGES;
+#endif
+
+ pages_per_zspage = calculate_zspage_chain_size(size);
+ objs_per_zspage = pages_per_zspage * PAGE_SIZE * (1 << order) / size;
/*
* We iterate from biggest down to smallest classes,
* so huge_class_size holds the size of the first huge
@@ -2157,8 +2280,8 @@ struct zs_pool *zs_create_pool(const char *name)
* endup in the huge class.
*/
if (pages_per_zspage != 1 && objs_per_zspage != 1 &&
- !huge_class_size) {
- huge_class_size = size;
+ !huge_class_size[idx]) {
+ huge_class_size[idx] = size;
/*
* The object uses ZS_HANDLE_SIZE bytes to store the
* handle. We need to subtract it, because zs_malloc()
@@ -2168,7 +2291,7 @@ struct zs_pool *zs_create_pool(const char *name)
* class because it grows by ZS_HANDLE_SIZE extra bytes
* right before class lookup.
*/
- huge_class_size -= (ZS_HANDLE_SIZE - 1);
+ huge_class_size[idx] -= (ZS_HANDLE_SIZE - 1);
}
/*
--
2.39.3 (Apple Git-146)
* [PATCH RFC v3 2/4] zram: support compression at the granularity of multi-pages
2024-11-21 22:25 [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
@ 2024-11-21 22:25 ` Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 3/4] zram: backend_zstd: Adjust estimated_src_size to accommodate multi-page compression Barry Song
` (2 subsequent siblings)
4 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2024-11-21 22:25 UTC (permalink / raw)
To: akpm, linux-mm
Cc: axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
From: Tangquan Zheng <zhengtangquan@oppo.com>
Currently, when a large folio with nr_pages subpages is submitted to
zram, it is divided into nr_pages parts, each compressed and stored
individually. Compressing at a larger granularity notably improves
compression ratios while reducing CPU consumption.
This patch allows large folios to be divided at the granularity
specified by ZSMALLOC_MULTI_PAGES_ORDER, which defaults to 2. For
instance, a 128KiB folio is compressed as eight 16KiB multi-pages.
The following data illustrates the compression time and compressed
size for typical anonymous pages gathered from Android phones.
Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
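For reference, here is a stand-alone sketch (not the driver code; the
SECTOR_SHIFT/PAGE_SHIFT values are the usual assumptions) of how the
multi-page read/write paths derive the head slot index and the in-block
offset from a bio's sector, mirroring the masks used in
zram_bio_write_multi_pages():

	/* Illustrative sketch only -- not part of the patch. */
	#include <stdio.h>

	#define SECTOR_SHIFT		9
	#define PAGE_SHIFT		12
	#define SECTORS_PER_PAGE_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
	#define MULTI_PAGES_ORDER	2
	#define MULTI_PAGES_NR		(1 << MULTI_PAGES_ORDER)
	#define MULTI_PAGE_SHIFT	(PAGE_SHIFT + MULTI_PAGES_ORDER)
	#define SECTORS_PER_MULTI_PAGE	(1 << (MULTI_PAGE_SHIFT - SECTOR_SHIFT))

	int main(void)
	{
		/* A 64KiB folio starting at sector 256 (page index 32). */
		unsigned long start_sector = 256;
		unsigned long nr_sectors = (64 * 1024) >> SECTOR_SHIFT;

		for (unsigned long s = start_sector; s < start_sector + nr_sectors;
		     s += SECTORS_PER_MULTI_PAGE) {
			/* Head index: first page slot of the 4-page block. */
			unsigned long index = (s >> SECTORS_PER_PAGE_SHIFT) &
					      ~((unsigned long)MULTI_PAGES_NR - 1);
			unsigned long offset = (s & (SECTORS_PER_MULTI_PAGE - 1)) <<
					       SECTOR_SHIFT;

			printf("block: head index %lu, offset %lu\n", index, offset);
		}
		return 0;
	}

Every 4-page block is keyed by its head slot; the remaining slots carry
the ZRAM_COMP_MULTI_PAGES flag so that frees and overwrites can locate
the head.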
drivers/block/zram/Kconfig | 9 +
drivers/block/zram/zcomp.c | 17 +-
drivers/block/zram/zcomp.h | 12 +-
drivers/block/zram/zram_drv.c | 449 +++++++++++++++++++++++++++++++---
drivers/block/zram/zram_drv.h | 45 ++++
5 files changed, 495 insertions(+), 37 deletions(-)
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 402b7b175863..716e92c5fdfe 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -145,3 +145,12 @@ config ZRAM_MULTI_COMP
re-compress pages using a potentially slower but more effective
compression algorithm. Note, that IDLE page recompression
requires ZRAM_TRACK_ENTRY_ACTIME.
+
+config ZRAM_MULTI_PAGES
+ bool "Enable multiple pages compression and decompression"
+ depends on ZRAM && ZSMALLOC_MULTI_PAGES
+ help
+ By default, zram splits a large folio into nr_pages blocks of PAGE_SIZE
+ each and compresses them separately. This option tunes zram to improve
+ compression granularity by dividing large folios into larger parts, as
+ defined by the configuration option ZSMALLOC_MULTI_PAGES_ORDER.
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index bb514403e305..44f5b404495a 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -52,6 +52,11 @@ static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm)
static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
{
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ unsigned long page_size = ZCOMP_MULTI_PAGES_SIZE;
+#else
+ unsigned long page_size = PAGE_SIZE;
+#endif
int ret;
ret = comp->ops->create_ctx(comp->params, &zstrm->ctx);
@@ -62,7 +67,7 @@ static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
* allocate 2 pages. 1 for compressed data, plus 1 extra for the
* case when compressed size is larger than the original one
*/
- zstrm->buffer = vzalloc(2 * PAGE_SIZE);
+ zstrm->buffer = vzalloc(2 * page_size);
if (!zstrm->buffer) {
zcomp_strm_free(comp, zstrm);
return -ENOMEM;
@@ -119,13 +124,13 @@ void zcomp_stream_put(struct zcomp *comp)
}
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
- const void *src, unsigned int *dst_len)
+ const void *src, unsigned int src_len, unsigned int *dst_len)
{
struct zcomp_req req = {
.src = src,
.dst = zstrm->buffer,
- .src_len = PAGE_SIZE,
- .dst_len = 2 * PAGE_SIZE,
+ .src_len = src_len,
+ .dst_len = src_len * 2,
};
int ret;
@@ -136,13 +141,13 @@ int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
}
int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
- const void *src, unsigned int src_len, void *dst)
+ const void *src, unsigned int src_len, void *dst, unsigned int dst_len)
{
struct zcomp_req req = {
.src = src,
.dst = dst,
.src_len = src_len,
- .dst_len = PAGE_SIZE,
+ .dst_len = dst_len,
};
return comp->ops->decompress(comp->params, &zstrm->ctx, &req);
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index ad5762813842..471c16be293c 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -30,6 +30,13 @@ struct zcomp_ctx {
void *context;
};
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define ZCOMP_MULTI_PAGES_ORDER (_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
+#define ZCOMP_MULTI_PAGES_NR (1 << ZCOMP_MULTI_PAGES_ORDER)
+#define ZCOMP_MULTI_PAGES_SIZE (PAGE_SIZE * ZCOMP_MULTI_PAGES_NR)
+#define MULTI_PAGE_SHIFT (ZCOMP_MULTI_PAGES_ORDER + PAGE_SHIFT)
+#endif
+
struct zcomp_strm {
local_lock_t lock;
/* compression buffer */
@@ -80,8 +87,9 @@ struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
void zcomp_stream_put(struct zcomp *comp);
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
- const void *src, unsigned int *dst_len);
+ const void *src, unsigned int src_len, unsigned int *dst_len);
int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
- const void *src, unsigned int src_len, void *dst);
+ const void *src, unsigned int src_len, void *dst,
+ unsigned int dst_len);
#endif /* _ZCOMP_H_ */
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6cb7d1e57362..90f87894ff3e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -50,7 +50,7 @@ static unsigned int num_devices = 1;
* Pages that compress to sizes equals or greater than this are stored
* uncompressed in memory.
*/
-static size_t huge_class_size;
+static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
static const struct block_device_operations zram_devops;
@@ -296,11 +296,11 @@ static inline void zram_fill_page(void *ptr, unsigned long len,
memset_l(ptr, value, len / sizeof(unsigned long));
}
-static bool page_same_filled(void *ptr, unsigned long *element)
+static bool page_same_filled(void *ptr, unsigned long *element, unsigned int page_size)
{
unsigned long *page;
unsigned long val;
- unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
+ unsigned int pos, last_pos = page_size / sizeof(*page) - 1;
page = (unsigned long *)ptr;
val = page[0];
@@ -1426,13 +1426,40 @@ static ssize_t debug_stat_show(struct device *dev,
return ret;
}
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+static ssize_t multi_pages_debug_stat_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct zram *zram = dev_to_zram(dev);
+ ssize_t ret = 0;
+
+ down_read(&zram->init_lock);
+ ret = scnprintf(buf, PAGE_SIZE,
+ "zram_bio write/read multi_pages count:%8llu %8llu\n"
+ "zram_bio failed write/read multi_pages count%8llu %8llu\n"
+ "zram_bio partial write/read multi_pages count%8llu %8llu\n"
+ "multi_pages_miss_free %8llu\n",
+ (u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_count),
+ (u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_count),
+ (u64)atomic64_read(&zram->stats.multi_pages_failed_writes),
+ (u64)atomic64_read(&zram->stats.multi_pages_failed_reads),
+ (u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_partial_count),
+ (u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_partial_count),
+ (u64)atomic64_read(&zram->stats.multi_pages_miss_free));
+ up_read(&zram->init_lock);
+
+ return ret;
+}
+#endif
static DEVICE_ATTR_RO(io_stat);
static DEVICE_ATTR_RO(mm_stat);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RO(bd_stat);
#endif
static DEVICE_ATTR_RO(debug_stat);
-
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+static DEVICE_ATTR_RO(multi_pages_debug_stat);
+#endif
static void zram_meta_free(struct zram *zram, u64 disksize)
{
size_t num_pages = disksize >> PAGE_SHIFT;
@@ -1449,6 +1476,7 @@ static void zram_meta_free(struct zram *zram, u64 disksize)
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
{
size_t num_pages, index;
+ int i;
num_pages = disksize >> PAGE_SHIFT;
zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
@@ -1461,7 +1489,10 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
return false;
}
- huge_class_size = zs_huge_class_size(zram->mem_pool, 0);
+ for (i = 0; i < ZSMALLOC_TYPE_MAX; i++) {
+ if (!huge_class_size[i])
+ huge_class_size[i] = zs_huge_class_size(zram->mem_pool, i);
+ }
for (index = 0; index < num_pages; index++)
spin_lock_init(&zram->table[index].lock);
@@ -1476,10 +1507,17 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
static void zram_free_page(struct zram *zram, size_t index)
{
unsigned long handle;
+ int nr_pages = 1;
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
zram->table[index].ac_time = 0;
#endif
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ if (zram_test_flag(zram, index, ZRAM_COMP_MULTI_PAGES)) {
+ zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+ nr_pages = ZCOMP_MULTI_PAGES_NR;
+ }
+#endif
zram_clear_flag(zram, index, ZRAM_IDLE);
zram_clear_flag(zram, index, ZRAM_INCOMPRESSIBLE);
@@ -1503,7 +1541,7 @@ static void zram_free_page(struct zram *zram, size_t index)
*/
if (zram_test_flag(zram, index, ZRAM_SAME)) {
zram_clear_flag(zram, index, ZRAM_SAME);
- atomic64_dec(&zram->stats.same_pages);
+ atomic64_sub(nr_pages, &zram->stats.same_pages);
goto out;
}
@@ -1516,7 +1554,7 @@ static void zram_free_page(struct zram *zram, size_t index)
atomic64_sub(zram_get_obj_size(zram, index),
&zram->stats.compr_data_size);
out:
- atomic64_dec(&zram->stats.pages_stored);
+ atomic64_sub(nr_pages, &zram->stats.pages_stored);
zram_set_handle(zram, index, 0);
zram_set_obj_size(zram, index, 0);
}
@@ -1526,7 +1564,7 @@ static void zram_free_page(struct zram *zram, size_t index)
* Corresponding ZRAM slot should be locked.
*/
static int zram_read_from_zspool(struct zram *zram, struct page *page,
- u32 index)
+ u32 index, enum zsmalloc_type zs_type)
{
struct zcomp_strm *zstrm;
unsigned long handle;
@@ -1534,6 +1572,12 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
void *src, *dst;
u32 prio;
int ret;
+ unsigned long page_size = PAGE_SIZE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ if (zs_type == ZSMALLOC_TYPE_MULTI_PAGES)
+ page_size = ZCOMP_MULTI_PAGES_SIZE;
+#endif
handle = zram_get_handle(zram, index);
if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
@@ -1542,28 +1586,28 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
value = handle ? zram_get_element(zram, index) : 0;
mem = kmap_local_page(page);
- zram_fill_page(mem, PAGE_SIZE, value);
+ zram_fill_page(mem, page_size, value);
kunmap_local(mem);
return 0;
}
size = zram_get_obj_size(zram, index);
- if (size != PAGE_SIZE) {
+ if (size != page_size) {
prio = zram_get_priority(zram, index);
zstrm = zcomp_stream_get(zram->comps[prio]);
}
src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
- if (size == PAGE_SIZE) {
+ if (size == page_size) {
dst = kmap_local_page(page);
- copy_page(dst, src);
+ memcpy(dst, src, page_size);
kunmap_local(dst);
ret = 0;
} else {
dst = kmap_local_page(page);
ret = zcomp_decompress(zram->comps[prio], zstrm,
- src, size, dst);
+ src, size, dst, page_size);
kunmap_local(dst);
zcomp_stream_put(zram->comps[prio]);
}
@@ -1579,7 +1623,7 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
zram_slot_lock(zram, index);
if (!zram_test_flag(zram, index, ZRAM_WB)) {
/* Slot should be locked through out the function call */
- ret = zram_read_from_zspool(zram, page, index);
+ ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
zram_slot_unlock(zram, index);
} else {
/*
@@ -1636,13 +1680,24 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
struct zcomp_strm *zstrm;
unsigned long element = 0;
enum zram_pageflags flags = 0;
+ unsigned long page_size = PAGE_SIZE;
+ int huge_class_idx = ZSMALLOC_TYPE_BASEPAGE;
+ int nr_pages = 1;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ if (folio_size(page_folio(page)) >= ZCOMP_MULTI_PAGES_SIZE) {
+ page_size = ZCOMP_MULTI_PAGES_SIZE;
+ huge_class_idx = ZSMALLOC_TYPE_MULTI_PAGES;
+ nr_pages = ZCOMP_MULTI_PAGES_NR;
+ }
+#endif
mem = kmap_local_page(page);
- if (page_same_filled(mem, &element)) {
+ if (page_same_filled(mem, &element, page_size)) {
kunmap_local(mem);
/* Free memory associated with this sector now. */
flags = ZRAM_SAME;
- atomic64_inc(&zram->stats.same_pages);
+ atomic64_add(nr_pages, &zram->stats.same_pages);
goto out;
}
kunmap_local(mem);
@@ -1651,7 +1706,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
src = kmap_local_page(page);
ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
- src, &comp_len);
+ src, page_size, &comp_len);
kunmap_local(src);
if (unlikely(ret)) {
@@ -1661,8 +1716,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
return ret;
}
- if (comp_len >= huge_class_size)
- comp_len = PAGE_SIZE;
+ if (comp_len >= huge_class_size[huge_class_idx])
+ comp_len = page_size;
/*
* handle allocation has 2 paths:
* a) fast path is executed with preemption disabled (for
@@ -1691,7 +1746,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
if (IS_ERR_VALUE(handle))
return PTR_ERR((void *)handle);
- if (comp_len != PAGE_SIZE)
+ if (comp_len != page_size)
goto compress_again;
/*
* If the page is not compressible, you need to acquire the
@@ -1715,10 +1770,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
src = zstrm->buffer;
- if (comp_len == PAGE_SIZE)
+ if (comp_len == page_size)
src = kmap_local_page(page);
memcpy(dst, src, comp_len);
- if (comp_len == PAGE_SIZE)
+ if (comp_len == page_size)
kunmap_local(src);
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
@@ -1732,7 +1787,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
zram_slot_lock(zram, index);
zram_free_page(zram, index);
- if (comp_len == PAGE_SIZE) {
+ if (comp_len == page_size) {
zram_set_flag(zram, index, ZRAM_HUGE);
atomic64_inc(&zram->stats.huge_pages);
atomic64_inc(&zram->stats.huge_pages_since);
@@ -1745,10 +1800,19 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
zram_set_handle(zram, index, handle);
zram_set_obj_size(zram, index, comp_len);
}
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ if (page_size == ZCOMP_MULTI_PAGES_SIZE) {
+ /* Set multi-pages compression flag for free or overwriting */
+ for (int i = 0; i < ZCOMP_MULTI_PAGES_NR; i++)
+ zram_set_flag(zram, index + i, ZRAM_COMP_MULTI_PAGES);
+ }
+#endif
+
zram_slot_unlock(zram, index);
/* Update stats */
- atomic64_inc(&zram->stats.pages_stored);
+ atomic64_add(nr_pages, &zram->stats.pages_stored);
return ret;
}
@@ -1861,7 +1925,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (comp_len_old < threshold)
return 0;
- ret = zram_read_from_zspool(zram, page, index);
+ ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
if (ret)
return ret;
@@ -1892,7 +1956,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
zstrm = zcomp_stream_get(zram->comps[prio]);
src = kmap_local_page(page);
ret = zcomp_compress(zram->comps[prio], zstrm,
- src, &comp_len_new);
+ src, PAGE_SIZE, &comp_len_new);
kunmap_local(src);
if (ret) {
@@ -2056,7 +2120,7 @@ static ssize_t recompress_store(struct device *dev,
}
}
- if (threshold >= huge_class_size)
+ if (threshold >= huge_class_size[ZSMALLOC_TYPE_BASEPAGE])
return -EINVAL;
down_read(&zram->init_lock);
@@ -2178,7 +2242,7 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio)
bio_endio(bio);
}
-static void zram_bio_read(struct zram *zram, struct bio *bio)
+static void zram_bio_read_page(struct zram *zram, struct bio *bio)
{
unsigned long start_time = bio_start_io_acct(bio);
struct bvec_iter iter = bio->bi_iter;
@@ -2209,7 +2273,7 @@ static void zram_bio_read(struct zram *zram, struct bio *bio)
bio_endio(bio);
}
-static void zram_bio_write(struct zram *zram, struct bio *bio)
+static void zram_bio_write_page(struct zram *zram, struct bio *bio)
{
unsigned long start_time = bio_start_io_acct(bio);
struct bvec_iter iter = bio->bi_iter;
@@ -2239,6 +2303,311 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
bio_endio(bio);
}
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+
+/*
+ * The slot was compressed as part of a multi-page block if any slot in its
+ * block has the ZRAM_COMP_MULTI_PAGES flag set.
+ * Return: 0 : compressed at page granularity
+ */
+static inline int __test_multi_pages_comp(struct zram *zram, u32 index)
+{
+ int i;
+ int count = 0;
+ int head_index = index & ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+
+ for (i = 0; i < ZCOMP_MULTI_PAGES_NR; i++) {
+ if (zram_test_flag(zram, head_index + i, ZRAM_COMP_MULTI_PAGES))
+ count++;
+ }
+
+ return count;
+}
+
+static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+ u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+ if (bio->bi_io_vec->bv_len >= ZCOMP_MULTI_PAGES_SIZE)
+ return true;
+
+ zram_slot_lock(zram, index);
+ if (__test_multi_pages_comp(zram, index)) {
+ zram_slot_unlock(zram, index);
+ return true;
+ }
+ zram_slot_unlock(zram, index);
+
+ return false;
+}
+
+static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+ u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+ return !!__test_multi_pages_comp(zram, index);
+}
+
+static inline bool is_multi_pages_partial_io(struct bio_vec *bvec)
+{
+ return bvec->bv_len != ZCOMP_MULTI_PAGES_SIZE;
+}
+
+static int zram_read_multi_pages(struct zram *zram, struct page *page, u32 index,
+ struct bio *parent)
+{
+ int ret;
+
+ zram_slot_lock(zram, index);
+ if (!zram_test_flag(zram, index, ZRAM_WB)) {
+ /* Slot should be locked through out the function call */
+ ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_MULTI_PAGES);
+ zram_slot_unlock(zram, index);
+ } else {
+ /*
+ * The slot should be unlocked before reading from the backing
+ * device.
+ */
+ zram_slot_unlock(zram, index);
+
+ ret = read_from_bdev(zram, page, zram_get_element(zram, index),
+ parent);
+ }
+
+ /* Should NEVER happen. Return bio error if it does. */
+ if (WARN_ON(ret < 0))
+ pr_err("Decompression failed! err=%d, page=%u\n", ret, index);
+
+ return ret;
+}
+
+static int zram_read_partial_from_zspool(struct zram *zram, struct page *page,
+ u32 index, enum zsmalloc_type zs_type, int offset)
+{
+ struct zcomp_strm *zstrm;
+ unsigned long handle;
+ unsigned int size;
+ void *src, *dst;
+ u32 prio;
+ int ret;
+ unsigned long page_size = PAGE_SIZE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ if (zs_type == ZSMALLOC_TYPE_MULTI_PAGES)
+ page_size = ZCOMP_MULTI_PAGES_SIZE;
+#endif
+
+ handle = zram_get_handle(zram, index);
+ if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
+ unsigned long value;
+ void *mem;
+
+ value = handle ? zram_get_element(zram, index) : 0;
+ mem = kmap_local_page(page);
+ atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+ zram_fill_page(mem, PAGE_SIZE, value);
+ kunmap_local(mem);
+ return 0;
+ }
+
+ size = zram_get_obj_size(zram, index);
+
+ if (size != page_size) {
+ prio = zram_get_priority(zram, index);
+ zstrm = zcomp_stream_get(zram->comps[prio]);
+ }
+
+ src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ if (size == page_size) {
+ dst = kmap_local_page(page);
+ atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+ memcpy(dst, src + offset, PAGE_SIZE);
+ kunmap_local(dst);
+ ret = 0;
+ } else {
+ dst = kmap_local_page(page);
+ /* decompress the whole multi-page block into zstrm->buffer, then copy one page to dst */
+ atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+ ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, zstrm->buffer, page_size);
+ memcpy(dst, zstrm->buffer + offset, PAGE_SIZE);
+ kunmap_local(dst);
+ zcomp_stream_put(zram->comps[prio]);
+ }
+ zs_unmap_object(zram->mem_pool, handle);
+ return ret;
+}
+
+/*
+ * Use a temporary buffer to decompress the page, as the decompressor
+ * always expects a full page for the output.
+ */
+static int zram_bvec_read_multi_pages_partial(struct zram *zram, struct page *page, u32 index,
+ struct bio *parent, int offset)
+{
+ int ret;
+
+ zram_slot_lock(zram, index);
+ if (!zram_test_flag(zram, index, ZRAM_WB)) {
+ /* Slot should be locked through out the function call */
+ ret = zram_read_partial_from_zspool(zram, page, index, ZSMALLOC_TYPE_MULTI_PAGES, offset);
+ zram_slot_unlock(zram, index);
+ } else {
+ /*
+ * The slot should be unlocked before reading from the backing
+ * device.
+ */
+ zram_slot_unlock(zram, index);
+
+ ret = read_from_bdev(zram, page, zram_get_element(zram, index),
+ parent);
+ }
+
+ /* Should NEVER happen. Return bio error if it does. */
+ if (WARN_ON(ret < 0))
+ pr_err("Decompression failed! err=%d, page=%u offset=%d\n", ret, index, offset);
+
+ return ret;
+}
+
+static int zram_bvec_read_multi_pages(struct zram *zram, struct bio_vec *bvec,
+ u32 index, int offset, struct bio *bio)
+{
+ if (is_multi_pages_partial_io(bvec))
+ return zram_bvec_read_multi_pages_partial(zram, bvec->bv_page, index, bio, offset);
+ return zram_read_multi_pages(zram, bvec->bv_page, index, bio);
+}
+
+/*
+ * This is a partial IO. Read the full page before writing the changes.
+ */
+static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
+ u32 index, int offset, struct bio *bio)
+{
+ struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
+ int ret;
+ void *src, *dst;
+
+ if (!page)
+ return -ENOMEM;
+
+ ret = zram_read_multi_pages(zram, page, index, bio);
+ if (!ret) {
+ src = kmap_local_page(bvec->bv_page);
+ dst = kmap_local_page(page);
+ memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
+ kunmap_local(dst);
+ kunmap_local(src);
+
+ atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
+ ret = zram_write_page(zram, page, index);
+ }
+ __free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
+ return ret;
+}
+
+static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
+ u32 index, int offset, struct bio *bio)
+{
+ if (is_multi_pages_partial_io(bvec))
+ return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
+ return zram_write_page(zram, bvec->bv_page, index);
+}
+
+
+static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio)
+{
+ unsigned long start_time = bio_start_io_acct(bio);
+ struct bvec_iter iter = bio->bi_iter;
+
+ do {
+ /* Use head index, and other indexes are used as offset */
+ u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+ ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+ u32 offset = (iter.bi_sector & (SECTORS_PER_MULTI_PAGE - 1)) << SECTOR_SHIFT;
+ struct bio_vec bv = multi_pages_bio_iter_iovec(bio, iter);
+
+ atomic64_add(1, &zram->stats.zram_bio_read_multi_pages_count);
+ bv.bv_len = min_t(u32, bv.bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
+
+ if (zram_bvec_read_multi_pages(zram, &bv, index, offset, bio) < 0) {
+ atomic64_inc(&zram->stats.multi_pages_failed_reads);
+ bio->bi_status = BLK_STS_IOERR;
+ break;
+ }
+ flush_dcache_page(bv.bv_page);
+
+ zram_slot_lock(zram, index);
+ zram_accessed(zram, index);
+ zram_slot_unlock(zram, index);
+
+ bio_advance_iter_single(bio, &iter, bv.bv_len);
+ } while (iter.bi_size);
+
+ bio_end_io_acct(bio, start_time);
+ bio_endio(bio);
+}
+
+static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio)
+{
+ unsigned long start_time = bio_start_io_acct(bio);
+ struct bvec_iter iter = bio->bi_iter;
+
+ do {
+ /* Use head index, and other indexes are used as offset */
+ u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+ ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+ u32 offset = (iter.bi_sector & (SECTORS_PER_MULTI_PAGE - 1)) << SECTOR_SHIFT;
+ struct bio_vec bv = multi_pages_bio_iter_iovec(bio, iter);
+
+ bv.bv_len = min_t(u32, bv.bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
+
+ atomic64_add(1, &zram->stats.zram_bio_write_multi_pages_count);
+ if (zram_bvec_write_multi_pages(zram, &bv, index, offset, bio) < 0) {
+ atomic64_inc(&zram->stats.multi_pages_failed_writes);
+ bio->bi_status = BLK_STS_IOERR;
+ break;
+ }
+
+ zram_slot_lock(zram, index);
+ zram_accessed(zram, index);
+ zram_slot_unlock(zram, index);
+
+ bio_advance_iter_single(bio, &iter, bv.bv_len);
+ } while (iter.bi_size);
+
+ bio_end_io_acct(bio, start_time);
+ bio_endio(bio);
+}
+#else
+static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+ return false;
+}
+
+static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+ return false;
+}
+static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio) {}
+static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio) {}
+#endif
+
+static void zram_bio_read(struct zram *zram, struct bio *bio)
+{
+ if (test_multi_pages_comp(zram, bio))
+ zram_bio_read_multi_pages(zram, bio);
+ else
+ zram_bio_read_page(zram, bio);
+}
+
+static void zram_bio_write(struct zram *zram, struct bio *bio)
+{
+ if (want_multi_pages_comp(zram, bio))
+ zram_bio_write_multi_pages(zram, bio);
+ else
+ zram_bio_write_page(zram, bio);
+}
+
/*
* Handler function for all zram I/O requests.
*/
@@ -2276,6 +2645,25 @@ static void zram_slot_free_notify(struct block_device *bdev,
return;
}
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ int comp_count = __test_multi_pages_comp(zram, index);
+
+ if (comp_count > 1) {
+ zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+ zram_slot_unlock(zram, index);
+ return;
+ } else if (comp_count == 1) {
+ zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+ zram_slot_unlock(zram, index);
+ /* only the head index holds the compressed object, so only it needs to be freed */
+ index &= ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+ if (!zram_slot_trylock(zram, index)) {
+ atomic64_inc(&zram->stats.multi_pages_miss_free);
+ return;
+ }
+ }
+#endif
+
zram_free_page(zram, index);
zram_slot_unlock(zram, index);
}
@@ -2493,6 +2881,9 @@ static struct attribute *zram_disk_attrs[] = {
#endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ &dev_attr_multi_pages_debug_stat.attr,
+#endif
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_bd_stat.attr,
#endif
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 134be414e210..ac4eb4f39cb7 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -28,6 +28,10 @@
#define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define SECTORS_PER_MULTI_PAGE_SHIFT (MULTI_PAGE_SHIFT - SECTOR_SHIFT)
+#define SECTORS_PER_MULTI_PAGE (1 << SECTORS_PER_MULTI_PAGE_SHIFT)
+#endif
/*
* ZRAM is mainly used for memory efficiency so we want to keep memory
@@ -38,7 +42,15 @@
*
* We use BUILD_BUG_ON() to make sure that zram pageflags don't overflow.
*/
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define ZRAM_FLAG_SHIFT (PAGE_SHIFT + \
+ CONFIG_ZSMALLOC_MULTI_PAGES_ORDER + 1)
+#else
#define ZRAM_FLAG_SHIFT (PAGE_SHIFT + 1)
+#endif
+
+#define ENABLE_HUGEPAGE_ZRAM_DEBUG 1
/* Only 2 bits are allowed for comp priority index */
#define ZRAM_COMP_PRIORITY_MASK 0x3
@@ -55,6 +67,10 @@ enum zram_pageflags {
ZRAM_COMP_PRIORITY_BIT1, /* First bit of comp priority index */
ZRAM_COMP_PRIORITY_BIT2, /* Second bit of comp priority index */
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ ZRAM_COMP_MULTI_PAGES, /* Compressed by multi-pages */
+#endif
+
__NR_ZRAM_PAGEFLAGS,
};
@@ -90,6 +106,16 @@ struct zram_stats {
atomic64_t bd_reads; /* no. of reads from backing device */
atomic64_t bd_writes; /* no. of writes from backing device */
#endif
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ atomic64_t zram_bio_write_multi_pages_count;
+ atomic64_t zram_bio_read_multi_pages_count;
+ atomic64_t multi_pages_failed_writes;
+ atomic64_t multi_pages_failed_reads;
+ atomic64_t zram_bio_write_multi_pages_partial_count;
+ atomic64_t zram_bio_read_multi_pages_partial_count;
+ atomic64_t multi_pages_miss_free;
+#endif
};
#ifdef CONFIG_ZRAM_MULTI_COMP
@@ -141,4 +167,23 @@ struct zram {
#endif
atomic_t pp_in_progress;
};
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define multi_pages_bvec_iter_offset(bvec, iter) \
+ (mp_bvec_iter_offset((bvec), (iter)) % ZCOMP_MULTI_PAGES_SIZE)
+
+#define multi_pages_bvec_iter_len(bvec, iter) \
+ min_t(unsigned int, mp_bvec_iter_len((bvec), (iter)), \
+ ZCOMP_MULTI_PAGES_SIZE - bvec_iter_offset((bvec), (iter)))
+
+#define multi_pages_bvec_iter_bvec(bvec, iter) \
+((struct bio_vec) { \
+ .bv_page = bvec_iter_page((bvec), (iter)), \
+ .bv_len = multi_pages_bvec_iter_len((bvec), (iter)), \
+ .bv_offset = multi_pages_bvec_iter_offset((bvec), (iter)), \
+})
+
+#define multi_pages_bio_iter_iovec(bio, iter) \
+ multi_pages_bvec_iter_bvec((bio)->bi_io_vec, (iter))
+#endif
#endif
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC v3 3/4] zram: backend_zstd: Adjust estimated_src_size to accommodate multi-page compression
2024-11-21 22:25 [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 2/4] zram: support compression at the granularity of multi-pages Barry Song
@ 2024-11-21 22:25 ` Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails Barry Song
2024-11-26 5:09 ` [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Sergey Senozhatsky
4 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2024-11-21 22:25 UTC (permalink / raw)
To: akpm, linux-mm
Cc: axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
From: Barry Song <v-songbaohua@oppo.com>
If we continue using PAGE_SIZE as the estimated_src_size, we won't
benefit from the reduced CPU usage and improved compression ratio
brought by larger block compression.
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
drivers/block/zram/backend_zstd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c
index 1184c0036f44..e126615eeff2 100644
--- a/drivers/block/zram/backend_zstd.c
+++ b/drivers/block/zram/backend_zstd.c
@@ -70,12 +70,12 @@ static int zstd_setup_params(struct zcomp_params *params)
if (params->level == ZCOMP_PARAM_NO_LEVEL)
params->level = zstd_default_clevel();
- zp->cprm = zstd_get_params(params->level, PAGE_SIZE);
+ zp->cprm = zstd_get_params(params->level, ZCOMP_MULTI_PAGES_SIZE);
zp->custom_mem.customAlloc = zstd_custom_alloc;
zp->custom_mem.customFree = zstd_custom_free;
- prm = zstd_get_cparams(params->level, PAGE_SIZE,
+ prm = zstd_get_cparams(params->level, ZCOMP_MULTI_PAGES_SIZE,
params->dict_sz);
zp->cdict = zstd_create_cdict_byreference(params->dict,
@@ -137,7 +137,7 @@ static int zstd_create(struct zcomp_params *params, struct zcomp_ctx *ctx)
ctx->context = zctx;
if (params->dict_sz == 0) {
- prm = zstd_get_params(params->level, PAGE_SIZE);
+ prm = zstd_get_params(params->level, ZCOMP_MULTI_PAGES_SIZE);
sz = zstd_cctx_workspace_bound(&prm.cParams);
zctx->cctx_mem = vzalloc(sz);
if (!zctx->cctx_mem)
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails
2024-11-21 22:25 [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
` (2 preceding siblings ...)
2024-11-21 22:25 ` [PATCH RFC v3 3/4] zram: backend_zstd: Adjust estimated_src_size to accommodate multi-page compression Barry Song
@ 2024-11-21 22:25 ` Barry Song
2024-11-22 14:54 ` Usama Arif
2024-11-26 5:09 ` [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Sergey Senozhatsky
4 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-11-21 22:25 UTC (permalink / raw)
To: akpm, linux-mm
Cc: axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming, Chuanhua Han
From: Barry Song <v-songbaohua@oppo.com>
The swapfile can compress/decompress at 4 * PAGES granularity, reducing
CPU usage and improving the compression ratio. However, if allocating an
mTHP fails and we fall back to a single small folio, the entire large
block must still be decompressed. This results in a 16 KiB area requiring
4 page faults, where each fault decompresses 16 KiB but retrieves only
4 KiB of data from the block. To address this inefficiency, we instead
fall back to 4 small folios, ensuring that each decompression occurs
only once.
Allowing swap_read_folio() to decompress and read into an array of
4 folios would be extremely complex, requiring extensive changes
throughout the stack, including swap_read_folio, zeromap,
zswap, and final swap implementations like zRAM. In contrast,
having these components fill a large folio with 4 subpages is much
simpler.
To avoid a full-stack modification, we introduce a per-CPU order-2
large folio as a buffer. This buffer is used for swap_read_folio(),
after which the data is copied into the 4 small folios. Finally, in
do_swap_page(), all these small folios are mapped.
Co-developed-by: Chuanhua Han <chuanhuahan@gmail.com>
Signed-off-by: Chuanhua Han <chuanhuahan@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 192 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f..e551570c1425 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
return folio;
}
+#define BATCH_SWPIN_ORDER 2
+#define BATCH_SWPIN_COUNT (1 << BATCH_SWPIN_ORDER)
+#define BATCH_SWPIN_SIZE (PAGE_SIZE << BATCH_SWPIN_ORDER)
+
+struct batch_swpin_buffer {
+ struct folio *folio;
+ struct mutex mutex;
+};
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
@@ -4120,7 +4129,101 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
return orders;
}
-static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+static DEFINE_PER_CPU(struct batch_swpin_buffer, swp_buf);
+
+static int __init batch_swpin_buffer_init(void)
+{
+ int ret, cpu;
+ struct batch_swpin_buffer *buf;
+
+ for_each_possible_cpu(cpu) {
+ buf = per_cpu_ptr(&swp_buf, cpu);
+ buf->folio = (struct folio *)alloc_pages_node(cpu_to_node(cpu),
+ GFP_KERNEL | __GFP_COMP, BATCH_SWPIN_ORDER);
+ if (!buf->folio) {
+ ret = -ENOMEM;
+ goto err;
+ }
+ mutex_init(&buf->mutex);
+ }
+ return 0;
+
+err:
+ for_each_possible_cpu(cpu) {
+ buf = per_cpu_ptr(&swp_buf, cpu);
+ if (buf->folio) {
+ folio_put(buf->folio);
+ buf->folio = NULL;
+ }
+ }
+ return ret;
+}
+core_initcall(batch_swpin_buffer_init);
+
+static struct folio *alloc_batched_swap_folios(struct vm_fault *vmf,
+ struct batch_swpin_buffer **buf, struct folio **folios,
+ swp_entry_t entry)
+{
+ unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
+ struct batch_swpin_buffer *sbuf = raw_cpu_ptr(&swp_buf);
+ struct folio *folio = sbuf->folio;
+ unsigned long addr;
+ int i;
+
+ if (unlikely(!folio))
+ return NULL;
+
+ for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
+ addr = haddr + i * PAGE_SIZE;
+ folios[i] = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, addr);
+ if (!folios[i])
+ goto err;
+ if (mem_cgroup_swapin_charge_folio(folios[i], vmf->vma->vm_mm,
+ GFP_KERNEL, entry))
+ goto err;
+ }
+
+ mutex_lock(&sbuf->mutex);
+ *buf = sbuf;
+#ifdef CONFIG_MEMCG
+ folio->memcg_data = (*folios)->memcg_data;
+#endif
+ return folio;
+
+err:
+ for (i--; i >= 0; i--)
+ folio_put(folios[i]);
+ return NULL;
+}
+
+static void fill_batched_swap_folios(struct vm_fault *vmf,
+ void *shadow, struct batch_swpin_buffer *buf,
+ struct folio *folio, struct folio **folios)
+{
+ unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
+ unsigned long addr;
+ int i;
+
+ for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
+ addr = haddr + i * PAGE_SIZE;
+ __folio_set_locked(folios[i]);
+ __folio_set_swapbacked(folios[i]);
+ if (shadow)
+ workingset_refault(folios[i], shadow);
+ folio_add_lru(folios[i]);
+ copy_user_highpage(&folios[i]->page, folio_page(folio, i),
+ addr, vmf->vma);
+ if (folio_test_uptodate(folio))
+ folio_mark_uptodate(folios[i]);
+ }
+
+ folio->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~(1UL << PG_head));
+ mutex_unlock(&buf->mutex);
+}
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf,
+ struct batch_swpin_buffer **buf,
+ struct folio **folios)
{
struct vm_area_struct *vma = vmf->vma;
unsigned long orders;
@@ -4180,6 +4283,9 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
pte_unmap_unlock(pte, ptl);
+ if (!orders)
+ goto fallback;
+
/* Try allocating the highest of the remaining orders. */
gfp = vma_thp_gfp_mask(vma);
while (orders) {
@@ -4194,14 +4300,29 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
order = next_order(&orders, order);
}
+ /*
+ * During swap-out, a THP might have been compressed into multiple
+ * order-2 blocks to optimize CPU usage and compression ratio.
+ * Attempt to batch swap-in 4 smaller folios to ensure they are
+ * decompressed together as a single unit only once.
+ */
+ return alloc_batched_swap_folios(vmf, buf, folios, entry);
+
fallback:
return __alloc_swap_folio(vmf);
}
#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
-static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+static struct folio *alloc_swap_folio(struct vm_fault *vmf,
+ struct batch_swpin_buffer **buf,
+ struct folio **folios)
{
return __alloc_swap_folio(vmf);
}
+static inline void fill_batched_swap_folios(struct vm_fault *vmf,
+ void *shadow, struct batch_swpin_buffer *buf,
+ struct folio *folio, struct folio **folios)
+{
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
@@ -4216,6 +4337,8 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
*/
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
+ struct folio *folios[BATCH_SWPIN_COUNT] = { NULL };
+ struct batch_swpin_buffer *buf = NULL;
struct vm_area_struct *vma = vmf->vma;
struct folio *swapcache, *folio = NULL;
DECLARE_WAITQUEUE(wait, current);
@@ -4228,7 +4351,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
- int nr_pages;
+ int nr_pages, i;
unsigned long page_idx;
unsigned long address;
pte_t *ptep;
@@ -4296,7 +4419,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
- folio = alloc_swap_folio(vmf);
+ folio = alloc_swap_folio(vmf, &buf, folios);
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
@@ -4327,10 +4450,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
shadow = get_shadow_from_swap_cache(entry);
- if (shadow)
+ if (shadow && !buf)
workingset_refault(folio, shadow);
-
- folio_add_lru(folio);
+ if (!buf)
+ folio_add_lru(folio);
/* To provide entry to swap_read_folio() */
folio->swap = entry;
@@ -4361,6 +4484,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
page = folio_file_page(folio, swp_offset(entry));
+ /*
+ * Copy data into batched small folios from the large
+ * folio buffer
+ */
+ if (buf) {
+ fill_batched_swap_folios(vmf, shadow, buf, folio, folios);
+ folio = folios[0];
+ page = &folios[0]->page;
+ goto do_map;
+ }
} else if (PageHWPoison(page)) {
/*
* hwpoisoned dirty swapcache pages are kept for killing
@@ -4415,6 +4548,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
lru_add_drain();
}
+do_map:
folio_throttle_swaprate(folio, GFP_KERNEL);
/*
@@ -4431,8 +4565,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
/* allocated large folios for SWP_SYNCHRONOUS_IO */
- if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
- unsigned long nr = folio_nr_pages(folio);
+ if ((folio_test_large(folio) || buf) && !folio_test_swapcache(folio)) {
+ unsigned long nr = buf ? BATCH_SWPIN_COUNT : folio_nr_pages(folio);
unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
pte_t *folio_ptep = vmf->pte - idx;
@@ -4527,6 +4661,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
}
+ /* Batched mapping of allocated small folios for SWP_SYNCHRONOUS_IO */
+ if (buf) {
+ for (i = 0; i < nr_pages; i++)
+ arch_swap_restore(swp_entry(swp_type(entry),
+ swp_offset(entry) + i), folios[i]);
+ swap_free_nr(entry, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+ rmap_flags |= RMAP_EXCLUSIVE;
+ for (i = 0; i < nr_pages; i++) {
+ unsigned long addr = address + i * PAGE_SIZE;
+
+ pte = mk_pte(&folios[i]->page, vma->vm_page_prot);
+ if (pte_swp_soft_dirty(vmf->orig_pte))
+ pte = pte_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(vmf->orig_pte))
+ pte = pte_mkuffd_wp(pte);
+ if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
+ !pte_needs_soft_dirty_wp(vma, pte)) {
+ pte = pte_mkwrite(pte, vma);
+ if ((vmf->flags & FAULT_FLAG_WRITE) && (i == page_idx)) {
+ pte = pte_mkdirty(pte);
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+ }
+ }
+ flush_icache_page(vma, &folios[i]->page);
+ folio_add_new_anon_rmap(folios[i], vma, addr, rmap_flags);
+ set_pte_at(vma->vm_mm, addr, ptep + i, pte);
+ arch_do_swap_page_nr(vma->vm_mm, vma, addr, pte, pte, 1);
+ if (i == page_idx)
+ vmf->orig_pte = pte;
+ folio_unlock(folios[i]);
+ }
+ goto wp_page;
+ }
+
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
@@ -4612,6 +4782,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_put(swapcache);
}
+wp_page:
if (vmf->flags & FAULT_FLAG_WRITE) {
ret |= do_wp_page(vmf);
if (ret & VM_FAULT_ERROR)
@@ -4638,9 +4809,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
- folio_unlock(folio);
+ if (!buf) {
+ folio_unlock(folio);
+ } else {
+ for (i = 0; i < BATCH_SWPIN_COUNT; i++)
+ folio_unlock(folios[i]);
+ }
out_release:
- folio_put(folio);
+ if (!buf) {
+ folio_put(folio);
+ } else {
+ for (i = 0; i < BATCH_SWPIN_COUNT; i++)
+ folio_put(folios[i]);
+ }
if (folio != swapcache && swapcache) {
folio_unlock(swapcache);
folio_put(swapcache);
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails
2024-11-21 22:25 ` [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails Barry Song
@ 2024-11-22 14:54 ` Usama Arif
2024-11-24 21:47 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Usama Arif @ 2024-11-22 14:54 UTC (permalink / raw)
To: Barry Song, akpm, linux-mm
Cc: axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming, Chuanhua Han
On 21/11/2024 22:25, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The swapfile can compress/decompress at 4 * PAGES granularity, reducing
> CPU usage and improving the compression ratio. However, if allocating an
> mTHP fails and we fall back to a single small folio, the entire large
> block must still be decompressed. This results in a 16 KiB area requiring
> 4 page faults, where each fault decompresses 16 KiB but retrieves only
> 4 KiB of data from the block. To address this inefficiency, we instead
> fall back to 4 small folios, ensuring that each decompression occurs
> only once.
>
> Allowing swap_read_folio() to decompress and read into an array of
> 4 folios would be extremely complex, requiring extensive changes
> throughout the stack, including swap_read_folio, zeromap,
> zswap, and final swap implementations like zRAM. In contrast,
> having these components fill a large folio with 4 subpages is much
> simpler.
>
> To avoid a full-stack modification, we introduce a per-CPU order-2
> large folio as a buffer. This buffer is used for swap_read_folio(),
> after which the data is copied into the 4 small folios. Finally, in
> do_swap_page(), all these small folios are mapped.
>
> Co-developed-by: Chuanhua Han <chuanhuahan@gmail.com>
> Signed-off-by: Chuanhua Han <chuanhuahan@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 192 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 209885a4134f..e551570c1425 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> return folio;
> }
>
> +#define BATCH_SWPIN_ORDER 2
Hi Barry,
Thanks for the series and the numbers in the cover letter.
Just a few things.
Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?
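i.e. presumably something along these lines (just a sketch, assuming the
Kconfig symbol added earlier in this series is visible from mm/memory.c):

#define BATCH_SWPIN_ORDER	CONFIG_ZSMALLOC_MULTI_PAGES_ORDER

so that the batched swap-in granularity automatically follows whatever order
zsmalloc/zram compress at, instead of hard-coding 2.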
Did you check the performance difference with and without patch 4?
I know that it won't help if you have a lot of unmovable pages
scattered everywhere, but were you able to compare the performance
of defrag=always vs patch 4? I feel like if you have space for 4 folios
then hopefully compaction should be able to do its job and you can
directly fill the large folio if the unmovable pages are better placed.
Johannes' series on preventing type mixing [1] would help.
[1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
Thanks,
Usama
> +#define BATCH_SWPIN_COUNT (1 << BATCH_SWPIN_ORDER)
> +#define BATCH_SWPIN_SIZE (PAGE_SIZE << BATCH_SWPIN_ORDER)
> +
> +struct batch_swpin_buffer {
> + struct folio *folio;
> + struct mutex mutex;
> +};
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> {
> @@ -4120,7 +4129,101 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> return orders;
> }
>
> -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +static DEFINE_PER_CPU(struct batch_swpin_buffer, swp_buf);
> +
> +static int __init batch_swpin_buffer_init(void)
> +{
> + int ret, cpu;
> + struct batch_swpin_buffer *buf;
> +
> + for_each_possible_cpu(cpu) {
> + buf = per_cpu_ptr(&swp_buf, cpu);
> + buf->folio = (struct folio *)alloc_pages_node(cpu_to_node(cpu),
> + GFP_KERNEL | __GFP_COMP, BATCH_SWPIN_ORDER);
> + if (!buf->folio) {
> + ret = -ENOMEM;
> + goto err;
> + }
> + mutex_init(&buf->mutex);
> + }
> + return 0;
> +
> +err:
> + for_each_possible_cpu(cpu) {
> + buf = per_cpu_ptr(&swp_buf, cpu);
> + if (buf->folio) {
> + folio_put(buf->folio);
> + buf->folio = NULL;
> + }
> + }
> + return ret;
> +}
> +core_initcall(batch_swpin_buffer_init);
> +
> +static struct folio *alloc_batched_swap_folios(struct vm_fault *vmf,
> + struct batch_swpin_buffer **buf, struct folio **folios,
> + swp_entry_t entry)
> +{
> + unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> + struct batch_swpin_buffer *sbuf = raw_cpu_ptr(&swp_buf);
> + struct folio *folio = sbuf->folio;
> + unsigned long addr;
> + int i;
> +
> + if (unlikely(!folio))
> + return NULL;
> +
> + for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> + addr = haddr + i * PAGE_SIZE;
> + folios[i] = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, addr);
> + if (!folios[i])
> + goto err;
> + if (mem_cgroup_swapin_charge_folio(folios[i], vmf->vma->vm_mm,
> + GFP_KERNEL, entry))
> + goto err;
> + }
> +
> + mutex_lock(&sbuf->mutex);
> + *buf = sbuf;
> +#ifdef CONFIG_MEMCG
> + folio->memcg_data = (*folios)->memcg_data;
> +#endif
> + return folio;
> +
> +err:
> + for (i--; i >= 0; i--)
> + folio_put(folios[i]);
> + return NULL;
> +}
> +
> +static void fill_batched_swap_folios(struct vm_fault *vmf,
> + void *shadow, struct batch_swpin_buffer *buf,
> + struct folio *folio, struct folio **folios)
> +{
> + unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> + unsigned long addr;
> + int i;
> +
> + for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> + addr = haddr + i * PAGE_SIZE;
> + __folio_set_locked(folios[i]);
> + __folio_set_swapbacked(folios[i]);
> + if (shadow)
> + workingset_refault(folios[i], shadow);
> + folio_add_lru(folios[i]);
> + copy_user_highpage(&folios[i]->page, folio_page(folio, i),
> + addr, vmf->vma);
> + if (folio_test_uptodate(folio))
> + folio_mark_uptodate(folios[i]);
> + }
> +
> + folio->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~(1UL << PG_head));
> + mutex_unlock(&buf->mutex);
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> + struct batch_swpin_buffer **buf,
> + struct folio **folios)
> {
> struct vm_area_struct *vma = vmf->vma;
> unsigned long orders;
> @@ -4180,6 +4283,9 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>
> pte_unmap_unlock(pte, ptl);
>
> + if (!orders)
> + goto fallback;
> +
> /* Try allocating the highest of the remaining orders. */
> gfp = vma_thp_gfp_mask(vma);
> while (orders) {
> @@ -4194,14 +4300,29 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> order = next_order(&orders, order);
> }
>
> + /*
> + * During swap-out, a THP might have been compressed into multiple
> + * order-2 blocks to optimize CPU usage and compression ratio.
> + * Attempt to batch swap-in 4 smaller folios to ensure they are
> + * decompressed together as a single unit only once.
> + */
> + return alloc_batched_swap_folios(vmf, buf, folios, entry);
> +
> fallback:
> return __alloc_swap_folio(vmf);
> }
> #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> + struct batch_swpin_buffer **buf,
> + struct folio **folios)
> {
> return __alloc_swap_folio(vmf);
> }
> +static inline void fill_batched_swap_folios(struct vm_fault *vmf,
> + void *shadow, struct batch_swpin_buffer *buf,
> + struct folio *folio, struct folio **folios)
> +{
> +}
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> @@ -4216,6 +4337,8 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> */
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
> + struct folio *folios[BATCH_SWPIN_COUNT] = { NULL };
> + struct batch_swpin_buffer *buf = NULL;
> struct vm_area_struct *vma = vmf->vma;
> struct folio *swapcache, *folio = NULL;
> DECLARE_WAITQUEUE(wait, current);
> @@ -4228,7 +4351,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> - int nr_pages;
> + int nr_pages, i;
> unsigned long page_idx;
> unsigned long address;
> pte_t *ptep;
> @@ -4296,7 +4419,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> /* skip swapcache */
> - folio = alloc_swap_folio(vmf);
> + folio = alloc_swap_folio(vmf, &buf, folios);
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
> @@ -4327,10 +4450,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>
> shadow = get_shadow_from_swap_cache(entry);
> - if (shadow)
> + if (shadow && !buf)
> workingset_refault(folio, shadow);
> -
> - folio_add_lru(folio);
> + if (!buf)
> + folio_add_lru(folio);
>
> /* To provide entry to swap_read_folio() */
> folio->swap = entry;
> @@ -4361,6 +4484,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> page = folio_file_page(folio, swp_offset(entry));
> + /*
> + * Copy data into batched small folios from the large
> + * folio buffer
> + */
> + if (buf) {
> + fill_batched_swap_folios(vmf, shadow, buf, folio, folios);
> + folio = folios[0];
> + page = &folios[0]->page;
> + goto do_map;
> + }
> } else if (PageHWPoison(page)) {
> /*
> * hwpoisoned dirty swapcache pages are kept for killing
> @@ -4415,6 +4548,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> lru_add_drain();
> }
>
> +do_map:
> folio_throttle_swaprate(folio, GFP_KERNEL);
>
> /*
> @@ -4431,8 +4565,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* allocated large folios for SWP_SYNCHRONOUS_IO */
> - if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> - unsigned long nr = folio_nr_pages(folio);
> + if ((folio_test_large(folio) || buf) && !folio_test_swapcache(folio)) {
> + unsigned long nr = buf ? BATCH_SWPIN_COUNT : folio_nr_pages(folio);
> unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> pte_t *folio_ptep = vmf->pte - idx;
> @@ -4527,6 +4661,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
> }
>
> + /* Batched mapping of allocated small folios for SWP_SYNCHRONOUS_IO */
> + if (buf) {
> + for (i = 0; i < nr_pages; i++)
> + arch_swap_restore(swp_entry(swp_type(entry),
> + swp_offset(entry) + i), folios[i]);
> + swap_free_nr(entry, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> + rmap_flags |= RMAP_EXCLUSIVE;
> + for (i = 0; i < nr_pages; i++) {
> + unsigned long addr = address + i * PAGE_SIZE;
> +
> + pte = mk_pte(&folios[i]->page, vma->vm_page_prot);
> + if (pte_swp_soft_dirty(vmf->orig_pte))
> + pte = pte_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(vmf->orig_pte))
> + pte = pte_mkuffd_wp(pte);
> + if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
> + !pte_needs_soft_dirty_wp(vma, pte)) {
> + pte = pte_mkwrite(pte, vma);
> + if ((vmf->flags & FAULT_FLAG_WRITE) && (i == page_idx)) {
> + pte = pte_mkdirty(pte);
> + vmf->flags &= ~FAULT_FLAG_WRITE;
> + }
> + }
> + flush_icache_page(vma, &folios[i]->page);
> + folio_add_new_anon_rmap(folios[i], vma, addr, rmap_flags);
> + set_pte_at(vma->vm_mm, addr, ptep + i, pte);
> + arch_do_swap_page_nr(vma->vm_mm, vma, addr, pte, pte, 1);
> + if (i == page_idx)
> + vmf->orig_pte = pte;
> + folio_unlock(folios[i]);
> + }
> + goto wp_page;
> + }
> +
> /*
> * Some architectures may have to restore extra metadata to the page
> * when reading from swap. This metadata may be indexed by swap entry
> @@ -4612,6 +4782,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_put(swapcache);
> }
>
> +wp_page:
> if (vmf->flags & FAULT_FLAG_WRITE) {
> ret |= do_wp_page(vmf);
> if (ret & VM_FAULT_ERROR)
> @@ -4638,9 +4809,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out_page:
> - folio_unlock(folio);
> + if (!buf) {
> + folio_unlock(folio);
> + } else {
> + for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> + folio_unlock(folios[i]);
> + }
> out_release:
> - folio_put(folio);
> + if (!buf) {
> + folio_put(folio);
> + } else {
> + for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> + folio_put(folios[i]);
> + }
> if (folio != swapcache && swapcache) {
> folio_unlock(swapcache);
> folio_put(swapcache);
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails
2024-11-22 14:54 ` Usama Arif
@ 2024-11-24 21:47 ` Barry Song
2024-11-25 16:19 ` Usama Arif
0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-11-24 21:47 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming, Chuanhua Han
On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 21/11/2024 22:25, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > The swapfile can compress/decompress at 4 * PAGES granularity, reducing
> > CPU usage and improving the compression ratio. However, if allocating an
> > mTHP fails and we fall back to a single small folio, the entire large
> > block must still be decompressed. This results in a 16 KiB area requiring
> > 4 page faults, where each fault decompresses 16 KiB but retrieves only
> > 4 KiB of data from the block. To address this inefficiency, we instead
> > fall back to 4 small folios, ensuring that each decompression occurs
> > only once.
> >
> > Allowing swap_read_folio() to decompress and read into an array of
> > 4 folios would be extremely complex, requiring extensive changes
> > throughout the stack, including swap_read_folio, zeromap,
> > zswap, and final swap implementations like zRAM. In contrast,
> > having these components fill a large folio with 4 subpages is much
> > simpler.
> >
> > To avoid a full-stack modification, we introduce a per-CPU order-2
> > large folio as a buffer. This buffer is used for swap_read_folio(),
> > after which the data is copied into the 4 small folios. Finally, in
> > do_swap_page(), all these small folios are mapped.
> >
> > Co-developed-by: Chuanhua Han <chuanhuahan@gmail.com>
> > Signed-off-by: Chuanhua Han <chuanhuahan@gmail.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> > mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 192 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 209885a4134f..e551570c1425 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> > return folio;
> > }
> >
> > +#define BATCH_SWPIN_ORDER 2
>
> Hi Barry,
>
> Thanks for the series and the numbers in the cover letter.
>
> Just a few things.
>
> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?
Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
and always setting it to 2, which is the minimum anonymous mTHP order. The main
reason is that it may be difficult for users to select an appropriate value
for this Kconfig option.
On the other hand, 16KB already captures most of the benefit of compressing
and decompressing larger blocks with zstd. Increasing from 16KB to 32KB or
64KB offers additional gains, but the improvement is not as significant as
the jump from 4KB to 16KB.
When I use zstd to compress and decompress the 'Beyond Compare' software
package at different block sizes, I get:
root@barry-desktop:~# ./zstd
File size: 182502912 bytes
4KB Block: Compression time = 0.765915 seconds, Decompression time = 0.203366 seconds
Original size: 182502912 bytes
Compressed size: 66089193 bytes
Compression ratio: 36.21%
16KB Block: Compression time = 0.558595 seconds, Decompression time = 0.153837 seconds
Original size: 182502912 bytes
Compressed size: 59159073 bytes
Compression ratio: 32.42%
32KB Block: Compression time = 0.538106 seconds, Decompression time = 0.137768 seconds
Original size: 182502912 bytes
Compressed size: 57958701 bytes
Compression ratio: 31.76%
64KB Block: Compression time = 0.532212 seconds, Decompression time = 0.127592 seconds
Original size: 182502912 bytes
Compressed size: 56700795 bytes
Compression ratio: 31.07%
In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
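(For reference: the ./zstd binary above is a local test program, not part of
this series. A minimal user-space sketch of such a block-size benchmark,
assuming libzstd's one-shot API and timing compression plus decompression
together for brevity, could look like the following; the file name, level
and block sizes are only illustrative.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zstd.h>

static void bench(const unsigned char *data, size_t size, size_t block, int level)
{
	size_t bound = ZSTD_compressBound(block);
	void *cbuf = malloc(bound);
	void *dbuf = malloc(block);
	size_t off, total = 0;
	clock_t t0 = clock();

	if (!cbuf || !dbuf)
		return;

	for (off = 0; off < size; off += block) {
		size_t n = size - off < block ? size - off : block;
		size_t c = ZSTD_compress(cbuf, bound, data + off, n, level);

		if (ZSTD_isError(c))
			break;
		total += c;
		/* decompress right away so both directions are exercised */
		ZSTD_decompress(dbuf, block, cbuf, c);
	}
	printf("%zuKB block: %zu -> %zu bytes (%.2f%%), %.3f s\n",
	       block >> 10, size, total, 100.0 * total / size,
	       (double)(clock() - t0) / CLOCKS_PER_SEC);
	free(cbuf);
	free(dbuf);
}

int main(int argc, char **argv)
{
	static const size_t blocks[] = { 4096, 16384, 32768, 65536 };
	FILE *f = fopen(argc > 1 ? argv[1] : "testfile.bin", "rb");
	unsigned char *data;
	long size;
	size_t i;

	if (!f)
		return 1;
	fseek(f, 0, SEEK_END);
	size = ftell(f);
	rewind(f);
	data = malloc(size);
	if (!data || fread(data, 1, size, f) != (size_t)size)
		return 1;
	fclose(f);

	for (i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++)
		bench(data, size, blocks[i], 3);	/* zstd level 3 */
	free(data);
	return 0;
}

(Built with something like "cc -O2 bench.c -o zstd -lzstd" and run against
any large test file.)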
>
> Did you check the performance difference with and without patch 4?
I retested after reverting patch 4, and the sys time increased to over
40 minutes again, though it was slightly better than without the entire series.
*** Executing round 1 ***
real 7m49.342s
user 80m53.675s
sys 42m28.393s
pswpin: 29965548
pswpout: 51127359
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11347712
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6641230
pgpgin: 147376000
pgpgout: 213343124
*** Executing round 2 ***
real 7m41.331s
user 81m16.631s
sys 41m39.845s
pswpin: 29208867
pswpout: 50006026
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11104912
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6483827
pgpgin: 144057340
pgpgout: 208887688
*** Executing round 3 ***
real 7m47.280s
user 78m36.767s
sys 37m32.210s
pswpin: 26426526
pswpout: 45420734
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10104304
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 5884839
pgpgin: 132013648
pgpgout: 190537264
*** Executing round 4 ***
real 7m56.723s
user 80m36.837s
sys 41m35.979s
pswpin: 29367639
pswpout: 50059254
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 11116176
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6514064
pgpgin: 144593828
pgpgout: 209080468
*** Executing round 5 ***
real 7m53.806s
user 80m30.953s
sys 40m14.870s
pswpin: 28091760
pswpout: 48495748
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 10779720
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 6244819
pgpgin: 138813124
pgpgout: 202885480
I guess it is due to the occurrence of numerous partial reads
(about 10%, 3505537/35159852).
root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
zram_bio write/read multi_pages count:54452828 35159852
zram_bio failed write/read multi_pages count 0 0
zram_bio partial write/read multi_pages count 4 3505537
multi_pages_miss_free 0
This workload doesn't cause fragmentation in the buddy allocator, so it’s
likely due to the failure of MEMCG_CHARGE.
>
> > I know that it won't help if you have a lot of unmovable pages
> scattered everywhere, but were you able to compare the performance
> of defrag=always vs patch 4? I feel like if you have space for 4 folios
> then hopefully compaction should be able to do its job and you can
> directly fill the large folio if the unmovable pages are better placed.
> Johannes' series on preventing type mixing [1] would help.
>
> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
I believe this could help, but defragmentation is a complex issue, especially
on phones, where various components such as drivers, DMA-BUF, multimedia, and
graphics all consume memory.
We observed that a fresh system could initially provide mTHP, but after a few
hours, obtaining mTHP became very challenging. I'm happy to arrange a test
of Johannes' series on phones (sometimes it is quite hard to backport to the
Android kernel) to see if it brings any improvements.
>
> Thanks,
> Usama
>
> > +#define BATCH_SWPIN_COUNT (1 << BATCH_SWPIN_ORDER)
> > +#define BATCH_SWPIN_SIZE (PAGE_SIZE << BATCH_SWPIN_ORDER)
> > +
> > +struct batch_swpin_buffer {
> > + struct folio *folio;
> > + struct mutex mutex;
> > +};
> > +
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > {
> > @@ -4120,7 +4129,101 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> > return orders;
> > }
> >
> > -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +static DEFINE_PER_CPU(struct batch_swpin_buffer, swp_buf);
> > +
> > +static int __init batch_swpin_buffer_init(void)
> > +{
> > + int ret, cpu;
> > + struct batch_swpin_buffer *buf;
> > +
> > + for_each_possible_cpu(cpu) {
> > + buf = per_cpu_ptr(&swp_buf, cpu);
> > + buf->folio = (struct folio *)alloc_pages_node(cpu_to_node(cpu),
> > + GFP_KERNEL | __GFP_COMP, BATCH_SWPIN_ORDER);
> > + if (!buf->folio) {
> > + ret = -ENOMEM;
> > + goto err;
> > + }
> > + mutex_init(&buf->mutex);
> > + }
> > + return 0;
> > +
> > +err:
> > + for_each_possible_cpu(cpu) {
> > + buf = per_cpu_ptr(&swp_buf, cpu);
> > + if (buf->folio) {
> > + folio_put(buf->folio);
> > + buf->folio = NULL;
> > + }
> > + }
> > + return ret;
> > +}
> > +core_initcall(batch_swpin_buffer_init);
> > +
> > +static struct folio *alloc_batched_swap_folios(struct vm_fault *vmf,
> > + struct batch_swpin_buffer **buf, struct folio **folios,
> > + swp_entry_t entry)
> > +{
> > + unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> > + struct batch_swpin_buffer *sbuf = raw_cpu_ptr(&swp_buf);
> > + struct folio *folio = sbuf->folio;
> > + unsigned long addr;
> > + int i;
> > +
> > + if (unlikely(!folio))
> > + return NULL;
> > +
> > + for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> > + addr = haddr + i * PAGE_SIZE;
> > + folios[i] = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, addr);
> > + if (!folios[i])
> > + goto err;
> > + if (mem_cgroup_swapin_charge_folio(folios[i], vmf->vma->vm_mm,
> > + GFP_KERNEL, entry))
> > + goto err;
> > + }
> > +
> > + mutex_lock(&sbuf->mutex);
> > + *buf = sbuf;
> > +#ifdef CONFIG_MEMCG
> > + folio->memcg_data = (*folios)->memcg_data;
> > +#endif
> > + return folio;
> > +
> > +err:
> > + for (i--; i >= 0; i--)
> > + folio_put(folios[i]);
> > + return NULL;
> > +}
> > +
> > +static void fill_batched_swap_folios(struct vm_fault *vmf,
> > + void *shadow, struct batch_swpin_buffer *buf,
> > + struct folio *folio, struct folio **folios)
> > +{
> > + unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
> > + unsigned long addr;
> > + int i;
> > +
> > + for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
> > + addr = haddr + i * PAGE_SIZE;
> > + __folio_set_locked(folios[i]);
> > + __folio_set_swapbacked(folios[i]);
> > + if (shadow)
> > + workingset_refault(folios[i], shadow);
> > + folio_add_lru(folios[i]);
> > + copy_user_highpage(&folios[i]->page, folio_page(folio, i),
> > + addr, vmf->vma);
> > + if (folio_test_uptodate(folio))
> > + folio_mark_uptodate(folios[i]);
> > + }
> > +
> > + folio->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~(1UL << PG_head));
> > + mutex_unlock(&buf->mutex);
> > +}
> > +
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> > + struct batch_swpin_buffer **buf,
> > + struct folio **folios)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > unsigned long orders;
> > @@ -4180,6 +4283,9 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >
> > pte_unmap_unlock(pte, ptl);
> >
> > + if (!orders)
> > + goto fallback;
> > +
> > /* Try allocating the highest of the remaining orders. */
> > gfp = vma_thp_gfp_mask(vma);
> > while (orders) {
> > @@ -4194,14 +4300,29 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > order = next_order(&orders, order);
> > }
> >
> > + /*
> > + * During swap-out, a THP might have been compressed into multiple
> > + * order-2 blocks to optimize CPU usage and compression ratio.
> > + * Attempt to batch swap-in 4 smaller folios to ensure they are
> > + * decompressed together as a single unit only once.
> > + */
> > + return alloc_batched_swap_folios(vmf, buf, folios, entry);
> > +
> > fallback:
> > return __alloc_swap_folio(vmf);
> > }
> > #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> > -static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf,
> > + struct batch_swpin_buffer **buf,
> > + struct folio **folios)
> > {
> > return __alloc_swap_folio(vmf);
> > }
> > +static inline void fill_batched_swap_folios(struct vm_fault *vmf,
> > + void *shadow, struct batch_swpin_buffer *buf,
> > + struct folio *folio, struct folio **folios)
> > +{
> > +}
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > @@ -4216,6 +4337,8 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > */
> > vm_fault_t do_swap_page(struct vm_fault *vmf)
> > {
> > + struct folio *folios[BATCH_SWPIN_COUNT] = { NULL };
> > + struct batch_swpin_buffer *buf = NULL;
> > struct vm_area_struct *vma = vmf->vma;
> > struct folio *swapcache, *folio = NULL;
> > DECLARE_WAITQUEUE(wait, current);
> > @@ -4228,7 +4351,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > - int nr_pages;
> > + int nr_pages, i;
> > unsigned long page_idx;
> > unsigned long address;
> > pte_t *ptep;
> > @@ -4296,7 +4419,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > /* skip swapcache */
> > - folio = alloc_swap_folio(vmf);
> > + folio = alloc_swap_folio(vmf, &buf, folios);
> > if (folio) {
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> > @@ -4327,10 +4450,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> >
> > shadow = get_shadow_from_swap_cache(entry);
> > - if (shadow)
> > + if (shadow && !buf)
> > workingset_refault(folio, shadow);
> > -
> > - folio_add_lru(folio);
> > + if (!buf)
> > + folio_add_lru(folio);
> >
> > /* To provide entry to swap_read_folio() */
> > folio->swap = entry;
> > @@ -4361,6 +4484,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > count_vm_event(PGMAJFAULT);
> > count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > page = folio_file_page(folio, swp_offset(entry));
> > + /*
> > + * Copy data into batched small folios from the large
> > + * folio buffer
> > + */
> > + if (buf) {
> > + fill_batched_swap_folios(vmf, shadow, buf, folio, folios);
> > + folio = folios[0];
> > + page = &folios[0]->page;
> > + goto do_map;
> > + }
> > } else if (PageHWPoison(page)) {
> > /*
> > * hwpoisoned dirty swapcache pages are kept for killing
> > @@ -4415,6 +4548,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > lru_add_drain();
> > }
> >
> > +do_map:
> > folio_throttle_swaprate(folio, GFP_KERNEL);
> >
> > /*
> > @@ -4431,8 +4565,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* allocated large folios for SWP_SYNCHRONOUS_IO */
> > - if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> > - unsigned long nr = folio_nr_pages(folio);
> > + if ((folio_test_large(folio) || buf) && !folio_test_swapcache(folio)) {
> > + unsigned long nr = buf ? BATCH_SWPIN_COUNT : folio_nr_pages(folio);
> > unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> > pte_t *folio_ptep = vmf->pte - idx;
> > @@ -4527,6 +4661,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> > }
> >
> > + /* Batched mapping of allocated small folios for SWP_SYNCHRONOUS_IO */
> > + if (buf) {
> > + for (i = 0; i < nr_pages; i++)
> > + arch_swap_restore(swp_entry(swp_type(entry),
> > + swp_offset(entry) + i), folios[i]);
> > + swap_free_nr(entry, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > + rmap_flags |= RMAP_EXCLUSIVE;
> > + for (i = 0; i < nr_pages; i++) {
> > + unsigned long addr = address + i * PAGE_SIZE;
> > +
> > + pte = mk_pte(&folios[i]->page, vma->vm_page_prot);
> > + if (pte_swp_soft_dirty(vmf->orig_pte))
> > + pte = pte_mksoft_dirty(pte);
> > + if (pte_swp_uffd_wp(vmf->orig_pte))
> > + pte = pte_mkuffd_wp(pte);
> > + if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
> > + !pte_needs_soft_dirty_wp(vma, pte)) {
> > + pte = pte_mkwrite(pte, vma);
> > + if ((vmf->flags & FAULT_FLAG_WRITE) && (i == page_idx)) {
> > + pte = pte_mkdirty(pte);
> > + vmf->flags &= ~FAULT_FLAG_WRITE;
> > + }
> > + }
> > + flush_icache_page(vma, &folios[i]->page);
> > + folio_add_new_anon_rmap(folios[i], vma, addr, rmap_flags);
> > + set_pte_at(vma->vm_mm, addr, ptep + i, pte);
> > + arch_do_swap_page_nr(vma->vm_mm, vma, addr, pte, pte, 1);
> > + if (i == page_idx)
> > + vmf->orig_pte = pte;
> > + folio_unlock(folios[i]);
> > + }
> > + goto wp_page;
> > + }
> > +
> > /*
> > * Some architectures may have to restore extra metadata to the page
> > * when reading from swap. This metadata may be indexed by swap entry
> > @@ -4612,6 +4782,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio_put(swapcache);
> > }
> >
> > +wp_page:
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > ret |= do_wp_page(vmf);
> > if (ret & VM_FAULT_ERROR)
> > @@ -4638,9 +4809,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > out_page:
> > - folio_unlock(folio);
> > + if (!buf) {
> > + folio_unlock(folio);
> > + } else {
> > + for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> > + folio_unlock(folios[i]);
> > + }
> > out_release:
> > - folio_put(folio);
> > + if (!buf) {
> > + folio_put(folio);
> > + } else {
> > + for (i = 0; i < BATCH_SWPIN_COUNT; i++)
> > + folio_put(folios[i]);
> > + }
> > if (folio != swapcache && swapcache) {
> > folio_unlock(swapcache);
> > folio_put(swapcache);
>
Thanks
Barry
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails
2024-11-24 21:47 ` Barry Song
@ 2024-11-25 16:19 ` Usama Arif
2024-11-25 18:32 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Usama Arif @ 2024-11-25 16:19 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming, Chuanhua Han
On 24/11/2024 21:47, Barry Song wrote:
> On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 21/11/2024 22:25, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> The swapfile can compress/decompress at 4 * PAGES granularity, reducing
>>> CPU usage and improving the compression ratio. However, if allocating an
>>> mTHP fails and we fall back to a single small folio, the entire large
>>> block must still be decompressed. This results in a 16 KiB area requiring
>>> 4 page faults, where each fault decompresses 16 KiB but retrieves only
>>> 4 KiB of data from the block. To address this inefficiency, we instead
>>> fall back to 4 small folios, ensuring that each decompression occurs
>>> only once.
>>>
>>> Allowing swap_read_folio() to decompress and read into an array of
>>> 4 folios would be extremely complex, requiring extensive changes
>>> throughout the stack, including swap_read_folio, zeromap,
>>> zswap, and final swap implementations like zRAM. In contrast,
>>> having these components fill a large folio with 4 subpages is much
>>> simpler.
>>>
>>> To avoid a full-stack modification, we introduce a per-CPU order-2
>>> large folio as a buffer. This buffer is used for swap_read_folio(),
>>> after which the data is copied into the 4 small folios. Finally, in
>>> do_swap_page(), all these small folios are mapped.
>>>
>>> Co-developed-by: Chuanhua Han <chuanhuahan@gmail.com>
>>> Signed-off-by: Chuanhua Han <chuanhuahan@gmail.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>> mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 192 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 209885a4134f..e551570c1425 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
>>> return folio;
>>> }
>>>
>>> +#define BATCH_SWPIN_ORDER 2
>>
>> Hi Barry,
>>
>> Thanks for the series and the numbers in the cover letter.
>>
>> Just a few things.
>>
>> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?
>
> Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
> and always setting it to 2, which is the minimum anonymous mTHP order. The main
> reason is that it may be difficult for users to select the appropriate Kconfig?
>
> On the other hand, 16KB provides the most advantages for zstd compression and
> decompression with larger blocks. While increasing from 16KB to 32KB or 64KB
> can offer additional benefits, the improvement is not as significant
> as the jump from
> 4KB to 16KB.
>
> As I use zstd to compress and decompress the 'Beyond Compare' software
> package:
>
> root@barry-desktop:~# ./zstd
> File size: 182502912 bytes
> 4KB Block: Compression time = 0.765915 seconds, Decompression time =
> 0.203366 seconds
> Original size: 182502912 bytes
> Compressed size: 66089193 bytes
> Compression ratio: 36.21%
> 16KB Block: Compression time = 0.558595 seconds, Decompression time =
> 0.153837 seconds
> Original size: 182502912 bytes
> Compressed size: 59159073 bytes
> Compression ratio: 32.42%
> 32KB Block: Compression time = 0.538106 seconds, Decompression time =
> 0.137768 seconds
> Original size: 182502912 bytes
> Compressed size: 57958701 bytes
> Compression ratio: 31.76%
> 64KB Block: Compression time = 0.532212 seconds, Decompression time =
> 0.127592 seconds
> Original size: 182502912 bytes
> Compressed size: 56700795 bytes
> Compression ratio: 31.07%
>
> In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
>
Yes, I think if there isn't a very significant benefit of using a larger order,
then it's better not to have this option. It would also simplify the code.
>>
>> Did you check the performance difference with and without patch 4?
>
> I retested after reverting patch 4, and the sys time increased to over
> 40 minutes
> again, though it was slightly better than without the entire series.
>
> *** Executing round 1 ***
>
> real 7m49.342s
> user 80m53.675s
> sys 42m28.393s
> pswpin: 29965548
> pswpout: 51127359
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 11347712
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 6641230
> pgpgin: 147376000
> pgpgout: 213343124
>
> *** Executing round 2 ***
>
> real 7m41.331s
> user 81m16.631s
> sys 41m39.845s
> pswpin: 29208867
> pswpout: 50006026
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 11104912
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 6483827
> pgpgin: 144057340
> pgpgout: 208887688
>
>
> *** Executing round 3 ***
>
> real 7m47.280s
> user 78m36.767s
> sys 37m32.210s
> pswpin: 26426526
> pswpout: 45420734
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 10104304
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 5884839
> pgpgin: 132013648
> pgpgout: 190537264
>
> *** Executing round 4 ***
>
> real 7m56.723s
> user 80m36.837s
> sys 41m35.979s
> pswpin: 29367639
> pswpout: 50059254
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 11116176
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 6514064
> pgpgin: 144593828
> pgpgout: 209080468
>
> *** Executing round 5 ***
>
> real 7m53.806s
> user 80m30.953s
> sys 40m14.870s
> pswpin: 28091760
> pswpout: 48495748
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 10779720
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 6244819
> pgpgin: 138813124
> pgpgout: 202885480
>
> I guess it is due to the occurrence of numerous partial reads
> (about 10%, 3505537/35159852).
>
> root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
>
> zram_bio write/read multi_pages count:54452828 35159852
> zram_bio failed write/read multi_pages count 0 0
> zram_bio partial write/read multi_pages count 4 3505537
> multi_pages_miss_free 0
>
> This workload doesn't cause fragmentation in the buddy allocator, so it’s
> likely due to the failure of MEMCG_CHARGE.
>
>>
> >> I know that it won't help if you have a lot of unmovable pages
>> scattered everywhere, but were you able to compare the performance
>> of defrag=always vs patch 4? I feel like if you have space for 4 folios
>> then hopefully compaction should be able to do its job and you can
>> directly fill the large folio if the unmovable pages are better placed.
>> Johannes' series on preventing type mixing [1] would help.
>>
>> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
>
> I believe this could help, but defragmentation is a complex issue. Especially on
> phones, where various components like drivers, DMA-BUF, multimedia, and
> graphics utilize memory.
>
> We observed that a fresh system could initially provide mTHP, but after a few
> hours, obtaining mTHP became very challenging. I'm happy to arrange a test
> of Johannes' series on phones (sometimes it is quite hard to backport to the
> Android kernel) to see if it brings any improvements.
>
I think it's definitely worth trying. If we can improve memory allocation/compaction
instead of patch 4, then we should go for that. Maybe there won't be a need for TAO
if allocation is done in a smarter way?
Just out of curiosity, what is the base kernel version you are testing with?
Thanks,
Usama
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails
2024-11-25 16:19 ` Usama Arif
@ 2024-11-25 18:32 ` Barry Song
0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2024-11-25 18:32 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming, Chuanhua Han
On Tue, Nov 26, 2024 at 5:19 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 24/11/2024 21:47, Barry Song wrote:
> > On Sat, Nov 23, 2024 at 3:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 21/11/2024 22:25, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> The swapfile can compress/decompress at 4 * PAGES granularity, reducing
> >>> CPU usage and improving the compression ratio. However, if allocating an
> >>> mTHP fails and we fall back to a single small folio, the entire large
> >>> block must still be decompressed. This results in a 16 KiB area requiring
> >>> 4 page faults, where each fault decompresses 16 KiB but retrieves only
> >>> 4 KiB of data from the block. To address this inefficiency, we instead
> >>> fall back to 4 small folios, ensuring that each decompression occurs
> >>> only once.
> >>>
> >>> Allowing swap_read_folio() to decompress and read into an array of
> >>> 4 folios would be extremely complex, requiring extensive changes
> >>> throughout the stack, including swap_read_folio, zeromap,
> >>> zswap, and final swap implementations like zRAM. In contrast,
> >>> having these components fill a large folio with 4 subpages is much
> >>> simpler.
> >>>
> >>> To avoid a full-stack modification, we introduce a per-CPU order-2
> >>> large folio as a buffer. This buffer is used for swap_read_folio(),
> >>> after which the data is copied into the 4 small folios. Finally, in
> >>> do_swap_page(), all these small folios are mapped.
> >>>
> >>> Co-developed-by: Chuanhua Han <chuanhuahan@gmail.com>
> >>> Signed-off-by: Chuanhua Han <chuanhuahan@gmail.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>> mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >>> 1 file changed, 192 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 209885a4134f..e551570c1425 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> >>> return folio;
> >>> }
> >>>
> >>> +#define BATCH_SWPIN_ORDER 2
> >>
> >> Hi Barry,
> >>
> >> Thanks for the series and the numbers in the cover letter.
> >>
> >> Just a few things.
> >>
> >> Should BATCH_SWPIN_ORDER be ZSMALLOC_MULTI_PAGES_ORDER instead of 2?
> >
> > Technically, yes. I'm also considering removing ZSMALLOC_MULTI_PAGES_ORDER
> > and always setting it to 2, which is the minimum anonymous mTHP order. The main
> > reason is that it may be difficult for users to select the appropriate Kconfig?
> >
> > On the other hand, 16KB provides the most advantages for zstd compression and
> > decompression with larger blocks. While increasing from 16KB to 32KB or 64KB
> > can offer additional benefits, the improvement is not as significant
> > as the jump from
> > 4KB to 16KB.
> >
> > As I use zstd to compress and decompress the 'Beyond Compare' software
> > package:
> >
> > root@barry-desktop:~# ./zstd
> > File size: 182502912 bytes
> > 4KB Block: Compression time = 0.765915 seconds, Decompression time =
> > 0.203366 seconds
> > Original size: 182502912 bytes
> > Compressed size: 66089193 bytes
> > Compression ratio: 36.21%
> > 16KB Block: Compression time = 0.558595 seconds, Decompression time =
> > 0.153837 seconds
> > Original size: 182502912 bytes
> > Compressed size: 59159073 bytes
> > Compression ratio: 32.42%
> > 32KB Block: Compression time = 0.538106 seconds, Decompression time =
> > 0.137768 seconds
> > Original size: 182502912 bytes
> > Compressed size: 57958701 bytes
> > Compression ratio: 31.76%
> > 64KB Block: Compression time = 0.532212 seconds, Decompression time =
> > 0.127592 seconds
> > Original size: 182502912 bytes
> > Compressed size: 56700795 bytes
> > Compression ratio: 31.07%
> >
> > In that case, would we no longer need to rely on ZSMALLOC_MULTI_PAGES_ORDER?
> >
>
> Yes, I think if there isn't a very significant benefit of using a larger order,
> then its better not to have this option. It would also simplify the code.
>
> >>
> >> Did you check the performance difference with and without patch 4?
> >
> > I retested after reverting patch 4, and the sys time increased to over
> > 40 minutes
> > again, though it was slightly better than without the entire series.
> >
> > *** Executing round 1 ***
> >
> > real 7m49.342s
> > user 80m53.675s
> > sys 42m28.393s
> > pswpin: 29965548
> > pswpout: 51127359
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11347712
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6641230
> > pgpgin: 147376000
> > pgpgout: 213343124
> >
> > *** Executing round 2 ***
> >
> > real 7m41.331s
> > user 81m16.631s
> > sys 41m39.845s
> > pswpin: 29208867
> > pswpout: 50006026
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11104912
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6483827
> > pgpgin: 144057340
> > pgpgout: 208887688
> >
> >
> > *** Executing round 3 ***
> >
> > real 7m47.280s
> > user 78m36.767s
> > sys 37m32.210s
> > pswpin: 26426526
> > pswpout: 45420734
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 10104304
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 5884839
> > pgpgin: 132013648
> > pgpgout: 190537264
> >
> > *** Executing round 4 ***
> >
> > real 7m56.723s
> > user 80m36.837s
> > sys 41m35.979s
> > pswpin: 29367639
> > pswpout: 50059254
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 11116176
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6514064
> > pgpgin: 144593828
> > pgpgout: 209080468
> >
> > *** Executing round 5 ***
> >
> > real 7m53.806s
> > user 80m30.953s
> > sys 40m14.870s
> > pswpin: 28091760
> > pswpout: 48495748
> > 64kB-swpout: 0
> > 32kB-swpout: 0
> > 16kB-swpout: 10779720
> > 64kB-swpin: 0
> > 32kB-swpin: 0
> > 16kB-swpin: 6244819
> > pgpgin: 138813124
> > pgpgout: 202885480
> >
> > I guess it is due to the occurrence of numerous partial reads
> > (about 10%, 3505537/35159852).
> >
> > root@barry-desktop:~# cat /sys/block/zram0/multi_pages_debug_stat
> >
> > zram_bio write/read multi_pages count:54452828 35159852
> > zram_bio failed write/read multi_pages count 0 0
> > zram_bio partial write/read multi_pages count 4 3505537
> > multi_pages_miss_free 0
> >
> > This workload doesn't cause fragmentation in the buddy allocator, so it’s
> > likely due to the failure of MEMCG_CHARGE.
> >
> >>
> >> I know that it won't help if you have a lot of unmovable pages
> >> scattered everywhere, but were you able to compare the performance
> >> of defrag=always vs patch 4? I feel like if you have space for 4 folios
> >> then hopefully compaction should be able to do its job and you can
> >> directly fill the large folio if the unmovable pages are better placed.
> >> Johannes' series on preventing type mixing [1] would help.
> >>
> >> [1] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
> >
> > I believe this could help, but defragmentation is a complex issue. Especially on
> > phones, where various components like drivers, DMA-BUF, multimedia, and
> > graphics utilize memory.
> >
> > We observed that a fresh system could initially provide mTHP, but after a few
> > hours, obtaining mTHP became very challenging. I'm happy to arrange a test
> > of Johannes' series on phones (sometimes it is quite hard to backport to the
> > Android kernel) to see if it brings any improvements.
> >
>
> I think it's definitely worth trying. If we can improve memory allocation/compaction
> instead of patch 4, then we should go for that. Maybe there won't be a need for TAO
> if allocation is done in a smarter way?
>
> Just out of curiosity, what is the base kernel version you are testing with?
This kernel build testing was conducted on my Intel PC running mm-unstable,
which includes Johannes' series. As mentioned earlier, it still shows many
partial reads without patch 4.
For phones, we have to backport to Android kernels such as 6.6, 6.1, etc.:
https://android.googlesource.com/kernel/common/+refs
Testing a new patchset can sometimes be quite a pain ....
>
> Thanks,
> Usama
Thanks
Barry
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-21 22:25 [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
` (3 preceding siblings ...)
2024-11-21 22:25 ` [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails Barry Song
@ 2024-11-26 5:09 ` Sergey Senozhatsky
2024-11-26 10:52 ` Sergey Senozhatsky
2024-11-26 20:20 ` Barry Song
4 siblings, 2 replies; 19+ messages in thread
From: Sergey Senozhatsky @ 2024-11-26 5:09 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
On (24/11/22 11:25), Barry Song wrote:
> When large folios are compressed at a larger granularity, we observe
> a notable reduction in CPU usage and a significant improvement in
> compression ratios.
>
> This patchset enhances zsmalloc and zram by adding support for dividing
> large folios into multi-page blocks, typically configured with a
> 2-order granularity. Without this patchset, a large folio is always
> divided into `nr_pages` 4KiB blocks.
>
> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER`
> setting, where the default of 2 allows all anonymous THP to benefit.
I can't say that I'm in love with this part.
Looking at zsmalloc stats, your new size-classes are significantly
further apart from each other than our traditional size classes.
For example, with ZSMALLOC_CHAIN_SIZE of 10, some size-classes are
more than 400 bytes (that's almost 10% of PAGE_SIZE) apart:
// stripped
344 9792
348 10048
351 10240
353 10368
355 10496
361 10880
368 11328
370 11456
373 11648
377 11904
383 12288
387 12544
390 12736
395 13056
400 13376
404 13632
410 14016
415 14336
Which means that every object of size, let's say, 10881 will
go into the 11328 size-class and have 447 bytes of padding between
each object.
And with ZSMALLOC_CHAIN_SIZE of 8, it seems, we have even larger
padding gaps:
// stripped
348 10048
351 10240
353 10368
361 10880
370 11456
373 11648
377 11904
383 12288
390 12736
395 13056
404 13632
410 14016
415 14336
418 14528
447 16384
E.g. 13632 and 13056 are more than 500 bytes apart.
> swap-out time(ms) 68711 49908
> swap-in time(ms) 30687 20685
> compression ratio 20.49% 16.9%
These are not the only numbers to focus on; the really important metrics
are zsmalloc pages-used and zsmalloc max-pages-used. Then we can
calculate the pool memory usage ratio (the size of the compressed data vs
the number of pages the zsmalloc pool allocated to keep it).
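For illustration, a rough sketch of pulling that ratio for a zram-backed
pool from mm_stat (column layout per Documentation/admin-guide/blockdev/zram.rst;
the device name is assumed):

  # orig_data_size compr_data_size mem_used_total mem_limit mem_used_max ...
  read -r orig compr mem_used mem_limit mem_max rest < /sys/block/zram0/mm_stat
  echo "compressed data  : $compr bytes"
  echo "pool pages used  : $((mem_used / 4096)) (max: $((mem_max / 4096)))"
  echo "pool usage ratio : $(echo "scale=2; 100 * $compr / $mem_used" | bc)%"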
More importantly, dealing with internal fragmentation in a size-class
of, let's say, 14528 will be a little painful, as we'll need to move
around 14K objects.
As for the speed part, well, it's a little unusual to see that you
are focusing on zstd. zstd is slower than anything from the lzX family,
sort of a fact; zstd sports a better compression ratio, but is slower.
Do you use zstd in your smartphones? If speed is your main metric,
another option might be to just use a faster algorithm and then utilize
post-processing (re-compression with zstd or writeback) for memory
savings?
Do you happen to have some data (pool memory usage ratio, etc.) for
lzo, lzo-rle, or lz4?
* Re: [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages
2024-11-21 22:25 ` [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
@ 2024-11-26 5:37 ` Sergey Senozhatsky
2024-11-27 1:53 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2024-11-26 5:37 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, senozhatsky, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
On (24/11/22 11:25), Barry Song wrote:
> static void reset_page(struct page *page)
> {
> - __ClearPageMovable(page);
> + if (PageMovable(page))
> + __ClearPageMovable(page);
A side note:
ERROR: modpost: "PageMovable" [mm/zsmalloc.ko] undefined!
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-26 5:09 ` [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Sergey Senozhatsky
@ 2024-11-26 10:52 ` Sergey Senozhatsky
2024-11-26 20:31 ` Barry Song
2024-11-26 20:20 ` Barry Song
1 sibling, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2024-11-26 10:52 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, surenb, terrelln, usamaarif642, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming, Sergey Senozhatsky
On (24/11/26 14:09), Sergey Senozhatsky wrote:
> > swap-out time(ms) 68711 49908
> > swap-in time(ms) 30687 20685
> > compression ratio 20.49% 16.9%
I'm also sort of curious if you'd use zstd with pre-trained user
dictionary [1] (e.g. based on a dump of your swap-file under most
common workloads) would it give you desired compression ratio
improvements (on current zram, that does single page compression).
[1] https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression
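For reference, a rough sketch of the offline CLI workflow (assuming
swap-page-*.bin are raw 4KiB anonymous-page samples dumped from a test
device, and leaving aside how the dictionary would be handed to the
kernel side):

  # train a small dictionary from the page samples
  zstd --train swap-page-*.bin -o swap.dict --maxdict=65536
  # compare the compressed size of a single page with and without it
  zstd -3 -c page0.bin | wc -c
  zstd -3 -D swap.dict -c page0.bin | wc -c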
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-26 5:09 ` [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Sergey Senozhatsky
2024-11-26 10:52 ` Sergey Senozhatsky
@ 2024-11-26 20:20 ` Barry Song
2024-11-27 4:52 ` Sergey Senozhatsky
1 sibling, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-11-26 20:20 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, surenb, terrelln, usamaarif642, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming
On Tue, Nov 26, 2024 at 6:09 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/11/22 11:25), Barry Song wrote:
> > When large folios are compressed at a larger granularity, we observe
> > a notable reduction in CPU usage and a significant improvement in
> > compression ratios.
> >
> > This patchset enhances zsmalloc and zram by adding support for dividing
> > large folios into multi-page blocks, typically configured with a
> > 2-order granularity. Without this patchset, a large folio is always
> > divided into `nr_pages` 4KiB blocks.
> >
> > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER`
> > setting, where the default of 2 allows all anonymous THP to benefit.
>
> I can't say that I'm in love with this part.
>
> Looking at zsmalloc stats, your new size-classes are significantly
> further apart from each other than our traditional size classes.
> For example, with ZSMALLOC_CHAIN_SIZE of 10, some size-classes are
> more than 400 (that's almost 10% of PAGE_SIZE) bytes apart
>
> // stripped
> 344 9792
> 348 10048
> 351 10240
> 353 10368
> 355 10496
> 361 10880
> 368 11328
> 370 11456
> 373 11648
> 377 11904
> 383 12288
> 387 12544
> 390 12736
> 395 13056
> 400 13376
> 404 13632
> 410 14016
> 415 14336
>
> Which means that every object of size, let's say, 10881 will
> go into 11328 size class and have 447 bytes of padding between
> each object.
>
> And with ZSMALLOC_CHAIN_SIZE of 8, it seems, we have even larger
> padding gaps:
>
> // stripped
> 348 10048
> 351 10240
> 353 10368
> 361 10880
> 370 11456
> 373 11648
> 377 11904
> 383 12288
> 390 12736
> 395 13056
> 404 13632
> 410 14016
> 415 14336
> 418 14528
> 447 16384
>
> E.g. 13632 and 13056 are more than 500 bytes apart.
>
> > swap-out time(ms) 68711 49908
> > swap-in time(ms) 30687 20685
> > compression ratio 20.49% 16.9%
>
> These are not the only numbers to focus on, really important metrics
> are: zsmalloc pages-used and zsmalloc max-pages-used. Then we can
> calculate the pool memory usage ratio (the size of compressed data vs
> the number of pages zsmalloc pool allocated to keep them).
To address this, we plan to collect more data and get back to you
afterwards. From my understanding, we still have an opportunity
to refine CHAIN_SIZE?
Essentially, each small object might cause some waste within the
original PAGE_SIZE. Now, with 4 * PAGE_SIZE, there could be a
single instance of waste. If we can manage the ratio, this could be
optimized?
>
> More importantly, dealing with internal fragmentation in a size-class,
> let's say, of 14528 will be a little painful, as we'll need to move
> around 14K objects.
>
> As, for the speed part, well, it's a little unusual to see that you
> are focusing on zstd. zstd is slower than any from the lzX family,
> sort of a fact, zstd sports better compression ratio, but is slower.
> Do you use zstd in your smartphones? If speed is your main metrics,
Yes, essentially, zstd is too slow. However, with mTHP and this patch
set, the swap-out/swap-in bandwidth has significantly improved. As a
result, we are now using zstd directly on phones with two zRAM
devices:
zRAM0: swap-out/swap-in small folios using lz4;
zRAM1: swap-out/swap-in large folios using zstd.
Without large folios, the latency of zstd for small folios is
unacceptable, which is why zRAM0 uses lz4. On the other hand, zRAM1
strikes a balance by combining the acceptable speed of large folios
with the memory savings provided by zstd.
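For context, a minimal sketch of such a two-device setup from userspace
(sizes and priorities are made up; routing small vs. large folios to a
specific device is a vendor kernel policy, not something this alone
provides):

  modprobe zram num_devices=2
  echo lz4  > /sys/block/zram0/comp_algorithm   # low-latency path
  echo zstd > /sys/block/zram1/comp_algorithm   # better-ratio path
  echo 2G > /sys/block/zram0/disksize
  echo 6G > /sys/block/zram1/disksize
  mkswap /dev/zram0 && swapon -p 100 /dev/zram0
  mkswap /dev/zram1 && swapon -p 90 /dev/zram1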
> another option might be to just use a faster algorithm and then utilize
> post-processing (re-compression with zstd or writeback) for memory
> savings?
The concern lies in power consumption, as re-compression would require
decompressing LZ4 and recompressing it into Zstd. Mobile phones are
particularly sensitive to both power consumption and standby time.
On the other hand, I don’t see any conflict between recompression and
the large block compression proposed by this patchset. Even during
recompression, the advantages of large block compression can be
utilized to enhance speed.
Writeback is another approach we are exploring. The main concern is that
it might require swapping in data from backend block devices. We need to
ensure that only truly cold data is stored there; otherwise, it could
significantly impact app launch times when an app transitions from the
background to the foreground.
>
> Do you happen to have some data (pool memory usage ratio, etc.) for
> lzo or lzo-rle, or lz4?
TBH, I don't, because the current use case involves using zstd for large folios,
which is our main focus. We are not using lzo or lz4 for large folios, but
I can definitely collect some data on that.
Thanks
Barry
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-26 10:52 ` Sergey Senozhatsky
@ 2024-11-26 20:31 ` Barry Song
2024-11-27 5:04 ` Sergey Senozhatsky
0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-11-26 20:31 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, surenb, terrelln, usamaarif642, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming
On Tue, Nov 26, 2024 at 11:53 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/11/26 14:09), Sergey Senozhatsky wrote:
> > > swap-out time(ms) 68711 49908
> > > swap-in time(ms) 30687 20685
> > > compression ratio 20.49% 16.9%
>
> I'm also sort of curious if you'd use zstd with pre-trained user
> dictionary [1] (e.g. based on a dump of your swap-file under most
> common workloads) would it give you desired compression ratio
> improvements (on current zram, that does single page compression).
>
> [1] https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression
Not yet, but it might be worth trying. A key difference between servers and
Android phones is that phones have millions of different applications
downloaded from the Google Play Store or other sources. In this case,
would using a dictionary be a feasible approach? Apologies if my question
seems too naive.
On the other hand, the advantage of a pre-trained user dictionary
doesn't outweigh the benefits of large block compression? Can’t both
be used together?
Thanks
Barry
* Re: [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages
2024-11-26 5:37 ` Sergey Senozhatsky
@ 2024-11-27 1:53 ` Barry Song
0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2024-11-27 1:53 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, surenb, terrelln, usamaarif642, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming
On Tue, Nov 26, 2024 at 6:37 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/11/22 11:25), Barry Song wrote:
> > static void reset_page(struct page *page)
> > {
> > - __ClearPageMovable(page);
> > + if (PageMovable(page))
> > + __ClearPageMovable(page);
>
> A side note:
> ERROR: modpost: "PageMovable" [mm/zsmalloc.ko] undefined!
My mistake. It could be if (!__PageMovable(page)). Ideally, we should support
movability for large block compression.
Thanks
Barry
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-26 20:20 ` Barry Song
@ 2024-11-27 4:52 ` Sergey Senozhatsky
2024-11-28 20:40 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2024-11-27 4:52 UTC (permalink / raw)
To: Barry Song
Cc: Sergey Senozhatsky, akpm, linux-mm, axboe, bala.seshasayee,
chrisl, david, hannes, kanchana.p.sridhar, kasong, linux-block,
minchan, nphamcs, ryan.roberts, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
On (24/11/27 09:20), Barry Song wrote:
[..]
> > 390 12736
> > 395 13056
> > 404 13632
> > 410 14016
> > 415 14336
> > 418 14528
> > 447 16384
> >
> > E.g. 13632 and 13056 are more than 500 bytes apart.
> >
> > > swap-out time(ms) 68711 49908
> > > swap-in time(ms) 30687 20685
> > > compression ratio 20.49% 16.9%
> >
> > These are not the only numbers to focus on, really important metrics
> > are: zsmalloc pages-used and zsmalloc max-pages-used. Then we can
> > calculate the pool memory usage ratio (the size of compressed data vs
> > the number of pages zsmalloc pool allocated to keep them).
>
> To address this, we plan to collect more data and get back to you
> afterwards. From my understanding, we still have an opportunity
> to refine CHAIN_SIZE?
Do you mean changing the value? It's configurable.
> Essentially, each small object might cause some waste within the
> original PAGE_SIZE. Now, with 4 * PAGE_SIZE, there could be a
> single instance of waste. If we can manage the ratio, this could be
> optimized?
All size classes work the same and we merge size-classes with equal
characteristics. So in the example above
395 13056
404 13632
size-classes #396-403 are merged with size-class #404. And the #404 size-class
splits its zspage into 13632-byte chunks; any smaller object (e.g. an object
from size-class #396, which can be just one byte larger than a #395
object) takes that entire chunk, and the rest of the space in the chunk
is just padding.
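As a toy illustration of that padding (numbers from the table above; the
debugfs path assumes CONFIG_ZSMALLOC_STAT and a pool named zram0):

  obj=13057    # one byte larger than a #395 (13056-byte) object
  chunk=13632  # the #404 chunk it ends up in
  echo "wasted per object: $((chunk - obj)) bytes"   # -> 575
  # the live per-class picture, including merged classes:
  cat /sys/kernel/debug/zsmalloc/zram0/classes 2>/dev/null | head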
CHAIN_SIZE is how we find the optimal balance. The larger the zspage,
the more likely we squeeze some space for extra objects, which otherwise
would have been just a waste. With a large CHAIN_SIZE we also change the
characteristics of many size classes, so we merge fewer classes and have
more clusters. The price, on the other hand, is more physical 0-order
pages per zspage, which can be painful. On all the tests I ran, 8 or 10
worked best.
[..]
> > another option might be to just use a faster algorithm and then utilize
> > post-processing (re-compression with zstd or writeback) for memory
> > savings?
>
> The concern lies in power consumption
But the power consumption concern also applies to the "decompress just
one middle page from a very large object" case, and to size-class
de-fragmentation, which requires moving around lots of objects in order
to form more full zspages and release empty ones. There are concerns
everywhere; how many of them are measured and analyzed, and either ruled
out or confirmed, is another question.
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-26 20:31 ` Barry Song
@ 2024-11-27 5:04 ` Sergey Senozhatsky
2024-11-28 20:56 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Sergey Senozhatsky @ 2024-11-27 5:04 UTC (permalink / raw)
To: Barry Song
Cc: Sergey Senozhatsky, akpm, linux-mm, axboe, bala.seshasayee,
chrisl, david, hannes, kanchana.p.sridhar, kasong, linux-block,
minchan, nphamcs, ryan.roberts, surenb, terrelln, usamaarif642,
v-songbaohua, wajdi.k.feghali, willy, ying.huang, yosryahmed,
yuzhao, zhengtangquan, zhouchengming
On (24/11/27 09:31), Barry Song wrote:
> On Tue, Nov 26, 2024 at 11:53 PM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> >
> > On (24/11/26 14:09), Sergey Senozhatsky wrote:
> > > > swap-out time(ms) 68711 49908
> > > > swap-in time(ms) 30687 20685
> > > > compression ratio 20.49% 16.9%
> >
> > I'm also sort of curious if you'd use zstd with pre-trained user
> > dictionary [1] (e.g. based on a dump of your swap-file under most
> > common workloads) would it give you desired compression ratio
> > improvements (on current zram, that does single page compression).
> >
> > [1] https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression
>
> Not yet, but it might be worth trying. A key difference between servers and
> Android phones is that phones have millions of different applications
> downloaded from the Google Play Store or other sources.
Maybe yes, maybe not, I don't know. It could be that 99% of users
use the same 1% of apps out of those millions.
> In this case, would using a dictionary be a feasible approach? Apologies
> if my question seems too naive.
It's a good question, and there is probably only one way to answer
it - through experiments; it's data-dependent, so it's case-by-case.
> On the other hand, the advantage of a pre-trained user dictionary
> doesn't outweigh the benefits of large block compression? Can’t both
> be used together?
Well, so far the approach has many unmeasured unknowns and corner
cases; I don't think I personally even understand all of them to begin
with. I'm not sure I have a way to measure and analyze them - mTHP
swapout seems like a relatively new thing, and it also seems that you
are still fixing some of its issues/shortcomings.
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-27 4:52 ` Sergey Senozhatsky
@ 2024-11-28 20:40 ` Barry Song
0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2024-11-28 20:40 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, surenb, terrelln, usamaarif642, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming
On Wed, Nov 27, 2024 at 5:52 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/11/27 09:20), Barry Song wrote:
> [..]
> > > 390 12736
> > > 395 13056
> > > 404 13632
> > > 410 14016
> > > 415 14336
> > > 418 14528
> > > 447 16384
> > >
> > > E.g. 13632 and 13056 are more than 500 bytes apart.
> > >
> > > > swap-out time(ms) 68711 49908
> > > > swap-in time(ms) 30687 20685
> > > > compression ratio 20.49% 16.9%
> > >
> > > These are not the only numbers to focus on, really important metrics
> > > are: zsmalloc pages-used and zsmalloc max-pages-used. Then we can
> > > calculate the pool memory usage ratio (the size of compressed data vs
> > > the number of pages zsmalloc pool allocated to keep them).
> >
> > To address this, we plan to collect more data and get back to you
> > afterwards. From my understanding, we still have an opportunity
> > to refine the CHAIN SIZE?
>
> Do you mean changing the value? It's configurable.
>
> > Essentially, each small object might cause some waste within the
> > original PAGE_SIZE. Now, with 4 * PAGE_SIZE, there could be a
> > single instance of waste. If we can manage the ratio, this could be
> > optimized?
>
> All size classes work the same and we merge size-classes with equal
> characteristics. So in the example above
>
> 395 13056
> 404 13632
>
> size-classes #396-403 are merged with size-class #404. And #404 size-class
> splits zspage into 13632-byte chunks, any smaller objects (e.g. an object
> from size-class #396 (which can be just one byte larger than #395
> objects)) takes that entire chunk and the rest of the space in the chunk
> is just padding.
>
> CHAIN_SIZE is how we find the optimal balance. The larger the zspage
> the more likely we squeeze some space for extra objects, which otherwise
> would have been just a waste. With large CHAIN_SIZE we also change
> characteristics of many size classes so we merge less classes and have
> more clusters. The price, on the other hand, is more physical 0-order
> pages per zspage, which can be painful. On all the tests I ran 8 or 10
> worked best.
Thanks very much for the explanation. We’ll gather more data on this and follow
up with you.
>
> [..]
> > > another option might be to just use a faster algorithm and then utilize
> > > post-processing (re-compression with zstd or writeback) for memory
> > > savings?
> >
> > The concern lies in power consumption
>
> But the power consumption concern is also in "decompress just one middle
> page from very large object" case, and size-classes de-fragmentation
That's why we have "[patch 4/4] mm: fall back to four small folios if mTHP
allocation fails" to address the issue of "decompressing just one middle page
from a very large object." I assume that recompression and writeback should
also focus on large objects if the original compression involves multiple pages?
> which requires moving around lots of objects in order to form more full
> zspage and release empty zspages. There are concerns everywhere, how
I assume the cost of defragmentation is M * N, where:
* M is the number of objects,
* N is the size of the objects.
With large objects, M is reduced to 1/4 of the original number of
objects. Although N increases, the overall M * N becomes slightly
smaller than before, as N is just under 4 times the size of the
original objects?
> many of them are measured and analyzed and either ruled out or confirmed
> is another question.
In phone scenarios, if recompression uses zstd and the original compression
is based on lz4 with 4KB blocks, the cost to obtain zstd-compressed objects
would be:
* A: Compression of 4 × 4KB using lz4
* B: Decompression of 4 × 4KB using lz4
* C: Compression of 4 × 4KB using zstd
By leveraging the speed advantages of mTHP swap and zstd's large-block
compression, the cost becomes:
* D: Compression of 16KB using zstd
Since D is significantly smaller than C (D < C), it follows that:
D < A + B + C ?
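For a rough offline sanity check of the C vs. D part with the zstd CLI
(sample.bin standing in for a dump of anonymous pages; the lz4 A/B terms
are ignored here):

  zstd -b3 -B4096  sample.bin   # four independent 4KiB blocks per 16KiB (C-like)
  zstd -b3 -B16384 sample.bin   # one 16KiB block (D-like)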
Thanks
Barry
* Re: [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages
2024-11-27 5:04 ` Sergey Senozhatsky
@ 2024-11-28 20:56 ` Barry Song
0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2024-11-28 20:56 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: akpm, linux-mm, axboe, bala.seshasayee, chrisl, david, hannes,
kanchana.p.sridhar, kasong, linux-block, minchan, nphamcs,
ryan.roberts, surenb, terrelln, usamaarif642, v-songbaohua,
wajdi.k.feghali, willy, ying.huang, yosryahmed, yuzhao,
zhengtangquan, zhouchengming
On Wed, Nov 27, 2024 at 6:04 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/11/27 09:31), Barry Song wrote:
> > On Tue, Nov 26, 2024 at 11:53 PM Sergey Senozhatsky
> > <senozhatsky@chromium.org> wrote:
> > >
> > > On (24/11/26 14:09), Sergey Senozhatsky wrote:
> > > > > swap-out time(ms) 68711 49908
> > > > > swap-in time(ms) 30687 20685
> > > > > compression ratio 20.49% 16.9%
> > >
> > > I'm also sort of curious if you'd use zstd with pre-trained user
> > > dictionary [1] (e.g. based on a dump of your swap-file under most
> > > common workloads) would it give you desired compression ratio
> > > improvements (on current zram, that does single page compression).
> > >
> > > [1] https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression
> >
> > Not yet, but it might be worth trying. A key difference between servers and
> > Android phones is that phones have millions of different applications
> > downloaded from the Google Play Store or other sources.
>
> Maybe yes maybe not, I don't know. It could be that that 99% of users
> use the same 1% apps out of those millions.
>
> > In this case, would using a dictionary be a feasible approach? Apologies
> > if my question seems too naive.
>
> It's a good question, and there is probably only one way to answer
> it - through experiments, it's data dependent, so it's case-by-case.
Sure, we may collect data on the most popular apps (e.g., the top 100) and
train zstd using their anonymous data to identify patterns. We’ll follow up
with you afterward.
>
> > On the other hand, the advantage of a pre-trained user dictionary
> > doesn't outweigh the benefits of large block compression? Can’t both
> > be used together?
>
> Well, so far the approach has many unmeasured unknowns and corner
> cases, I don't think I personally even understand all of them to begin
I agree we can make an effort to dig deeper and collect more data, analyzing
as many corner cases as possible, but many unknowns are a common
characteristic of new things :-)
> with. Not sure if I have a way to measure and analyze, that mTHP
> swapout seems like a relatively new thing and it also seems that you
> are still fixing some of its issues/shortcomings.
A challenge is determining how to make mTHP fully transparent (e.g., not
dependent on sysfs controls for enabling/disabling) across various
workloads. The default policy may not always be optimal for all workloads.
Despite that, there are certainly benefits we can gain from mTHP within
zsmalloc/zram.
Thanks
Barry
Thread overview: 19+ messages
2024-11-21 22:25 [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
2024-11-26 5:37 ` Sergey Senozhatsky
2024-11-27 1:53 ` Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 2/4] zram: support compression at the granularity of multi-pages Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 3/4] zram: backend_zstd: Adjust estimated_src_size to accommodate multi-page compression Barry Song
2024-11-21 22:25 ` [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP allocation fails Barry Song
2024-11-22 14:54 ` Usama Arif
2024-11-24 21:47 ` Barry Song
2024-11-25 16:19 ` Usama Arif
2024-11-25 18:32 ` Barry Song
2024-11-26 5:09 ` [PATCH RFC v3 0/4] mTHP-friendly compression in zsmalloc and zram based on multi-pages Sergey Senozhatsky
2024-11-26 10:52 ` Sergey Senozhatsky
2024-11-26 20:31 ` Barry Song
2024-11-27 5:04 ` Sergey Senozhatsky
2024-11-28 20:56 ` Barry Song
2024-11-26 20:20 ` Barry Song
2024-11-27 4:52 ` Sergey Senozhatsky
2024-11-28 20:40 ` Barry Song