[PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages
@ 2024-03-27 21:48 Barry Song
  2024-03-27 21:48 ` [PATCH RFC 1/2] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Barry Song @ 2024-03-27 21:48 UTC (permalink / raw)
  To: akpm, minchan, senozhatsky, linux-block, axboe, linux-mm
  Cc: terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Barry Song

From: Barry Song <v-songbaohua@oppo.com>

mTHP is generally considered to potentially waste memory due to fragmentation,
but it may also serve as a source of memory savings.
When large folios are compressed at a larger granularity, we observe a remarkable
decrease in CPU utilization and a significant improvement in compression ratios.

The following data illustrates the time and compressed data for typical anonymous
pages gathered from Android phones.

granularity   orig_data_size   compr_data_size   time(us)
4KiB-zstd      1048576000       246876055        50259962
64KiB-zstd     1048576000       199763892        18330605

Due to mTHP's ability to be swapped out without splitting[1] and swapped in as a
whole[2], it enables compression and decompression to be performed at larger
granularities.

This patchset enhances zsmalloc and zram by introducing support for dividing large
folios into multi-pages, typically configured with a 4-order granularity. Here are
concrete examples:

* If a large folio's size is 32KiB, it will still be compressed and stored at a 4KiB
  granularity.
* If a large folio's size is 64KiB, it will be compressed and stored as a single 64KiB
  block.
* If a large folio's size is 128KiB, it will be compressed and stored as two 64KiB
  multi-pages.

Without the patchset, a large folio is always divided into nr_pages 4KiB blocks.

The granularity can be configured using the ZSMALLOC_MULTI_PAGES_ORDER setting.

[1] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/

Tangquan Zheng (2):
  mm: zsmalloc: support objects compressed based on multiple pages
  zram: support compression at the granularity of multi-pages

 drivers/block/zram/Kconfig    |   9 +
 drivers/block/zram/zcomp.c    |  23 ++-
 drivers/block/zram/zcomp.h    |  12 +-
 drivers/block/zram/zram_drv.c | 372 +++++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h |  21 ++
 include/linux/zsmalloc.h      |  10 +-
 mm/Kconfig                    |  18 ++
 mm/zsmalloc.c                 | 215 +++++++++++++++-----
 8 files changed, 586 insertions(+), 94 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 1/2] mm: zsmalloc: support objects compressed based on multiple pages
  2024-03-27 21:48 [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
@ 2024-03-27 21:48 ` Barry Song
  2024-10-21 23:26   ` Barry Song
  2024-03-27 21:48 ` [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages Barry Song
  2024-03-27 22:01 ` [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
  2 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2024-03-27 21:48 UTC (permalink / raw)
  To: akpm, minchan, senozhatsky, linux-block, axboe, linux-mm
  Cc: terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Tangquan Zheng,
	Barry Song

From: Tangquan Zheng <zhengtangquan@oppo.com>

This patch introduces support for zsmalloc to store compressed objects based
on multi-pages. Previously, a large folio with nr_pages subpages would undergo
compression one by one, each at the granularity of PAGE_SIZE. However, by
compressing them at a larger granularity, we can conserve both memory and
CPU resources.

We define the granularity using a configuration option called
ZSMALLOC_MULTI_PAGES_ORDER, with a default value of 4. Consequently, a
large folio with 32 subpages will now be divided into 2 parts rather
than 32 parts.

The introduction of the multi-pages feature necessitates the creation
of new size classes to accommodate it.

Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/zsmalloc.h |  10 +-
 mm/Kconfig               |  18 ++++
 mm/zsmalloc.c            | 215 +++++++++++++++++++++++++++++----------
 3 files changed, 187 insertions(+), 56 deletions(-)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..9fa3e7669557 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -33,6 +33,14 @@ enum zs_mapmode {
 	 */
 };
 
+enum zsmalloc_type {
+	ZSMALLOC_TYPE_BASEPAGE,
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+	ZSMALLOC_TYPE_MULTI_PAGES,
+#endif
+	ZSMALLOC_TYPE_MAX,
+};
+
 struct zs_pool_stats {
 	/* How many pages were migrated (freed) */
 	atomic_long_t pages_compacted;
@@ -46,7 +54,7 @@ void zs_destroy_pool(struct zs_pool *pool);
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
 void zs_free(struct zs_pool *pool, unsigned long obj);
 
-size_t zs_huge_class_size(struct zs_pool *pool);
+size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type);
 
 void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 			enum zs_mapmode mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index b1448aa81e15..cedb07094e8e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -224,6 +224,24 @@ config ZSMALLOC_CHAIN_SIZE
 
 	  For more information, see zsmalloc documentation.
 
+config ZSMALLOC_MULTI_PAGES
+	bool "support zsmalloc multiple pages"
+	depends on ZSMALLOC && !CONFIG_HIGHMEM
+	help
+	  This option configures zsmalloc to support allocations larger than
+	  PAGE_SIZE, enabling compression across multiple pages. The size of
+	  these multiple pages is determined by the configured
+	  ZSMALLOC_MULTI_PAGES_ORDER.
+
+config ZSMALLOC_MULTI_PAGES_ORDER
+	int "zsmalloc multiple pages order"
+	default 4
+	range 1 9
+	depends on ZSMALLOC_MULTI_PAGES
+	help
+	  This option is used to configure zsmalloc to support the compression
+	  of multiple pages.
+
 menu "Slab allocator options"
 
 config SLUB
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b42d3545ca85..8658421cee11 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -65,6 +65,12 @@
 
 #define ZSPAGE_MAGIC	0x58
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZSMALLOC_MULTI_PAGES_ORDER	(_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
+#define ZSMALLOC_MULTI_PAGES_NR		(1 << ZSMALLOC_MULTI_PAGES_ORDER)
+#define ZSMALLOC_MULTI_PAGES_SIZE	(PAGE_SIZE * ZSMALLOC_MULTI_PAGES_NR)
+#endif
+
 /*
  * This must be power of 2 and greater than or equal to sizeof(link_free).
  * These two conditions ensure that any 'struct link_free' itself doesn't
@@ -115,7 +121,8 @@
 
 #define HUGE_BITS	1
 #define FULLNESS_BITS	4
-#define CLASS_BITS	8
+#define CLASS_BITS	9
+#define ISOLATED_BITS	5
 #define MAGIC_VAL_BITS	8
 
 #define MAX(a, b) ((a) >= (b) ? (a) : (b))
@@ -126,7 +133,11 @@
 #define ZS_MIN_ALLOC_SIZE \
 	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
 /* each chunk includes extra space to keep handle */
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_MAX_ALLOC_SIZE	(ZSMALLOC_MULTI_PAGES_SIZE)
+#else
 #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
+#endif
 
 /*
  * On systems with 4K page size, this gives 255 size classes! There is a
@@ -141,9 +152,22 @@
  *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
  *  (reason above)
  */
-#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
-#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
-				      ZS_SIZE_CLASS_DELTA) + 1)
+
+#define ZS_PAGE_SIZE_CLASS_DELTA	(PAGE_SIZE >> (CLASS_BITS - 1))
+#define ZS_PAGE_SIZE_CLASSES	(DIV_ROUND_UP(PAGE_SIZE - ZS_MIN_ALLOC_SIZE, \
+				      ZS_PAGE_SIZE_CLASS_DELTA) + 1)
+
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_MULTI_PAGES_SIZE_CLASS_DELTA	(ZSMALLOC_MULTI_PAGES_SIZE >> (CLASS_BITS - 1))
+#define ZS_MULTI_PAGES_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - PAGE_SIZE, \
+				      ZS_MULTI_PAGES_SIZE_CLASS_DELTA) + 1)
+#endif
+
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_SIZE_CLASSES	(ZS_PAGE_SIZE_CLASSES + ZS_MULTI_PAGES_SIZE_CLASS_DELTA)
+#else
+#define ZS_SIZE_CLASSES	(ZS_PAGE_SIZE_CLASSES)
+#endif
 
 /*
  * Pages are distinguished by the ratio of used memory (that is the ratio
@@ -179,7 +203,8 @@ struct zs_size_stat {
 static struct dentry *zs_stat_root;
 #endif
 
-static size_t huge_class_size;
+/* huge_class_size[0] for page, huge_class_size[1] for multiple pages. */
+static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
 
 struct size_class {
 	struct list_head fullness_list[NR_FULLNESS_GROUPS];
@@ -255,6 +280,29 @@ struct zspage {
 	rwlock_t lock;
 };
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+static inline unsigned int class_size_to_zs_order(unsigned long size)
+{
+	unsigned int order = 0;
+
+	/* used large order to alloc page for zspage when class_size > PAGE_SIZE */
+	if (size > PAGE_SIZE)
+		return ZSMALLOC_MULTI_PAGES_ORDER;
+
+	return order;
+}
+#else
+static inline unsigned int class_size_to_zs_order(unsigned long size)
+{
+	return 0;
+}
+#endif
+
+static inline unsigned long class_size_to_zs_size(unsigned long size)
+{
+	return PAGE_SIZE * (1 << class_size_to_zs_order(size));
+}
+
 struct mapping_area {
 	local_lock_t lock;
 	char *vm_buf; /* copy buffer for objects that span pages */
@@ -487,11 +535,22 @@ static int get_size_class_index(int size)
 {
 	int idx = 0;
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+	if (size > PAGE_SIZE) {
+		idx = ZS_PAGE_SIZE_CLASSES;
+		idx += DIV_ROUND_UP(size - PAGE_SIZE,
+				ZS_MULTI_PAGES_SIZE_CLASS_DELTA);
+
+		return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+	}
+#endif
+
+	idx = 0;
 	if (likely(size > ZS_MIN_ALLOC_SIZE))
-		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
-				ZS_SIZE_CLASS_DELTA);
+		idx += DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
+				ZS_PAGE_SIZE_CLASS_DELTA);
 
-	return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+	return  min_t(int, ZS_PAGE_SIZE_CLASSES - 1, idx);
 }
 
 static inline void class_stat_inc(struct size_class *class,
@@ -541,22 +600,19 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 	unsigned long total_freeable = 0;
 	unsigned long inuse_totals[NR_FULLNESS_GROUPS] = {0, };
 
-	seq_printf(s, " %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %8s\n",
-			"class", "size", "10%", "20%", "30%", "40%",
+	seq_printf(s, " %5s %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %16s %8s\n",
+			"class", "size", "order", "10%", "20%", "30%", "40%",
 			"50%", "60%", "70%", "80%", "90%", "99%", "100%",
 			"obj_allocated", "obj_used", "pages_used",
-			"pages_per_zspage", "freeable");
+			"pages_per_zspage", "objs_per_zspage", "freeable");
 
 	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
-
 		class = pool->size_class[i];
-
 		if (class->index != i)
 			continue;
 
 		spin_lock(&pool->lock);
-
-		seq_printf(s, " %5u %5u ", i, class->size);
+		seq_printf(s, " %5u %5u %5u", i, class->size, class_size_to_zs_order(class->size));
 		for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
 			inuse_totals[fg] += zs_stat_get(class, fg);
 			seq_printf(s, "%9lu ", zs_stat_get(class, fg));
@@ -571,9 +627,9 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 		pages_used = obj_allocated / objs_per_zspage *
 				class->pages_per_zspage;
 
-		seq_printf(s, "%13lu %10lu %10lu %16d %8lu\n",
+		seq_printf(s, "%13lu %10lu %10lu %16d %16d %8lu\n",
 			   obj_allocated, obj_used, pages_used,
-			   class->pages_per_zspage, freeable);
+			   class->pages_per_zspage, objs_per_zspage, freeable);
 
 		total_objs += obj_allocated;
 		total_used_objs += obj_used;
@@ -840,7 +896,8 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 	cache_free_zspage(pool, zspage);
 
 	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
-	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
+	atomic_long_sub(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
+			&pool->pages_allocated);
 }
 
 static void free_zspage(struct zs_pool *pool, struct size_class *class,
@@ -869,6 +926,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 	unsigned int freeobj = 1;
 	unsigned long off = 0;
 	struct page *page = get_first_page(zspage);
+	unsigned long page_size = class_size_to_zs_size(class->size);
 
 	while (page) {
 		struct page *next_page;
@@ -880,7 +938,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 		vaddr = kmap_atomic(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
-		while ((off += class->size) < PAGE_SIZE) {
+		while ((off += class->size) < page_size) {
 			link->next = freeobj++ << OBJ_TAG_BITS;
 			link += class->size / sizeof(*link);
 		}
@@ -902,7 +960,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 		}
 		kunmap_atomic(vaddr);
 		page = next_page;
-		off %= PAGE_SIZE;
+		off %= page_size;
 	}
 
 	set_freeobj(zspage, 0);
@@ -952,6 +1010,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
 	struct zspage *zspage = cache_alloc_zspage(pool, gfp);
 
+	unsigned int order = class_size_to_zs_order(class->size);
+
 	if (!zspage)
 		return NULL;
 
@@ -961,11 +1021,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct page *page;
 
-		page = alloc_page(gfp);
+		page = alloc_pages(gfp | __GFP_COMP, order);
 		if (!page) {
 			while (--i >= 0) {
 				dec_zone_page_state(pages[i], NR_ZSPAGES);
-				__free_page(pages[i]);
+				__free_pages(pages[i], order);
 			}
 			cache_free_zspage(pool, zspage);
 			return NULL;
@@ -1024,6 +1084,7 @@ static void *__zs_map_object(struct mapping_area *area,
 	int sizes[2];
 	void *addr;
 	char *buf = area->vm_buf;
+	unsigned long page_size = class_size_to_zs_size(size);
 
 	/* disable page faults to match kmap_atomic() return conditions */
 	pagefault_disable();
@@ -1032,7 +1093,7 @@ static void *__zs_map_object(struct mapping_area *area,
 	if (area->vm_mm == ZS_MM_WO)
 		goto out;
 
-	sizes[0] = PAGE_SIZE - off;
+	sizes[0] = page_size - off;
 	sizes[1] = size - sizes[0];
 
 	/* copy object to per-cpu buffer */
@@ -1052,6 +1113,7 @@ static void __zs_unmap_object(struct mapping_area *area,
 	int sizes[2];
 	void *addr;
 	char *buf;
+	unsigned long page_size = class_size_to_zs_size(size);
 
 	/* no write fastpath */
 	if (area->vm_mm == ZS_MM_RO)
@@ -1062,7 +1124,7 @@ static void __zs_unmap_object(struct mapping_area *area,
 	size -= ZS_HANDLE_SIZE;
 	off += ZS_HANDLE_SIZE;
 
-	sizes[0] = PAGE_SIZE - off;
+	sizes[0] = page_size - off;
 	sizes[1] = size - sizes[0];
 
 	/* copy per-cpu buffer to object */
@@ -1169,6 +1231,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	struct mapping_area *area;
 	struct page *pages[2];
 	void *ret;
+	unsigned long page_size;
+	unsigned long page_mask;
 
 	/*
 	 * Because we use per-cpu mapping areas shared among the
@@ -1193,12 +1257,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	spin_unlock(&pool->lock);
 
 	class = zspage_class(pool, zspage);
-	off = offset_in_page(class->size * obj_idx);
+	page_size = class_size_to_zs_size(class->size);
+	page_mask = ~(page_size - 1);
+	off = (class->size * obj_idx) & ~page_mask;
 
 	local_lock(&zs_map_area.lock);
 	area = this_cpu_ptr(&zs_map_area);
 	area->vm_mm = mm;
-	if (off + class->size <= PAGE_SIZE) {
+	if (off + class->size <= page_size) {
 		/* this object is contained entirely within a page */
 		area->vm_addr = kmap_atomic(page);
 		ret = area->vm_addr + off;
@@ -1228,15 +1294,20 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 
 	struct size_class *class;
 	struct mapping_area *area;
+	unsigned long page_size;
+	unsigned long page_mask;
 
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &page, &obj_idx);
 	zspage = get_zspage(page);
 	class = zspage_class(pool, zspage);
-	off = offset_in_page(class->size * obj_idx);
+
+	page_size = class_size_to_zs_size(class->size);
+	page_mask = ~(page_size - 1);
+	off = (class->size * obj_idx) & ~page_mask;
 
 	area = this_cpu_ptr(&zs_map_area);
-	if (off + class->size <= PAGE_SIZE)
+	if (off + class->size <= page_size)
 		kunmap_atomic(area->vm_addr);
 	else {
 		struct page *pages[2];
@@ -1266,9 +1337,9 @@ EXPORT_SYMBOL_GPL(zs_unmap_object);
  *
  * Return: the size (in bytes) of the first huge zsmalloc &size_class.
  */
-size_t zs_huge_class_size(struct zs_pool *pool)
+size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type)
 {
-	return huge_class_size;
+	return huge_class_size[type];
 }
 EXPORT_SYMBOL_GPL(zs_huge_class_size);
 
@@ -1283,16 +1354,24 @@ static unsigned long obj_malloc(struct zs_pool *pool,
 	struct page *m_page;
 	unsigned long m_offset;
 	void *vaddr;
+	unsigned long page_size;
+	unsigned long page_mask;
+	unsigned long page_shift;
 
 	class = pool->size_class[zspage->class];
 	handle |= OBJ_ALLOCATED_TAG;
 	obj = get_freeobj(zspage);
 
 	offset = obj * class->size;
-	nr_page = offset >> PAGE_SHIFT;
-	m_offset = offset_in_page(offset);
-	m_page = get_first_page(zspage);
 
+	page_size = class_size_to_zs_size(class->size);
+	page_shift = PAGE_SHIFT + class_size_to_zs_order(class->size);
+	page_mask = ~(page_size - 1);
+
+	nr_page = offset >> page_shift;
+	m_offset = offset & ~page_mask;
+
+	m_page = get_first_page(zspage);
 	for (i = 0; i < nr_page; i++)
 		m_page = get_next_page(m_page);
 
@@ -1360,7 +1439,6 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 	}
 
 	spin_unlock(&pool->lock);
-
 	zspage = alloc_zspage(pool, class, gfp);
 	if (!zspage) {
 		cache_free_handle(pool, handle);
@@ -1372,7 +1450,8 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 	newfg = get_fullness_group(class, zspage);
 	insert_zspage(class, zspage, newfg);
 	record_obj(handle, obj);
-	atomic_long_add(class->pages_per_zspage, &pool->pages_allocated);
+	atomic_long_add(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
+			&pool->pages_allocated);
 	class_stat_inc(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
 	class_stat_inc(class, ZS_OBJS_INUSE, 1);
 
@@ -1393,9 +1472,14 @@ static void obj_free(int class_size, unsigned long obj)
 	unsigned long f_offset;
 	unsigned int f_objidx;
 	void *vaddr;
+	unsigned long page_size;
+	unsigned long page_mask;
 
 	obj_to_location(obj, &f_page, &f_objidx);
-	f_offset = offset_in_page(class_size * f_objidx);
+	page_size = class_size_to_zs_size(class_size);
+	page_mask = ~(page_size - 1);
+
+	f_offset = (class_size * f_objidx) & ~page_mask;
 	zspage = get_zspage(f_page);
 
 	vaddr = kmap_atomic(f_page);
@@ -1454,20 +1538,22 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 	void *s_addr, *d_addr;
 	int s_size, d_size, size;
 	int written = 0;
+	unsigned long page_size = class_size_to_zs_size(class->size);
+	unsigned long page_mask =  ~(page_size - 1);
 
 	s_size = d_size = class->size;
 
 	obj_to_location(src, &s_page, &s_objidx);
 	obj_to_location(dst, &d_page, &d_objidx);
 
-	s_off = offset_in_page(class->size * s_objidx);
-	d_off = offset_in_page(class->size * d_objidx);
+	s_off = (class->size * s_objidx) & ~page_mask;
+	d_off = (class->size * d_objidx) & ~page_mask;
 
-	if (s_off + class->size > PAGE_SIZE)
-		s_size = PAGE_SIZE - s_off;
+	if (s_off + class->size > page_size)
+		s_size = page_size - s_off;
 
-	if (d_off + class->size > PAGE_SIZE)
-		d_size = PAGE_SIZE - d_off;
+	if (d_off + class->size > page_size)
+		d_size = page_size - d_off;
 
 	s_addr = kmap_atomic(s_page);
 	d_addr = kmap_atomic(d_page);
@@ -1492,7 +1578,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 		 * kunmap_atomic(d_addr). For more details see
 		 * Documentation/mm/highmem.rst.
 		 */
-		if (s_off >= PAGE_SIZE) {
+		if (s_off >= page_size) {
 			kunmap_atomic(d_addr);
 			kunmap_atomic(s_addr);
 			s_page = get_next_page(s_page);
@@ -1502,7 +1588,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 			s_off = 0;
 		}
 
-		if (d_off >= PAGE_SIZE) {
+		if (d_off >= page_size) {
 			kunmap_atomic(d_addr);
 			d_page = get_next_page(d_page);
 			d_addr = kmap_atomic(d_page);
@@ -1526,11 +1612,12 @@ static unsigned long find_alloced_obj(struct size_class *class,
 	int index = *obj_idx;
 	unsigned long handle = 0;
 	void *addr = kmap_atomic(page);
+	unsigned long page_size = class_size_to_zs_size(class->size);
 
 	offset = get_first_obj_offset(page);
 	offset += class->size * index;
 
-	while (offset < PAGE_SIZE) {
+	while (offset < page_size) {
 		if (obj_allocated(page, addr + offset, &handle))
 			break;
 
@@ -1751,6 +1838,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 	unsigned long handle;
 	unsigned long old_obj, new_obj;
 	unsigned int obj_idx;
+	unsigned int page_size = PAGE_SIZE;
 
 	/*
 	 * We cannot support the _NO_COPY case here, because copy needs to
@@ -1772,6 +1860,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 	 */
 	spin_lock(&pool->lock);
 	class = zspage_class(pool, zspage);
+	page_size = class_size_to_zs_size(class->size);
 
 	/* the migrate_write_lock protects zpage access via zs_map_object */
 	migrate_write_lock(zspage);
@@ -1783,10 +1872,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 	 * Here, any user cannot access all objects in the zspage so let's move.
 	 */
 	d_addr = kmap_atomic(newpage);
-	copy_page(d_addr, s_addr);
+	memcpy(d_addr, s_addr, page_size);
 	kunmap_atomic(d_addr);
 
-	for (addr = s_addr + offset; addr < s_addr + PAGE_SIZE;
+	for (addr = s_addr + offset; addr < s_addr + page_size;
 					addr += class->size) {
 		if (obj_allocated(page, addr, &handle)) {
 
@@ -2066,6 +2155,7 @@ static int calculate_zspage_chain_size(int class_size)
 {
 	int i, min_waste = INT_MAX;
 	int chain_size = 1;
+	unsigned long page_size = class_size_to_zs_size(class_size);
 
 	if (is_power_of_2(class_size))
 		return chain_size;
@@ -2073,7 +2163,7 @@ static int calculate_zspage_chain_size(int class_size)
 	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
 		int waste;
 
-		waste = (i * PAGE_SIZE) % class_size;
+		waste = (i * page_size) % class_size;
 		if (waste < min_waste) {
 			min_waste = waste;
 			chain_size = i;
@@ -2098,6 +2188,8 @@ struct zs_pool *zs_create_pool(const char *name)
 	int i;
 	struct zs_pool *pool;
 	struct size_class *prev_class = NULL;
+	int idx = ZSMALLOC_TYPE_BASEPAGE;
+	int order = 0;
 
 	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
 	if (!pool)
@@ -2119,18 +2211,31 @@ struct zs_pool *zs_create_pool(const char *name)
 	 * for merging should be larger or equal to current size.
 	 */
 	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
-		int size;
+		unsigned int size = 0;
 		int pages_per_zspage;
 		int objs_per_zspage;
 		struct size_class *class;
 		int fullness;
 
-		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
+		if (i < ZS_PAGE_SIZE_CLASSES)
+			size = ZS_MIN_ALLOC_SIZE + i * ZS_PAGE_SIZE_CLASS_DELTA;
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+		if (i >= ZS_PAGE_SIZE_CLASSES)
+			size = PAGE_SIZE + (i - ZS_PAGE_SIZE_CLASSES) *
+					   ZS_MULTI_PAGES_SIZE_CLASS_DELTA;
+#endif
+
 		if (size > ZS_MAX_ALLOC_SIZE)
 			size = ZS_MAX_ALLOC_SIZE;
-		pages_per_zspage = calculate_zspage_chain_size(size);
-		objs_per_zspage = pages_per_zspage * PAGE_SIZE / size;
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+		order = class_size_to_zs_order(size);
+		if (order == ZSMALLOC_MULTI_PAGES_ORDER)
+			idx = ZSMALLOC_TYPE_MULTI_PAGES;
+#endif
+
+		pages_per_zspage = calculate_zspage_chain_size(size);
+		objs_per_zspage = pages_per_zspage * PAGE_SIZE * (1 << order) / size;
 		/*
 		 * We iterate from biggest down to smallest classes,
 		 * so huge_class_size holds the size of the first huge
@@ -2138,8 +2243,8 @@ struct zs_pool *zs_create_pool(const char *name)
 		 * endup in the huge class.
 		 */
 		if (pages_per_zspage != 1 && objs_per_zspage != 1 &&
-				!huge_class_size) {
-			huge_class_size = size;
+				!huge_class_size[idx]) {
+			huge_class_size[idx] = size;
 			/*
 			 * The object uses ZS_HANDLE_SIZE bytes to store the
 			 * handle. We need to subtract it, because zs_malloc()
@@ -2149,7 +2254,7 @@ struct zs_pool *zs_create_pool(const char *name)
 			 * class because it grows by ZS_HANDLE_SIZE extra bytes
 			 * right before class lookup.
 			 */
-			huge_class_size -= (ZS_HANDLE_SIZE - 1);
+			huge_class_size[idx] -= (ZS_HANDLE_SIZE - 1);
 		}
 
 		/*
-- 
2.34.1



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 1/2] mm: zsmalloc: support objects compressed based on multiple pages
  2024-03-27 21:48 ` [PATCH RFC 1/2] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
@ 2024-10-21 23:26   ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2024-10-21 23:26 UTC (permalink / raw)
  To: 21cnbao
  Cc: akpm, axboe, chrisl, corbet, david, hannes, kanchana.p.sridhar,
	kasong, linux-block, linux-mm, minchan, nphamcs, senozhatsky,
	surenb, terrelln, v-songbaohua, wajdi.k.feghali, willy,
	ying.huang, yosryahmed, yuzhao, zhengtangquan, zhouchengming,
	bala.seshasayee

> From: Tangquan Zheng <zhengtangquan@oppo.com>
> 
> This patch introduces support for zsmalloc to store compressed objects based
> on multi-pages. Previously, a large folio with nr_pages subpages would undergo
> compression one by one, each at the granularity of PAGE_SIZE. However, by
> compressing them at a larger granularity, we can conserve both memory and
> CPU resources.
> 
> We define the granularity using a configuration option called
> ZSMALLOC_MULTI_PAGES_ORDER, with a default value of 4. Consequently, a
> large folio with 32 subpages will now be divided into 2 parts rather
> than 32 parts.
> 
> The introduction of the multi-pages feature necessitates the creation
> of new size classes to accommodate it.
> 
> Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---

Since some people are using our patches and occasionally encountering
crashes (reported to me privately, not on the mailing list), I'm sharing
these fixes now while we finalize v2, which will be sent shortly:

>  include/linux/zsmalloc.h |  10 +-
>  mm/Kconfig               |  18 ++++
>  mm/zsmalloc.c            | 215 +++++++++++++++++++++++++++++----------
>  3 files changed, 187 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index a48cd0ffe57d..9fa3e7669557 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -33,6 +33,14 @@ enum zs_mapmode {
>  	 */
>  };
>  
> +enum zsmalloc_type {
> +	ZSMALLOC_TYPE_BASEPAGE,
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +	ZSMALLOC_TYPE_MULTI_PAGES,
> +#endif
> +	ZSMALLOC_TYPE_MAX,
> +};
> +
>  struct zs_pool_stats {
>  	/* How many pages were migrated (freed) */
>  	atomic_long_t pages_compacted;
> @@ -46,7 +54,7 @@ void zs_destroy_pool(struct zs_pool *pool);
>  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
>  void zs_free(struct zs_pool *pool, unsigned long obj);
>  
> -size_t zs_huge_class_size(struct zs_pool *pool);
> +size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type);
>  
>  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  			enum zs_mapmode mm);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b1448aa81e15..cedb07094e8e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -224,6 +224,24 @@ config ZSMALLOC_CHAIN_SIZE
>  
>  	  For more information, see zsmalloc documentation.
>  
> +config ZSMALLOC_MULTI_PAGES
> +	bool "support zsmalloc multiple pages"
> +	depends on ZSMALLOC && !CONFIG_HIGHMEM
> +	help
> +	  This option configures zsmalloc to support allocations larger than
> +	  PAGE_SIZE, enabling compression across multiple pages. The size of
> +	  these multiple pages is determined by the configured
> +	  ZSMALLOC_MULTI_PAGES_ORDER.
> +
> +config ZSMALLOC_MULTI_PAGES_ORDER
> +	int "zsmalloc multiple pages order"
> +	default 4
> +	range 1 9
> +	depends on ZSMALLOC_MULTI_PAGES
> +	help
> +	  This option is used to configure zsmalloc to support the compression
> +	  of multiple pages.
> +
>  menu "Slab allocator options"
>  
>  config SLUB
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index b42d3545ca85..8658421cee11 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -65,6 +65,12 @@
>  
>  #define ZSPAGE_MAGIC	0x58
>  
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +#define ZSMALLOC_MULTI_PAGES_ORDER	(_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
> +#define ZSMALLOC_MULTI_PAGES_NR		(1 << ZSMALLOC_MULTI_PAGES_ORDER)
> +#define ZSMALLOC_MULTI_PAGES_SIZE	(PAGE_SIZE * ZSMALLOC_MULTI_PAGES_NR)
> +#endif
> +
>  /*
>   * This must be power of 2 and greater than or equal to sizeof(link_free).
>   * These two conditions ensure that any 'struct link_free' itself doesn't
> @@ -115,7 +121,8 @@
>  
>  #define HUGE_BITS	1
>  #define FULLNESS_BITS	4
> -#define CLASS_BITS	8
> +#define CLASS_BITS	9
> +#define ISOLATED_BITS	5
>  #define MAGIC_VAL_BITS	8
>  
>  #define MAX(a, b) ((a) >= (b) ? (a) : (b))
> @@ -126,7 +133,11 @@
>  #define ZS_MIN_ALLOC_SIZE \
>  	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
>  /* each chunk includes extra space to keep handle */
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +#define ZS_MAX_ALLOC_SIZE	(ZSMALLOC_MULTI_PAGES_SIZE)
> +#else
>  #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
> +#endif
>  
>  /*
>   * On systems with 4K page size, this gives 255 size classes! There is a
> @@ -141,9 +152,22 @@
>   *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
>   *  (reason above)
>   */
> -#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
> -#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
> -				      ZS_SIZE_CLASS_DELTA) + 1)
> +
> +#define ZS_PAGE_SIZE_CLASS_DELTA	(PAGE_SIZE >> (CLASS_BITS - 1))
> +#define ZS_PAGE_SIZE_CLASSES	(DIV_ROUND_UP(PAGE_SIZE - ZS_MIN_ALLOC_SIZE, \
> +				      ZS_PAGE_SIZE_CLASS_DELTA) + 1)
> +
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +#define ZS_MULTI_PAGES_SIZE_CLASS_DELTA	(ZSMALLOC_MULTI_PAGES_SIZE >> (CLASS_BITS - 1))
> +#define ZS_MULTI_PAGES_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - PAGE_SIZE, \
> +				      ZS_MULTI_PAGES_SIZE_CLASS_DELTA) + 1)
> +#endif
> +
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +#define ZS_SIZE_CLASSES	(ZS_PAGE_SIZE_CLASSES + ZS_MULTI_PAGES_SIZE_CLASS_DELTA)
> +#else
> +#define ZS_SIZE_CLASSES	(ZS_PAGE_SIZE_CLASSES)
> +#endif
>  
>  /*
>   * Pages are distinguished by the ratio of used memory (that is the ratio
> @@ -179,7 +203,8 @@ struct zs_size_stat {
>  static struct dentry *zs_stat_root;
>  #endif
>  
> -static size_t huge_class_size;
> +/* huge_class_size[0] for page, huge_class_size[1] for multiple pages. */
> +static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
>  
>  struct size_class {
>  	struct list_head fullness_list[NR_FULLNESS_GROUPS];
> @@ -255,6 +280,29 @@ struct zspage {
>  	rwlock_t lock;
>  };
>  
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +static inline unsigned int class_size_to_zs_order(unsigned long size)
> +{
> +	unsigned int order = 0;
> +
> +	/* used large order to alloc page for zspage when class_size > PAGE_SIZE */
> +	if (size > PAGE_SIZE)
> +		return ZSMALLOC_MULTI_PAGES_ORDER;
> +
> +	return order;
> +}
> +#else
> +static inline unsigned int class_size_to_zs_order(unsigned long size)
> +{
> +	return 0;
> +}
> +#endif
> +
> +static inline unsigned long class_size_to_zs_size(unsigned long size)
> +{
> +	return PAGE_SIZE * (1 << class_size_to_zs_order(size));
> +}
> +
>  struct mapping_area {
>  	local_lock_t lock;
>  	char *vm_buf; /* copy buffer for objects that span pages */
> @@ -487,11 +535,22 @@ static int get_size_class_index(int size)
>  {
>  	int idx = 0;
>  
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +	if (size > PAGE_SIZE) {

should be:

if (size > PAGE_SIZE + ZS_HANDLE_SIZE) {

> +		idx = ZS_PAGE_SIZE_CLASSES;
> +		idx += DIV_ROUND_UP(size - PAGE_SIZE,
> +				ZS_MULTI_PAGES_SIZE_CLASS_DELTA);
> +
> +		return min_t(int, ZS_SIZE_CLASSES - 1, idx);
> +	}
> +#endif
> +
> +	idx = 0;
>  	if (likely(size > ZS_MIN_ALLOC_SIZE))
> -		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
> -				ZS_SIZE_CLASS_DELTA);
> +		idx += DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
> +				ZS_PAGE_SIZE_CLASS_DELTA);
>  
> -	return min_t(int, ZS_SIZE_CLASSES - 1, idx);
> +	return  min_t(int, ZS_PAGE_SIZE_CLASSES - 1, idx);
>  }
>  
>  static inline void class_stat_inc(struct size_class *class,
> @@ -541,22 +600,19 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
>  	unsigned long total_freeable = 0;
>  	unsigned long inuse_totals[NR_FULLNESS_GROUPS] = {0, };
>  
> -	seq_printf(s, " %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %8s\n",
> -			"class", "size", "10%", "20%", "30%", "40%",
> +	seq_printf(s, " %5s %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %16s %8s\n",
> +			"class", "size", "order", "10%", "20%", "30%", "40%",
>  			"50%", "60%", "70%", "80%", "90%", "99%", "100%",
>  			"obj_allocated", "obj_used", "pages_used",
> -			"pages_per_zspage", "freeable");
> +			"pages_per_zspage", "objs_per_zspage", "freeable");
>  
>  	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
> -
>  		class = pool->size_class[i];
> -
>  		if (class->index != i)
>  			continue;
>  
>  		spin_lock(&pool->lock);
> -
> -		seq_printf(s, " %5u %5u ", i, class->size);
> +		seq_printf(s, " %5u %5u %5u", i, class->size, class_size_to_zs_order(class->size));
>  		for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
>  			inuse_totals[fg] += zs_stat_get(class, fg);
>  			seq_printf(s, "%9lu ", zs_stat_get(class, fg));
> @@ -571,9 +627,9 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
>  		pages_used = obj_allocated / objs_per_zspage *
>  				class->pages_per_zspage;
>  
> -		seq_printf(s, "%13lu %10lu %10lu %16d %8lu\n",
> +		seq_printf(s, "%13lu %10lu %10lu %16d %16d %8lu\n",
>  			   obj_allocated, obj_used, pages_used,
> -			   class->pages_per_zspage, freeable);
> +			   class->pages_per_zspage, objs_per_zspage, freeable);
>  
>  		total_objs += obj_allocated;
>  		total_used_objs += obj_used;
> @@ -840,7 +896,8 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>  	cache_free_zspage(pool, zspage);
>  
>  	class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
> -	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
> +	atomic_long_sub(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
> +			&pool->pages_allocated);
>  }
>  
>  static void free_zspage(struct zs_pool *pool, struct size_class *class,
> @@ -869,6 +926,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>  	unsigned int freeobj = 1;
>  	unsigned long off = 0;
>  	struct page *page = get_first_page(zspage);
> +	unsigned long page_size = class_size_to_zs_size(class->size);
>  
>  	while (page) {
>  		struct page *next_page;
> @@ -880,7 +938,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>  		vaddr = kmap_atomic(page);
>  		link = (struct link_free *)vaddr + off / sizeof(*link);
>  
> -		while ((off += class->size) < PAGE_SIZE) {
> +		while ((off += class->size) < page_size) {
>  			link->next = freeobj++ << OBJ_TAG_BITS;
>  			link += class->size / sizeof(*link);
>  		}
> @@ -902,7 +960,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>  		}
>  		kunmap_atomic(vaddr);
>  		page = next_page;
> -		off %= PAGE_SIZE;
> +		off %= page_size;
>  	}
>  
>  	set_freeobj(zspage, 0);
> @@ -952,6 +1010,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>  	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
>  	struct zspage *zspage = cache_alloc_zspage(pool, gfp);
>  
> +	unsigned int order = class_size_to_zs_order(class->size);
> +
>  	if (!zspage)
>  		return NULL;
>  
> @@ -961,11 +1021,11 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>  	for (i = 0; i < class->pages_per_zspage; i++) {
>  		struct page *page;
>  
> -		page = alloc_page(gfp);

	we don't support movement for multi-pages in zsmalloc yet:

+		if(order > 0){
+			gfp |= __GFP_COMP;
+			gfp &= ~__GFP_MOVABLE;
+		}

> +		page = alloc_pages(gfp | __GFP_COMP, order);
>  		if (!page) {
>  			while (--i >= 0) {
>  				dec_zone_page_state(pages[i], NR_ZSPAGES);
> -				__free_page(pages[i]);
> +				__free_pages(pages[i], order);
>  			}
>  			cache_free_zspage(pool, zspage);
>  			return NULL;

also need the below:
@@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 	create_page_chain(class, zspage, pages);
 	init_zspage(class, zspage);
 	zspage->pool = pool;
+	zspage->class = class->index;
 
 	return zspage;
 }

> @@ -1024,6 +1084,7 @@ static void *__zs_map_object(struct mapping_area *area,
>  	int sizes[2];
>  	void *addr;
>  	char *buf = area->vm_buf;
> +	unsigned long page_size = class_size_to_zs_size(size);
>  
>  	/* disable page faults to match kmap_atomic() return conditions */
>  	pagefault_disable();
> @@ -1032,7 +1093,7 @@ static void *__zs_map_object(struct mapping_area *area,
>  	if (area->vm_mm == ZS_MM_WO)
>  		goto out;
>  
> -	sizes[0] = PAGE_SIZE - off;
> +	sizes[0] = page_size - off;
>  	sizes[1] = size - sizes[0];
>  
>  	/* copy object to per-cpu buffer */
> @@ -1052,6 +1113,7 @@ static void __zs_unmap_object(struct mapping_area *area,
>  	int sizes[2];
>  	void *addr;
>  	char *buf;
> +	unsigned long page_size = class_size_to_zs_size(size);
>  
>  	/* no write fastpath */
>  	if (area->vm_mm == ZS_MM_RO)
> @@ -1062,7 +1124,7 @@ static void __zs_unmap_object(struct mapping_area *area,
>  	size -= ZS_HANDLE_SIZE;
>  	off += ZS_HANDLE_SIZE;
>  
> -	sizes[0] = PAGE_SIZE - off;
> +	sizes[0] = page_size - off;
>  	sizes[1] = size - sizes[0];
>  
>  	/* copy per-cpu buffer to object */
> @@ -1169,6 +1231,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  	struct mapping_area *area;
>  	struct page *pages[2];
>  	void *ret;
> +	unsigned long page_size;
> +	unsigned long page_mask;
>  
>  	/*
>  	 * Because we use per-cpu mapping areas shared among the
> @@ -1193,12 +1257,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  	spin_unlock(&pool->lock);
>  
>  	class = zspage_class(pool, zspage);
> -	off = offset_in_page(class->size * obj_idx);
> +	page_size = class_size_to_zs_size(class->size);
> +	page_mask = ~(page_size - 1);
> +	off = (class->size * obj_idx) & ~page_mask;
>  
>  	local_lock(&zs_map_area.lock);
>  	area = this_cpu_ptr(&zs_map_area);
>  	area->vm_mm = mm;
> -	if (off + class->size <= PAGE_SIZE) {
> +	if (off + class->size <= page_size) {
>  		/* this object is contained entirely within a page */
>  		area->vm_addr = kmap_atomic(page);
>  		ret = area->vm_addr + off;
> @@ -1228,15 +1294,20 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
>  
>  	struct size_class *class;
>  	struct mapping_area *area;
> +	unsigned long page_size;
> +	unsigned long page_mask;
>  
>  	obj = handle_to_obj(handle);
>  	obj_to_location(obj, &page, &obj_idx);
>  	zspage = get_zspage(page);
>  	class = zspage_class(pool, zspage);
> -	off = offset_in_page(class->size * obj_idx);
> +
> +	page_size = class_size_to_zs_size(class->size);
> +	page_mask = ~(page_size - 1);
> +	off = (class->size * obj_idx) & ~page_mask;
>  
>  	area = this_cpu_ptr(&zs_map_area);
> -	if (off + class->size <= PAGE_SIZE)
> +	if (off + class->size <= page_size)
>  		kunmap_atomic(area->vm_addr);
>  	else {
>  		struct page *pages[2];
> @@ -1266,9 +1337,9 @@ EXPORT_SYMBOL_GPL(zs_unmap_object);
>   *
>   * Return: the size (in bytes) of the first huge zsmalloc &size_class.
>   */
> -size_t zs_huge_class_size(struct zs_pool *pool)
> +size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type)
>  {
> -	return huge_class_size;
> +	return huge_class_size[type];
>  }
>  EXPORT_SYMBOL_GPL(zs_huge_class_size);
>  
> @@ -1283,16 +1354,24 @@ static unsigned long obj_malloc(struct zs_pool *pool,
>  	struct page *m_page;
>  	unsigned long m_offset;
>  	void *vaddr;
> +	unsigned long page_size;
> +	unsigned long page_mask;
> +	unsigned long page_shift;
>  
>  	class = pool->size_class[zspage->class];
>  	handle |= OBJ_ALLOCATED_TAG;
>  	obj = get_freeobj(zspage);
>  
>  	offset = obj * class->size;
> -	nr_page = offset >> PAGE_SHIFT;
> -	m_offset = offset_in_page(offset);
> -	m_page = get_first_page(zspage);
>  
> +	page_size = class_size_to_zs_size(class->size);
> +	page_shift = PAGE_SHIFT + class_size_to_zs_order(class->size);
> +	page_mask = ~(page_size - 1);
> +
> +	nr_page = offset >> page_shift;
> +	m_offset = offset & ~page_mask;
> +
> +	m_page = get_first_page(zspage);
>  	for (i = 0; i < nr_page; i++)
>  		m_page = get_next_page(m_page);
>  
> @@ -1360,7 +1439,6 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>  	}
>  
>  	spin_unlock(&pool->lock);
> -
>  	zspage = alloc_zspage(pool, class, gfp);
>  	if (!zspage) {
>  		cache_free_handle(pool, handle);
> @@ -1372,7 +1450,8 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>  	newfg = get_fullness_group(class, zspage);
>  	insert_zspage(class, zspage, newfg);

	need the below:
	set_zspage_mapping(zspage, class->index, newfg);

>  	record_obj(handle, obj);
> -	atomic_long_add(class->pages_per_zspage, &pool->pages_allocated);
> +	atomic_long_add(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
> +			&pool->pages_allocated);
>  	class_stat_inc(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
>  	class_stat_inc(class, ZS_OBJS_INUSE, 1);
> 
	need the below:
	/* We completely set up zspage so mark them as movable */
-	SetZsPageMovable(pool, zspage);
+	if (0 == class_size_to_zs_order(class->size))
+		SetZsPageMovable(pool, zspage);
 out:
 	spin_unlock(&pool->lock);
 
> @@ -1393,9 +1472,14 @@ static void obj_free(int class_size, unsigned long obj)
>  	unsigned long f_offset;
>  	unsigned int f_objidx;
>  	void *vaddr;
> +	unsigned long page_size;
> +	unsigned long page_mask;
>  
>  	obj_to_location(obj, &f_page, &f_objidx);
> -	f_offset = offset_in_page(class_size * f_objidx);
> +	page_size = class_size_to_zs_size(class_size);
> +	page_mask = ~(page_size - 1);
> +
> +	f_offset = (class_size * f_objidx) & ~page_mask;
>  	zspage = get_zspage(f_page);
>  
>  	vaddr = kmap_atomic(f_page);
> @@ -1454,20 +1538,22 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
>  	void *s_addr, *d_addr;
>  	int s_size, d_size, size;
>  	int written = 0;
> +	unsigned long page_size = class_size_to_zs_size(class->size);
> +	unsigned long page_mask =  ~(page_size - 1);
>  
>  	s_size = d_size = class->size;
>  
>  	obj_to_location(src, &s_page, &s_objidx);
>  	obj_to_location(dst, &d_page, &d_objidx);
>  
> -	s_off = offset_in_page(class->size * s_objidx);
> -	d_off = offset_in_page(class->size * d_objidx);
> +	s_off = (class->size * s_objidx) & ~page_mask;
> +	d_off = (class->size * d_objidx) & ~page_mask;
>  
> -	if (s_off + class->size > PAGE_SIZE)
> -		s_size = PAGE_SIZE - s_off;
> +	if (s_off + class->size > page_size)
> +		s_size = page_size - s_off;
>  
> -	if (d_off + class->size > PAGE_SIZE)
> -		d_size = PAGE_SIZE - d_off;
> +	if (d_off + class->size > page_size)
> +		d_size = page_size - d_off;
>  
>  	s_addr = kmap_atomic(s_page);
>  	d_addr = kmap_atomic(d_page);
> @@ -1492,7 +1578,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
>  		 * kunmap_atomic(d_addr). For more details see
>  		 * Documentation/mm/highmem.rst.
>  		 */
> -		if (s_off >= PAGE_SIZE) {
> +		if (s_off >= page_size) {
>  			kunmap_atomic(d_addr);
>  			kunmap_atomic(s_addr);
>  			s_page = get_next_page(s_page);
> @@ -1502,7 +1588,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
>  			s_off = 0;
>  		}
>  
> -		if (d_off >= PAGE_SIZE) {
> +		if (d_off >= page_size) {
>  			kunmap_atomic(d_addr);
>  			d_page = get_next_page(d_page);
>  			d_addr = kmap_atomic(d_page);
> @@ -1526,11 +1612,12 @@ static unsigned long find_alloced_obj(struct size_class *class,
>  	int index = *obj_idx;
>  	unsigned long handle = 0;
>  	void *addr = kmap_atomic(page);
> +	unsigned long page_size = class_size_to_zs_size(class->size);
>  
>  	offset = get_first_obj_offset(page);
>  	offset += class->size * index;
>  
> -	while (offset < PAGE_SIZE) {
> +	while (offset < page_size) {
>  		if (obj_allocated(page, addr + offset, &handle))
>  			break;
>  
> @@ -1751,6 +1838,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
>  	unsigned long handle;
>  	unsigned long old_obj, new_obj;
>  	unsigned int obj_idx;
> +	unsigned int page_size = PAGE_SIZE;
>  
>  	/*
>  	 * We cannot support the _NO_COPY case here, because copy needs to
> @@ -1772,6 +1860,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
>  	 */
>  	spin_lock(&pool->lock);
>  	class = zspage_class(pool, zspage);
> +	page_size = class_size_to_zs_size(class->size);
>  
>  	/* the migrate_write_lock protects zpage access via zs_map_object */
>  	migrate_write_lock(zspage);
> @@ -1783,10 +1872,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
>  	 * Here, any user cannot access all objects in the zspage so let's move.
>  	 */
>  	d_addr = kmap_atomic(newpage);
> -	copy_page(d_addr, s_addr);
> +	memcpy(d_addr, s_addr, page_size);
>  	kunmap_atomic(d_addr);
>  
> -	for (addr = s_addr + offset; addr < s_addr + PAGE_SIZE;
> +	for (addr = s_addr + offset; addr < s_addr + page_size;
>  					addr += class->size) {
>  		if (obj_allocated(page, addr, &handle)) {
>  
> @@ -2066,6 +2155,7 @@ static int calculate_zspage_chain_size(int class_size)
>  {
>  	int i, min_waste = INT_MAX;
>  	int chain_size = 1;
> +	unsigned long page_size = class_size_to_zs_size(class_size);
>  
>  	if (is_power_of_2(class_size))
>  		return chain_size;
> @@ -2073,7 +2163,7 @@ static int calculate_zspage_chain_size(int class_size)
>  	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
>  		int waste;
>  
> -		waste = (i * PAGE_SIZE) % class_size;
> +		waste = (i * page_size) % class_size;
>  		if (waste < min_waste) {
>  			min_waste = waste;
>  			chain_size = i;
> @@ -2098,6 +2188,8 @@ struct zs_pool *zs_create_pool(const char *name)
>  	int i;
>  	struct zs_pool *pool;
>  	struct size_class *prev_class = NULL;

drop the below two lines:

> +	int idx = ZSMALLOC_TYPE_BASEPAGE;
> +	int order = 0;
> 

>  	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>  	if (!pool)
> @@ -2119,18 +2211,31 @@ struct zs_pool *zs_create_pool(const char *name)
>  	 * for merging should be larger or equal to current size.
>  	 */
>  	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> -		int size;
> +		unsigned int size = 0;
>  		int pages_per_zspage;
>  		int objs_per_zspage;
>  		struct size_class *class;
>  		int fullness;
>

move to here:
+		int order = 0;
+		int idx = ZSMALLOC_TYPE_BASEPAGE;
+

> -		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
> +		if (i < ZS_PAGE_SIZE_CLASSES)
> +			size = ZS_MIN_ALLOC_SIZE + i * ZS_PAGE_SIZE_CLASS_DELTA;
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +		if (i >= ZS_PAGE_SIZE_CLASSES)
> +			size = PAGE_SIZE + (i - ZS_PAGE_SIZE_CLASSES) *
> +					   ZS_MULTI_PAGES_SIZE_CLASS_DELTA;
> +#endif
> +
>  		if (size > ZS_MAX_ALLOC_SIZE)
>  			size = ZS_MAX_ALLOC_SIZE;
> -		pages_per_zspage = calculate_zspage_chain_size(size);
> -		objs_per_zspage = pages_per_zspage * PAGE_SIZE / size;
>  
> +#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
> +		order = class_size_to_zs_order(size);
> +		if (order == ZSMALLOC_MULTI_PAGES_ORDER)
> +			idx = ZSMALLOC_TYPE_MULTI_PAGES;
> +#endif
> +
> +		pages_per_zspage = calculate_zspage_chain_size(size);
> +		objs_per_zspage = pages_per_zspage * PAGE_SIZE * (1 << order) / size;
>  		/*
>  		 * We iterate from biggest down to smallest classes,
>  		 * so huge_class_size holds the size of the first huge
> @@ -2138,8 +2243,8 @@ struct zs_pool *zs_create_pool(const char *name)
>  		 * endup in the huge class.
>  		 */
>  		if (pages_per_zspage != 1 && objs_per_zspage != 1 &&
> -				!huge_class_size) {
> -			huge_class_size = size;
> +				!huge_class_size[idx]) {
> +			huge_class_size[idx] = size;
>  			/*
>  			 * The object uses ZS_HANDLE_SIZE bytes to store the
>  			 * handle. We need to subtract it, because zs_malloc()
> @@ -2149,7 +2254,7 @@ struct zs_pool *zs_create_pool(const char *name)
>  			 * class because it grows by ZS_HANDLE_SIZE extra bytes
>  			 * right before class lookup.
>  			 */
> -			huge_class_size -= (ZS_HANDLE_SIZE - 1);
> +			huge_class_size[idx] -= (ZS_HANDLE_SIZE - 1);
>  		}
>  
>  		/*
> -- 
> 2.34.1
>

Thanks
Barry



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-03-27 21:48 [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
  2024-03-27 21:48 ` [PATCH RFC 1/2] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
@ 2024-03-27 21:48 ` Barry Song
  2024-04-11  0:40   ` Sergey Senozhatsky
                     ` (2 more replies)
  2024-03-27 22:01 ` [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
  2 siblings, 3 replies; 18+ messages in thread
From: Barry Song @ 2024-03-27 21:48 UTC (permalink / raw)
  To: akpm, minchan, senozhatsky, linux-block, axboe, linux-mm
  Cc: terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Tangquan Zheng,
	Barry Song

From: Tangquan Zheng <zhengtangquan@oppo.com>

Currently, when a large folio with nr_pages is submitted to zram, it is
divided into nr_pages parts for compression and storage individually.
By transitioning to a higher granularity, we can notably enhance
compression rates while simultaneously reducing CPU consumption.

This patch introduces the capability for large folios to be divided
based on the granularity specified by ZSMALLOC_MULTI_PAGES_ORDER, which
defaults to 4. For instance, large folios smaller than 64KiB will continue
to be compressed at a 4KiB granularity. However, for folios sized at
128KiB, compression will occur in two 64KiB multi-pages.

This modification will notably reduce CPU consumption and enhance
compression ratios. The following data illustrates the time and
compressed data for typical anonymous pages gathered from Android
phones.

granularity   orig_data_size   compr_data_size   time(us)
4KiB-zstd      1048576000       246876055        50259962
64KiB-zstd     1048576000       199763892        18330605

We observe a precisely similar reduction in time required for decompressing
a 64KiB block compared to decompressing 16 * 4KiB blocks.

Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/block/zram/Kconfig    |   9 +
 drivers/block/zram/zcomp.c    |  23 ++-
 drivers/block/zram/zcomp.h    |  12 +-
 drivers/block/zram/zram_drv.c | 372 +++++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h |  21 ++
 5 files changed, 399 insertions(+), 38 deletions(-)

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 7b29cce60ab2..c8b44dd30d0f 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -96,3 +96,12 @@ config ZRAM_MULTI_COMP
 	  re-compress pages using a potentially slower but more effective
 	  compression algorithm. Note, that IDLE page recompression
 	  requires ZRAM_TRACK_ENTRY_ACTIME.
+
+config ZRAM_MULTI_PAGES
+	bool "Enable multiple pages compression and decompression"
+	depends on ZRAM && ZSMALLOC_MULTI_PAGES
+	help
+	  Initially, zram divided large folios into blocks of nr_pages, each sized
+	  equal to PAGE_SIZE, for compression. This option fine-tunes zram to
+	  improve compression granularity by dividing large folios into larger
+	  parts defined by the configuration option ZSMALLOC_MULTI_PAGES_ORDER.
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 8237b08c49d8..ff6df838c066 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -12,7 +12,6 @@
 #include <linux/cpu.h>
 #include <linux/crypto.h>
 #include <linux/vmalloc.h>
-
 #include "zcomp.h"
 
 static const char * const backends[] = {
@@ -50,11 +49,16 @@ static void zcomp_strm_free(struct zcomp_strm *zstrm)
 static int zcomp_strm_init(struct zcomp_strm *zstrm, struct zcomp *comp)
 {
 	zstrm->tfm = crypto_alloc_comp(comp->name, 0, 0);
+	unsigned long page_size = PAGE_SIZE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	page_size = ZCOMP_MULTI_PAGES_SIZE;
+#endif
 	/*
 	 * allocate 2 pages. 1 for compressed data, plus 1 extra for the
 	 * case when compressed size is larger than the original one
 	 */
-	zstrm->buffer = vzalloc(2 * PAGE_SIZE);
+	zstrm->buffer = vzalloc(2 * page_size);
 	if (IS_ERR_OR_NULL(zstrm->tfm) || !zstrm->buffer) {
 		zcomp_strm_free(zstrm);
 		return -ENOMEM;
@@ -115,8 +119,8 @@ void zcomp_stream_put(struct zcomp *comp)
 	local_unlock(&comp->stream->lock);
 }
 
-int zcomp_compress(struct zcomp_strm *zstrm,
-		const void *src, unsigned int *dst_len)
+int zcomp_compress(struct zcomp_strm *zstrm, const void *src, unsigned int src_len,
+		   unsigned int *dst_len)
 {
 	/*
 	 * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
@@ -132,18 +136,17 @@ int zcomp_compress(struct zcomp_strm *zstrm,
 	 * the dst buffer, zram_drv will take care of the fact that
 	 * compressed buffer is too big.
 	 */
-	*dst_len = PAGE_SIZE * 2;
+
+	*dst_len = src_len * 2;
 
 	return crypto_comp_compress(zstrm->tfm,
-			src, PAGE_SIZE,
+			src, src_len,
 			zstrm->buffer, dst_len);
 }
 
-int zcomp_decompress(struct zcomp_strm *zstrm,
-		const void *src, unsigned int src_len, void *dst)
+int zcomp_decompress(struct zcomp_strm *zstrm, const void *src, unsigned int src_len,
+		     void *dst, unsigned int dst_len)
 {
-	unsigned int dst_len = PAGE_SIZE;
-
 	return crypto_comp_decompress(zstrm->tfm,
 			src, src_len,
 			dst, &dst_len);
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index e9fe63da0e9b..6788d1b2c30f 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -7,6 +7,12 @@
 #define _ZCOMP_H_
 #include <linux/local_lock.h>
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define ZCOMP_MULTI_PAGES_ORDER	(_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
+#define ZCOMP_MULTI_PAGES_NR	(1 << ZCOMP_MULTI_PAGES_ORDER)
+#define ZCOMP_MULTI_PAGES_SIZE	(PAGE_SIZE * ZCOMP_MULTI_PAGES_NR)
+#endif
+
 struct zcomp_strm {
 	/* The members ->buffer and ->tfm are protected by ->lock. */
 	local_lock_t lock;
@@ -34,9 +40,9 @@ struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
 void zcomp_stream_put(struct zcomp *comp);
 
 int zcomp_compress(struct zcomp_strm *zstrm,
-		const void *src, unsigned int *dst_len);
+		const void *src, unsigned int src_len, unsigned int *dst_len);
 
 int zcomp_decompress(struct zcomp_strm *zstrm,
-		const void *src, unsigned int src_len, void *dst);
-
+		const void *src, unsigned int src_len, void *dst, unsigned int dst_len);
+bool zcomp_set_max_streams(struct zcomp *comp, int num_strm);
 #endif /* _ZCOMP_H_ */
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f0639df6cd18..0d7b9efd4eb4 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -49,7 +49,7 @@ static unsigned int num_devices = 1;
  * Pages that compress to sizes equals or greater than this are stored
  * uncompressed in memory.
  */
-static size_t huge_class_size;
+static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
 
 static const struct block_device_operations zram_devops;
 
@@ -201,11 +201,11 @@ static inline void zram_fill_page(void *ptr, unsigned long len,
 	memset_l(ptr, value, len / sizeof(unsigned long));
 }
 
-static bool page_same_filled(void *ptr, unsigned long *element)
+static bool page_same_filled(void *ptr, unsigned long *element, unsigned int page_size)
 {
 	unsigned long *page;
 	unsigned long val;
-	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
+	unsigned int pos, last_pos = page_size / sizeof(*page) - 1;
 
 	page = (unsigned long *)ptr;
 	val = page[0];
@@ -1204,13 +1204,40 @@ static ssize_t debug_stat_show(struct device *dev,
 	return ret;
 }
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+static ssize_t multi_pages_debug_stat_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct zram *zram = dev_to_zram(dev);
+	ssize_t ret = 0;
+
+	down_read(&zram->init_lock);
+	ret = scnprintf(buf, PAGE_SIZE,
+			"zram_bio write/read multi_pages count:%8llu %8llu\n"
+			"zram_bio failed write/read multi_pages count%8llu %8llu\n"
+			"zram_bio partial write/read multi_pages count%8llu %8llu\n"
+			"multi_pages_miss_free %8llu\n",
+			(u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_count),
+			(u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_count),
+			(u64)atomic64_read(&zram->stats.multi_pages_failed_writes),
+			(u64)atomic64_read(&zram->stats.multi_pages_failed_reads),
+			(u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_partial_count),
+			(u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_partial_count),
+			(u64)atomic64_read(&zram->stats.multi_pages_miss_free));
+	up_read(&zram->init_lock);
+
+	return ret;
+}
+#endif
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
 #ifdef CONFIG_ZRAM_WRITEBACK
 static DEVICE_ATTR_RO(bd_stat);
 #endif
 static DEVICE_ATTR_RO(debug_stat);
-
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+static DEVICE_ATTR_RO(multi_pages_debug_stat);
+#endif
 static void zram_meta_free(struct zram *zram, u64 disksize)
 {
 	size_t num_pages = disksize >> PAGE_SHIFT;
@@ -1227,6 +1254,7 @@ static void zram_meta_free(struct zram *zram, u64 disksize)
 static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 {
 	size_t num_pages;
+	int i;
 
 	num_pages = disksize >> PAGE_SHIFT;
 	zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
@@ -1239,8 +1267,11 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 		return false;
 	}
 
-	if (!huge_class_size)
-		huge_class_size = zs_huge_class_size(zram->mem_pool);
+	for (i = 0; i < ZSMALLOC_TYPE_MAX; i++) {
+		if (!huge_class_size[i])
+			huge_class_size[i] = zs_huge_class_size(zram->mem_pool, i);
+	}
+
 	return true;
 }
 
@@ -1306,7 +1337,7 @@ static void zram_free_page(struct zram *zram, size_t index)
  * Corresponding ZRAM slot should be locked.
  */
 static int zram_read_from_zspool(struct zram *zram, struct page *page,
-				 u32 index)
+				 u32 index, enum zsmalloc_type zs_type)
 {
 	struct zcomp_strm *zstrm;
 	unsigned long handle;
@@ -1314,6 +1345,12 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
 	void *src, *dst;
 	u32 prio;
 	int ret;
+	unsigned long page_size = PAGE_SIZE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (zs_type == ZSMALLOC_TYPE_MULTI_PAGES)
+		page_size = ZCOMP_MULTI_PAGES_SIZE;
+#endif
 
 	handle = zram_get_handle(zram, index);
 	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
@@ -1322,27 +1359,28 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
 
 		value = handle ? zram_get_element(zram, index) : 0;
 		mem = kmap_local_page(page);
-		zram_fill_page(mem, PAGE_SIZE, value);
+		zram_fill_page(mem, page_size, value);
 		kunmap_local(mem);
 		return 0;
 	}
 
 	size = zram_get_obj_size(zram, index);
 
-	if (size != PAGE_SIZE) {
+	if (size != page_size) {
 		prio = zram_get_priority(zram, index);
 		zstrm = zcomp_stream_get(zram->comps[prio]);
 	}
 
 	src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
-	if (size == PAGE_SIZE) {
+	if (size == page_size) {
 		dst = kmap_local_page(page);
 		copy_page(dst, src);
 		kunmap_local(dst);
 		ret = 0;
 	} else {
 		dst = kmap_local_page(page);
-		ret = zcomp_decompress(zstrm, src, size, dst);
+		ret = zcomp_decompress(zstrm, src, size, dst, page_size);
+
 		kunmap_local(dst);
 		zcomp_stream_put(zram->comps[prio]);
 	}
@@ -1358,7 +1396,7 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
 	zram_slot_lock(zram, index);
 	if (!zram_test_flag(zram, index, ZRAM_WB)) {
 		/* Slot should be locked through out the function call */
-		ret = zram_read_from_zspool(zram, page, index);
+		ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
 		zram_slot_unlock(zram, index);
 	} else {
 		/*
@@ -1415,9 +1453,18 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	struct zcomp_strm *zstrm;
 	unsigned long element = 0;
 	enum zram_pageflags flags = 0;
+	unsigned long page_size = PAGE_SIZE;
+	int huge_class_idx = ZSMALLOC_TYPE_BASEPAGE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (folio_size(page_folio(page)) >= ZCOMP_MULTI_PAGES_SIZE) {
+		page_size = ZCOMP_MULTI_PAGES_SIZE;
+		huge_class_idx = ZSMALLOC_TYPE_MULTI_PAGES;
+	}
+#endif
 
 	mem = kmap_local_page(page);
-	if (page_same_filled(mem, &element)) {
+	if (page_same_filled(mem, &element, page_size)) {
 		kunmap_local(mem);
 		/* Free memory associated with this sector now. */
 		flags = ZRAM_SAME;
@@ -1429,7 +1476,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 compress_again:
 	zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
 	src = kmap_local_page(page);
-	ret = zcomp_compress(zstrm, src, &comp_len);
+	ret = zcomp_compress(zstrm, src, page_size, &comp_len);
 	kunmap_local(src);
 
 	if (unlikely(ret)) {
@@ -1439,8 +1486,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		return ret;
 	}
 
-	if (comp_len >= huge_class_size)
-		comp_len = PAGE_SIZE;
+	if (comp_len >= huge_class_size[huge_class_idx])
+		comp_len = page_size;
 	/*
 	 * handle allocation has 2 paths:
 	 * a) fast path is executed with preemption disabled (for
@@ -1469,7 +1516,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		if (IS_ERR_VALUE(handle))
 			return PTR_ERR((void *)handle);
 
-		if (comp_len != PAGE_SIZE)
+		if (comp_len != page_size)
 			goto compress_again;
 		/*
 		 * If the page is not compressible, you need to acquire the
@@ -1493,10 +1540,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
 
 	src = zstrm->buffer;
-	if (comp_len == PAGE_SIZE)
+	if (comp_len == page_size)
 		src = kmap_local_page(page);
 	memcpy(dst, src, comp_len);
-	if (comp_len == PAGE_SIZE)
+	if (comp_len == page_size)
 		kunmap_local(src);
 
 	zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
@@ -1510,7 +1557,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	zram_slot_lock(zram, index);
 	zram_free_page(zram, index);
 
-	if (comp_len == PAGE_SIZE) {
+	if (comp_len == page_size) {
 		zram_set_flag(zram, index, ZRAM_HUGE);
 		atomic64_inc(&zram->stats.huge_pages);
 		atomic64_inc(&zram->stats.huge_pages_since);
@@ -1523,6 +1570,15 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		zram_set_handle(zram, index, handle);
 		zram_set_obj_size(zram, index, comp_len);
 	}
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (page_size == ZCOMP_MULTI_PAGES_SIZE) {
+		/* Set multi-pages compression flag for free or overwriting */
+		for (int i = 0; i < ZCOMP_MULTI_PAGES_NR; i++)
+			zram_set_flag(zram, index + i, ZRAM_COMP_MULTI_PAGES);
+	}
+#endif
+
 	zram_slot_unlock(zram, index);
 
 	/* Update stats */
@@ -1592,7 +1648,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
 	if (comp_len_old < threshold)
 		return 0;
 
-	ret = zram_read_from_zspool(zram, page, index);
+	ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
 	if (ret)
 		return ret;
 
@@ -1615,7 +1671,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
 		num_recomps++;
 		zstrm = zcomp_stream_get(zram->comps[prio]);
 		src = kmap_local_page(page);
-		ret = zcomp_compress(zstrm, src, &comp_len_new);
+		ret = zcomp_compress(zstrm, src, PAGE_SIZE, &comp_len_new);
 		kunmap_local(src);
 
 		if (ret) {
@@ -1749,7 +1805,7 @@ static ssize_t recompress_store(struct device *dev,
 		}
 	}
 
-	if (threshold >= huge_class_size)
+	if (threshold >= huge_class_size[ZSMALLOC_TYPE_BASEPAGE])
 		return -EINVAL;
 
 	down_read(&zram->init_lock);
@@ -1864,7 +1920,7 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio)
 	bio_endio(bio);
 }
 
-static void zram_bio_read(struct zram *zram, struct bio *bio)
+static void zram_bio_read_page(struct zram *zram, struct bio *bio)
 {
 	unsigned long start_time = bio_start_io_acct(bio);
 	struct bvec_iter iter = bio->bi_iter;
@@ -1895,7 +1951,7 @@ static void zram_bio_read(struct zram *zram, struct bio *bio)
 	bio_endio(bio);
 }
 
-static void zram_bio_write(struct zram *zram, struct bio *bio)
+static void zram_bio_write_page(struct zram *zram, struct bio *bio)
 {
 	unsigned long start_time = bio_start_io_acct(bio);
 	struct bvec_iter iter = bio->bi_iter;
@@ -1925,6 +1981,250 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
 	bio_endio(bio);
 }
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+
+/*
+ * The index is compress by multi-pages when any index ZRAM_COMP_MULTI_PAGES flag is set.
+ * Return: 0	: compress by page
+ *         > 0	: compress by multi-pages
+ */
+static inline int __test_multi_pages_comp(struct zram *zram, u32 index)
+{
+	int i;
+	int count = 0;
+	int head_index = index & ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+
+	for (i = 0; i < ZCOMP_MULTI_PAGES_NR; i++) {
+		if (zram_test_flag(zram, head_index + i, ZRAM_COMP_MULTI_PAGES))
+			count++;
+	}
+
+	return count;
+}
+
+static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+	if (bio->bi_io_vec->bv_len >= ZCOMP_MULTI_PAGES_SIZE)
+		return true;
+
+	zram_slot_lock(zram, index);
+	if (__test_multi_pages_comp(zram, index)) {
+		zram_slot_unlock(zram, index);
+		return true;
+	}
+	zram_slot_unlock(zram, index);
+
+	return false;
+}
+
+static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+	return !!__test_multi_pages_comp(zram, index);
+}
+
+static inline bool is_multi_pages_partial_io(struct bio_vec *bvec)
+{
+	return bvec->bv_len != ZCOMP_MULTI_PAGES_SIZE;
+}
+
+static int zram_read_multi_pages(struct zram *zram, struct page *page, u32 index,
+			  struct bio *parent)
+{
+	int ret;
+
+	zram_slot_lock(zram, index);
+	if (!zram_test_flag(zram, index, ZRAM_WB)) {
+		/* Slot should be locked through out the function call */
+		ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_MULTI_PAGES);
+		zram_slot_unlock(zram, index);
+	} else {
+		/*
+		 * The slot should be unlocked before reading from the backing
+		 * device.
+		 */
+		zram_slot_unlock(zram, index);
+
+		ret = read_from_bdev(zram, page, zram_get_element(zram, index),
+				     parent);
+	}
+
+	/* Should NEVER happen. Return bio error if it does. */
+	if (WARN_ON(ret < 0))
+		pr_err("Decompression failed! err=%d, page=%u\n", ret, index);
+
+	return ret;
+}
+/*
+ * Use a temporary buffer to decompress the page, as the decompressor
+ * always expects a full page for the output.
+ */
+static int zram_bvec_read_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
+				  u32 index, int offset)
+{
+	struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
+	int ret;
+
+	if (!page)
+		return -ENOMEM;
+	ret = zram_read_multi_pages(zram, page, index, NULL);
+	if (likely(!ret)) {
+		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+		void *dst = kmap_local_page(bvec->bv_page);
+		void *src = kmap_local_page(page);
+
+		memcpy(dst + bvec->bv_offset, src + offset, bvec->bv_len);
+		kunmap_local(src);
+		kunmap_local(dst);
+	}
+	__free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
+	return ret;
+}
+
+static int zram_bvec_read_multi_pages(struct zram *zram, struct bio_vec *bvec,
+			  u32 index, int offset, struct bio *bio)
+{
+	if (is_multi_pages_partial_io(bvec))
+		return zram_bvec_read_multi_pages_partial(zram, bvec, index, offset);
+	return zram_read_multi_pages(zram, bvec->bv_page, index, bio);
+}
+
+/*
+ * This is a partial IO. Read the full page before writing the changes.
+ */
+static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
+				   u32 index, int offset, struct bio *bio)
+{
+	struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
+	int ret;
+	void *src, *dst;
+
+	if (!page)
+		return -ENOMEM;
+
+	ret = zram_read_multi_pages(zram, page, index, bio);
+	if (!ret) {
+		src = kmap_local_page(bvec->bv_page);
+		dst = kmap_local_page(page);
+		memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
+		kunmap_local(dst);
+		kunmap_local(src);
+
+		atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
+		ret = zram_write_page(zram, page, index);
+	}
+	__free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
+	return ret;
+}
+
+static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
+			   u32 index, int offset, struct bio *bio)
+{
+	if (is_multi_pages_partial_io(bvec))
+		return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
+	return zram_write_page(zram, bvec->bv_page, index);
+}
+
+
+static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio)
+{
+	unsigned long start_time = bio_start_io_acct(bio);
+	struct bvec_iter iter = bio->bi_iter;
+
+	do {
+		/* Use head index, and other indexes are used as offset */
+		u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+				~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		u32 offset = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+				((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		struct bio_vec *pbv = bio->bi_io_vec;
+
+		atomic64_add(1, &zram->stats.zram_bio_read_multi_pages_count);
+		pbv->bv_len = min_t(u32, pbv->bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
+
+		if (zram_bvec_read_multi_pages(zram, pbv, index, offset, bio) < 0) {
+			atomic64_inc(&zram->stats.multi_pages_failed_reads);
+			bio->bi_status = BLK_STS_IOERR;
+			break;
+		}
+		flush_dcache_page(pbv->bv_page);
+
+		zram_slot_lock(zram, index);
+		zram_accessed(zram, index);
+		zram_slot_unlock(zram, index);
+
+		bio_advance_iter_single(bio, &iter, pbv->bv_len);
+	} while (iter.bi_size);
+
+	bio_end_io_acct(bio, start_time);
+	bio_endio(bio);
+}
+
+static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio)
+{
+	unsigned long start_time = bio_start_io_acct(bio);
+	struct bvec_iter iter = bio->bi_iter;
+
+	do {
+		/* Use head index, and other indexes are used as offset */
+		u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+				~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		u32 offset = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+				((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		struct bio_vec *pbv = bio->bi_io_vec;
+
+		pbv->bv_len = min_t(u32, pbv->bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
+
+		atomic64_add(1, &zram->stats.zram_bio_write_multi_pages_count);
+		if (zram_bvec_write_multi_pages(zram, pbv, index, offset, bio) < 0) {
+			atomic64_inc(&zram->stats.multi_pages_failed_writes);
+			bio->bi_status = BLK_STS_IOERR;
+			break;
+		}
+
+		zram_slot_lock(zram, index);
+		zram_accessed(zram, index);
+		zram_slot_unlock(zram, index);
+
+		bio_advance_iter_single(bio, &iter, pbv->bv_len);
+	} while (iter.bi_size);
+
+	bio_end_io_acct(bio, start_time);
+	bio_endio(bio);
+}
+#else
+static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	return false;
+}
+
+static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	return false;
+}
+static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio) {}
+static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio) {}
+#endif
+
+static void zram_bio_read(struct zram *zram, struct bio *bio)
+{
+	if (test_multi_pages_comp(zram, bio))
+		zram_bio_read_multi_pages(zram, bio);
+	else
+		zram_bio_read_page(zram, bio);
+}
+
+static void zram_bio_write(struct zram *zram, struct bio *bio)
+{
+	if (want_multi_pages_comp(zram, bio))
+		zram_bio_write_multi_pages(zram, bio);
+	else
+		zram_bio_write_page(zram, bio);
+}
+
 /*
  * Handler function for all zram I/O requests.
  */
@@ -1962,6 +2262,25 @@ static void zram_slot_free_notify(struct block_device *bdev,
 		return;
 	}
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	int comp_count = __test_multi_pages_comp(zram, index);
+
+	if (comp_count > 1) {
+		zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+		zram_slot_unlock(zram, index);
+		return;
+	} else if (comp_count == 1) {
+		zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+		zram_slot_unlock(zram, index);
+		/*only need to free head index*/
+		index &= ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		if (!zram_slot_trylock(zram, index)) {
+			atomic64_inc(&zram->stats.multi_pages_miss_free);
+			return;
+		}
+	}
+#endif
+
 	zram_free_page(zram, index);
 	zram_slot_unlock(zram, index);
 }
@@ -2158,6 +2477,9 @@ static struct attribute *zram_disk_attrs[] = {
 #endif
 	&dev_attr_io_stat.attr,
 	&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	&dev_attr_multi_pages_debug_stat.attr,
+#endif
 #ifdef CONFIG_ZRAM_WRITEBACK
 	&dev_attr_bd_stat.attr,
 #endif
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 37bf29f34d26..8481271b3ceb 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -38,7 +38,14 @@
  *
  * We use BUILD_BUG_ON() to make sure that zram pageflags don't overflow.
  */
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define ZRAM_FLAG_SHIFT (CONT_PTE_SHIFT + 1)
+#else
 #define ZRAM_FLAG_SHIFT (PAGE_SHIFT + 1)
+#endif
+
+#define ENABLE_HUGEPAGE_ZRAM_DEBUG 1
 
 /* Only 2 bits are allowed for comp priority index */
 #define ZRAM_COMP_PRIORITY_MASK	0x3
@@ -57,6 +64,10 @@ enum zram_pageflags {
 	ZRAM_COMP_PRIORITY_BIT1, /* First bit of comp priority index */
 	ZRAM_COMP_PRIORITY_BIT2, /* Second bit of comp priority index */
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	ZRAM_COMP_MULTI_PAGES,	/* Compressed by multi-pages */
+#endif
+
 	__NR_ZRAM_PAGEFLAGS,
 };
 
@@ -91,6 +102,16 @@ struct zram_stats {
 	atomic64_t bd_reads;		/* no. of reads from backing device */
 	atomic64_t bd_writes;		/* no. of writes from backing device */
 #endif
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	atomic64_t zram_bio_write_multi_pages_count;
+	atomic64_t zram_bio_read_multi_pages_count;
+	atomic64_t multi_pages_failed_writes;
+	atomic64_t multi_pages_failed_reads;
+	atomic64_t zram_bio_write_multi_pages_partial_count;
+	atomic64_t zram_bio_read_multi_pages_partial_count;
+	atomic64_t multi_pages_miss_free;
+#endif
 };
 
 #ifdef CONFIG_ZRAM_MULTI_COMP
-- 
2.34.1



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-03-27 21:48 ` [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages Barry Song
@ 2024-04-11  0:40   ` Sergey Senozhatsky
  2024-04-11  1:24     ` Barry Song
  2024-04-11  1:42   ` Sergey Senozhatsky
  2024-10-21 23:28   ` Barry Song
  2 siblings, 1 reply; 18+ messages in thread
From: Sergey Senozhatsky @ 2024-04-11  0:40 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, minchan, senozhatsky, linux-block, axboe, linux-mm,
	terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Tangquan Zheng,
	Barry Song

On (24/03/28 10:48), Barry Song wrote:
[..]
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 37bf29f34d26..8481271b3ceb 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -38,7 +38,14 @@
>   *
>   * We use BUILD_BUG_ON() to make sure that zram pageflags don't overflow.
>   */
> +
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +#define ZRAM_FLAG_SHIFT (CONT_PTE_SHIFT + 1)

So this is ARM-only?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-04-11  0:40   ` Sergey Senozhatsky
@ 2024-04-11  1:24     ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2024-04-11  1:24 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: akpm, minchan, linux-block, axboe, linux-mm, terrelln, chrisl,
	david, kasong, yuzhao, yosryahmed, nphamcs, willy, hannes,
	ying.huang, surenb, wajdi.k.feghali, kanchana.p.sridhar, corbet,
	zhouchengming, Tangquan Zheng, Barry Song

On Thu, Apr 11, 2024 at 12:41 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/03/28 10:48), Barry Song wrote:
> [..]
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index 37bf29f34d26..8481271b3ceb 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -38,7 +38,14 @@
> >   *
> >   * We use BUILD_BUG_ON() to make sure that zram pageflags don't overflow.
> >   */
> > +
> > +#ifdef CONFIG_ZRAM_MULTI_PAGES
> > +#define ZRAM_FLAG_SHIFT (CONT_PTE_SHIFT + 1)
>
> So this is ARM-only?

No, it seems that this aspect was overlooked during the patch cleanup process.
Currently, our reliance is solely on !HIGHMEM for the safe utilization of kmap
for multi-pages.
will fix it in v2.

Thanks
Barry


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-03-27 21:48 ` [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages Barry Song
  2024-04-11  0:40   ` Sergey Senozhatsky
@ 2024-04-11  1:42   ` Sergey Senozhatsky
  2024-04-11  2:03     ` Barry Song
  2024-10-21 23:28   ` Barry Song
  2 siblings, 1 reply; 18+ messages in thread
From: Sergey Senozhatsky @ 2024-04-11  1:42 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, minchan, senozhatsky, linux-block, axboe, linux-mm,
	terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Tangquan Zheng,
	Barry Song

On (24/03/28 10:48), Barry Song wrote:
[..]
> +/*
> + * Use a temporary buffer to decompress the page, as the decompressor
> + * always expects a full page for the output.
> + */
> +static int zram_bvec_read_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> +				  u32 index, int offset)
> +{
> +	struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> +	int ret;
> +
> +	if (!page)
> +		return -ENOMEM;
> +	ret = zram_read_multi_pages(zram, page, index, NULL);
> +	if (likely(!ret)) {
> +		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
> +		void *dst = kmap_local_page(bvec->bv_page);
> +		void *src = kmap_local_page(page);
> +
> +		memcpy(dst + bvec->bv_offset, src + offset, bvec->bv_len);
> +		kunmap_local(src);
> +		kunmap_local(dst);
> +	}
> +	__free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> +	return ret;
> +}

[..]

> +static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> +				   u32 index, int offset, struct bio *bio)
> +{
> +	struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> +	int ret;
> +	void *src, *dst;
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	ret = zram_read_multi_pages(zram, page, index, bio);
> +	if (!ret) {
> +		src = kmap_local_page(bvec->bv_page);
> +		dst = kmap_local_page(page);
> +		memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
> +		kunmap_local(dst);
> +		kunmap_local(src);
> +
> +		atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
> +		ret = zram_write_page(zram, page, index);
> +	}
> +	__free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> +	return ret;
> +}

What type of testing you run on it? How often do you see partial
reads and writes? Because this looks concerning - zsmalloc memory
usage reduction is one metrics, but this also can be achieved via
recompression, writeback, or even a different compression algorithm,
but higher CPU/power usage/higher requirements for physically contig
pages cannot be offset easily. (Another corner case, assume we have
partial read requests on every CPU simultaneously.)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-04-11  1:42   ` Sergey Senozhatsky
@ 2024-04-11  2:03     ` Barry Song
  2024-04-11  4:14       ` Sergey Senozhatsky
  0 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2024-04-11  2:03 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: akpm, minchan, linux-block, axboe, linux-mm, terrelln, chrisl,
	david, kasong, yuzhao, yosryahmed, nphamcs, willy, hannes,
	ying.huang, surenb, wajdi.k.feghali, kanchana.p.sridhar, corbet,
	zhouchengming, Tangquan Zheng, Barry Song

On Thu, Apr 11, 2024 at 1:42 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/03/28 10:48), Barry Song wrote:
> [..]
> > +/*
> > + * Use a temporary buffer to decompress the page, as the decompressor
> > + * always expects a full page for the output.
> > + */
> > +static int zram_bvec_read_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> > +                               u32 index, int offset)
> > +{
> > +     struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> > +     int ret;
> > +
> > +     if (!page)
> > +             return -ENOMEM;
> > +     ret = zram_read_multi_pages(zram, page, index, NULL);
> > +     if (likely(!ret)) {
> > +             atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
> > +             void *dst = kmap_local_page(bvec->bv_page);
> > +             void *src = kmap_local_page(page);
> > +
> > +             memcpy(dst + bvec->bv_offset, src + offset, bvec->bv_len);
> > +             kunmap_local(src);
> > +             kunmap_local(dst);
> > +     }
> > +     __free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> > +     return ret;
> > +}
>
> [..]
>
> > +static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> > +                                u32 index, int offset, struct bio *bio)
> > +{
> > +     struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> > +     int ret;
> > +     void *src, *dst;
> > +
> > +     if (!page)
> > +             return -ENOMEM;
> > +
> > +     ret = zram_read_multi_pages(zram, page, index, bio);
> > +     if (!ret) {
> > +             src = kmap_local_page(bvec->bv_page);
> > +             dst = kmap_local_page(page);
> > +             memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
> > +             kunmap_local(dst);
> > +             kunmap_local(src);
> > +
> > +             atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
> > +             ret = zram_write_page(zram, page, index);
> > +     }
> > +     __free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> > +     return ret;
> > +}
>
> What type of testing you run on it? How often do you see partial
> reads and writes? Because this looks concerning - zsmalloc memory
> usage reduction is one metrics, but this also can be achieved via
> recompression, writeback, or even a different compression algorithm,
> but higher CPU/power usage/higher requirements for physically contig
> pages cannot be offset easily. (Another corner case, assume we have
> partial read requests on every CPU simultaneously.)

This question brings up an interesting observation. In our actual product,
we've noticed a success rate of over 90% when allocating large folios in
do_swap_page, but occasionally, we encounter failures. In such cases,
instead of resorting to partial reads, we opt to allocate 16 small folios and
request zram to fill them all. This strategy effectively minimizes partial reads
to nearly zero. However, integrating this into the upstream codebase seems
like a considerable task, and for now, it remains part of our
out-of-tree code[1],
which is also open-source.
We're gradually sending patches for the swap-in process, systematically
cleaning up the product's code.

To enhance the success rate of large folio allocation, we've reserved some
page blocks for mTHP. This approach is currently absent from the mainline
codebase as well (Yu Zhao is trying to provide TAO [2]). Consequently, we
anticipate that partial reads may reach 50% or more until this method is
incorporated upstream.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
[2] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/

Thanks
Barry


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-04-11  2:03     ` Barry Song
@ 2024-04-11  4:14       ` Sergey Senozhatsky
  2024-04-11  7:49         ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Sergey Senozhatsky @ 2024-04-11  4:14 UTC (permalink / raw)
  To: Barry Song
  Cc: Sergey Senozhatsky, akpm, minchan, linux-block, axboe, linux-mm,
	terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Tangquan Zheng,
	Barry Song

On (24/04/11 14:03), Barry Song wrote:
> > [..]
> >
> > > +static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> > > +                                u32 index, int offset, struct bio *bio)
> > > +{
> > > +     struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> > > +     int ret;
> > > +     void *src, *dst;
> > > +
> > > +     if (!page)
> > > +             return -ENOMEM;
> > > +
> > > +     ret = zram_read_multi_pages(zram, page, index, bio);
> > > +     if (!ret) {
> > > +             src = kmap_local_page(bvec->bv_page);
> > > +             dst = kmap_local_page(page);
> > > +             memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
> > > +             kunmap_local(dst);
> > > +             kunmap_local(src);
> > > +
> > > +             atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
> > > +             ret = zram_write_page(zram, page, index);
> > > +     }
> > > +     __free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> > > +     return ret;
> > > +}
> >
> > What type of testing you run on it? How often do you see partial
> > reads and writes? Because this looks concerning - zsmalloc memory
> > usage reduction is one metrics, but this also can be achieved via
> > recompression, writeback, or even a different compression algorithm,
> > but higher CPU/power usage/higher requirements for physically contig
> > pages cannot be offset easily. (Another corner case, assume we have
> > partial read requests on every CPU simultaneously.)
> 
> This question brings up an interesting observation. In our actual product,
> we've noticed a success rate of over 90% when allocating large folios in
> do_swap_page, but occasionally, we encounter failures. In such cases,
> instead of resorting to partial reads, we opt to allocate 16 small folios and
> request zram to fill them all. This strategy effectively minimizes partial reads
> to nearly zero. However, integrating this into the upstream codebase seems
> like a considerable task, and for now, it remains part of our
> out-of-tree code[1],
> which is also open-source.
> We're gradually sending patches for the swap-in process, systematically
> cleaning up the product's code.

I see, thanks for explanation.
Does this sound like this series is ahead of its time?

> To enhance the success rate of large folio allocation, we've reserved some
> page blocks for mTHP. This approach is currently absent from the mainline
> codebase as well (Yu Zhao is trying to provide TAO [2]). Consequently, we
> anticipate that partial reads may reach 50% or more until this method is
> incorporated upstream.

These partial reads/writes are difficult to justify - instead of doing
comp_op(PAGE_SIZE) we, in the worst case, now can do ZCOMP_MULTI_PAGES_NR
of comp_op(ZCOMP_MULTI_PAGES_ORDER) (assuming a access pattern that
touches each of multi-pages individually). That is a potentially huge
increase in CPU/power usage, which cannot be easily sacrificed. In fact,
I'd probably say that power usage is more important here than zspool
memory usage (that we have means to deal with).

Have you evaluated power usage?

I also wonder if it brings down the number of ZRAM_SAME pages. Suppose
when several pages out of ZCOMP_MULTI_PAGES_ORDER are filled with zeroes
(or some other recognizable pattern) which previously would have been
stored using just unsigned long. Makes me even wonder if ZRAM_SAME test
makes sense on multi-page at all, for that matter.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-04-11  4:14       ` Sergey Senozhatsky
@ 2024-04-11  7:49         ` Barry Song
  2024-04-19  3:41           ` Sergey Senozhatsky
  0 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2024-04-11  7:49 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: akpm, minchan, linux-block, axboe, linux-mm, terrelln, chrisl,
	david, kasong, yuzhao, yosryahmed, nphamcs, willy, hannes,
	ying.huang, surenb, wajdi.k.feghali, kanchana.p.sridhar, corbet,
	zhouchengming, Tangquan Zheng, Barry Song

On Thu, Apr 11, 2024 at 4:14 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (24/04/11 14:03), Barry Song wrote:
> > > [..]
> > >
> > > > +static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> > > > +                                u32 index, int offset, struct bio *bio)
> > > > +{
> > > > +     struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> > > > +     int ret;
> > > > +     void *src, *dst;
> > > > +
> > > > +     if (!page)
> > > > +             return -ENOMEM;
> > > > +
> > > > +     ret = zram_read_multi_pages(zram, page, index, bio);
> > > > +     if (!ret) {
> > > > +             src = kmap_local_page(bvec->bv_page);
> > > > +             dst = kmap_local_page(page);
> > > > +             memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
> > > > +             kunmap_local(dst);
> > > > +             kunmap_local(src);
> > > > +
> > > > +             atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
> > > > +             ret = zram_write_page(zram, page, index);
> > > > +     }
> > > > +     __free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> > > > +     return ret;
> > > > +}
> > >
> > > What type of testing you run on it? How often do you see partial
> > > reads and writes? Because this looks concerning - zsmalloc memory
> > > usage reduction is one metrics, but this also can be achieved via
> > > recompression, writeback, or even a different compression algorithm,
> > > but higher CPU/power usage/higher requirements for physically contig
> > > pages cannot be offset easily. (Another corner case, assume we have
> > > partial read requests on every CPU simultaneously.)
> >
> > This question brings up an interesting observation. In our actual product,
> > we've noticed a success rate of over 90% when allocating large folios in
> > do_swap_page, but occasionally, we encounter failures. In such cases,
> > instead of resorting to partial reads, we opt to allocate 16 small folios and
> > request zram to fill them all. This strategy effectively minimizes partial reads
> > to nearly zero. However, integrating this into the upstream codebase seems
> > like a considerable task, and for now, it remains part of our
> > out-of-tree code[1],
> > which is also open-source.
> > We're gradually sending patches for the swap-in process, systematically
> > cleaning up the product's code.
>
> I see, thanks for explanation.
> Does this sound like this series is ahead of its time?

I feel it is necessary to present the whole picture together with large folios
swp-in series[1]. On the other hand, there is a possibility this can
land earlier
before everything is really with default "disable", but for those
platforms which
have finely tuned partial read/write, they can enable it.

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/

>
> > To enhance the success rate of large folio allocation, we've reserved some
> > page blocks for mTHP. This approach is currently absent from the mainline
> > codebase as well (Yu Zhao is trying to provide TAO [2]). Consequently, we
> > anticipate that partial reads may reach 50% or more until this method is
> > incorporated upstream.
>
> These partial reads/writes are difficult to justify - instead of doing
> comp_op(PAGE_SIZE) we, in the worst case, now can do ZCOMP_MULTI_PAGES_NR
> of comp_op(ZCOMP_MULTI_PAGES_ORDER) (assuming a access pattern that
> touches each of multi-pages individually). That is a potentially huge
> increase in CPU/power usage, which cannot be easily sacrificed. In fact,
> I'd probably say that power usage is more important here than zspool
> memory usage (that we have means to deal with).

Once Ryan's mTHP swapout without splitting [2] is integrated into the
mainline, this
patchset certainly gains an advantage for SWPOUT. However, for SWPIN,
the situation
is more nuanced. There's a risk of failing to allocate mTHP, which
could result in the
allocation of a small folio instead. In such cases, decompressing a
large folio but
copying only one subpage leads to inefficiency.

In real-world products, we've addressed this challenge in two ways:
1. We've enhanced reserved page blocks for mTHP to boost allocation
success rates.
2. In instances where we fail to allocate a large folio, we fall back
to allocating nr_pages
small folios instead of just one. so we still only decompress once for
multi-pages.

With these measures in place, we consistently achieve wins in both
power consumption and
memory savings. However, it's important to note that these
optimizations are specific to our
product, and there's still much work needed to upstream them all.

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/

>
> Have you evaluated power usage?
>
> I also wonder if it brings down the number of ZRAM_SAME pages. Suppose
> when several pages out of ZCOMP_MULTI_PAGES_ORDER are filled with zeroes
> (or some other recognizable pattern) which previously would have been
> stored using just unsigned long. Makes me even wonder if ZRAM_SAME test
> makes sense on multi-page at all, for that matter.

I don't think we need to worry about ZRAM_SAME. ARM64 supports 4KB, 16KB, and
64KB base pages. Even if we configure the base page to 16KB or 64KB,
there's still
a possibility of missing out on identifying SAME PAGES that are
identical at the 4KB
level but not at the 16/64KB granularity.

In our product, we continue to observe many SAME PAGES using
multi-page mechanisms.
Even if we miss some opportunities to identify same pages at the 4KB
level, the compressed
data remains relatively small, though not as compact as SAME_PAGE.
Overall, in typical
12GiB/16GiB phones, we still achieve a memory saving of around 800MiB
by this patchset.

mTHP offers a means to emulate a 16KiB/64KiB base page while
maintaining software
compatibility with a 4KiB base page. The primary concern here lies in
partial read/write
operations. In our product, we've successfully addressed these issues. However,
convincing people in the mainline community may take considerable time
and effort :-)

Thanks
Barry

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-04-11  7:49         ` Barry Song
@ 2024-04-19  3:41           ` Sergey Senozhatsky
  0 siblings, 0 replies; 18+ messages in thread
From: Sergey Senozhatsky @ 2024-04-19  3:41 UTC (permalink / raw)
  To: Barry Song
  Cc: Sergey Senozhatsky, akpm, minchan, linux-block, axboe, linux-mm,
	terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Tangquan Zheng,
	Barry Song

On (24/04/11 19:49), Barry Song wrote:
> > > This question brings up an interesting observation. In our actual product,
> > > we've noticed a success rate of over 90% when allocating large folios in
> > > do_swap_page, but occasionally, we encounter failures. In such cases,
> > > instead of resorting to partial reads, we opt to allocate 16 small folios and
> > > request zram to fill them all. This strategy effectively minimizes partial reads
> > > to nearly zero. However, integrating this into the upstream codebase seems
> > > like a considerable task, and for now, it remains part of our
> > > out-of-tree code[1],
> > > which is also open-source.
> > > We're gradually sending patches for the swap-in process, systematically
> > > cleaning up the product's code.
> >
> > I see, thanks for explanation.
> > Does this sound like this series is ahead of its time?
> 
> I feel it is necessary to present the whole picture together with large folios
> swp-in series[1]

Yeah, makes sense.

> > These partial reads/writes are difficult to justify - instead of doing
> > comp_op(PAGE_SIZE) we, in the worst case, now can do ZCOMP_MULTI_PAGES_NR
> > of comp_op(ZCOMP_MULTI_PAGES_ORDER) (assuming a access pattern that
> > touches each of multi-pages individually). That is a potentially huge
> > increase in CPU/power usage, which cannot be easily sacrificed. In fact,
> > I'd probably say that power usage is more important here than zspool
> > memory usage (that we have means to deal with).
> 
> Once Ryan's mTHP swapout without splitting [2] is integrated into the
> mainline, this
> patchset certainly gains an advantage for SWPOUT. However, for SWPIN,
> the situation
> is more nuanced. There's a risk of failing to allocate mTHP, which
> could result in the
> allocation of a small folio instead. In such cases, decompressing a
> large folio but
> copying only one subpage leads to inefficiency.
> 
> In real-world products, we've addressed this challenge in two ways:
> 1. We've enhanced reserved page blocks for mTHP to boost allocation
> success rates.
> 2. In instances where we fail to allocate a large folio, we fall back
> to allocating nr_pages
> small folios instead of just one. so we still only decompress once for
> multi-pages.
> 
> With these measures in place, we consistently achieve wins in both
> power consumption and
> memory savings. However, it's important to note that these
> optimizations are specific to our
> product, and there's still much work needed to upstream them all.

Do you track any other metrics? Memory savings is just one way of looking
at it. The other metrics is utilization ratio of zspool
	compressed size : zs_get_total_pages(zram->mem_pool)

Compaction and migration can also be interesting, given that
zsmalloc is changing.

> > Have you evaluated power usage?
> >
> > I also wonder if it brings down the number of ZRAM_SAME pages. Suppose
> > when several pages out of ZCOMP_MULTI_PAGES_ORDER are filled with zeroes
> > (or some other recognizable pattern) which previously would have been
> > stored using just unsigned long. Makes me even wonder if ZRAM_SAME test
> > makes sense on multi-page at all, for that matter.
> 
> I don't think we need to worry about ZRAM_SAME.

Oh, it's not that I worry about it, just another thing that is
changing. E.g. having memcpy() /* current ZRAM_SAME handing ling */
vs decomp(order 4) and then memcpy().

> mTHP offers a means to emulate a 16KiB/64KiB base page while
> maintaining software
> compatibility with a 4KiB base page. The primary concern here lies in
> partial read/write
> operations. In our product, we've successfully addressed these issues. However,
> convincing people in the mainline community may take considerable time
> and effort :-)

Do you have a rebased zram/zsmalloc series somewhere in public access
that I can test?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-03-27 21:48 ` [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages Barry Song
  2024-04-11  0:40   ` Sergey Senozhatsky
  2024-04-11  1:42   ` Sergey Senozhatsky
@ 2024-10-21 23:28   ` Barry Song
  2024-11-06 16:23     ` Usama Arif
  2 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2024-10-21 23:28 UTC (permalink / raw)
  To: 21cnbao
  Cc: akpm, axboe, chrisl, corbet, david, hannes, kanchana.p.sridhar,
	kasong, linux-block, linux-mm, minchan, nphamcs, senozhatsky,
	surenb, terrelln, v-songbaohua, wajdi.k.feghali, willy,
	ying.huang, yosryahmed, yuzhao, zhengtangquan, zhouchengming,
	bala.seshasayee

> From: Tangquan Zheng <zhengtangquan@oppo.com>
> 
> Currently, when a large folio with nr_pages is submitted to zram, it is
> divided into nr_pages parts for compression and storage individually.
> By transitioning to a higher granularity, we can notably enhance
> compression rates while simultaneously reducing CPU consumption.
> 
> This patch introduces the capability for large folios to be divided
> based on the granularity specified by ZSMALLOC_MULTI_PAGES_ORDER, which
> defaults to 4. For instance, large folios smaller than 64KiB will continue
> to be compressed at a 4KiB granularity. However, for folios sized at
> 128KiB, compression will occur in two 64KiB multi-pages.
> 
> This modification will notably reduce CPU consumption and enhance
> compression ratios. The following data illustrates the time and
> compressed data for typical anonymous pages gathered from Android
> phones.
> 
> granularity   orig_data_size   compr_data_size   time(us)
> 4KiB-zstd      1048576000       246876055        50259962
> 64KiB-zstd     1048576000       199763892        18330605
> 
> We observe a precisely similar reduction in time required for decompressing
> a 64KiB block compared to decompressing 16 * 4KiB blocks.
> 
> Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Since some people are using our patches and occasionally encountering
crashes (reported to me privately, not on the mailing list), I'm sharing
these fixes now while we finalize v2, which will be sent shortly:

> ---
>  drivers/block/zram/Kconfig    |   9 +
>  drivers/block/zram/zcomp.c    |  23 ++-
>  drivers/block/zram/zcomp.h    |  12 +-
>  drivers/block/zram/zram_drv.c | 372 +++++++++++++++++++++++++++++++---
>  drivers/block/zram/zram_drv.h |  21 ++
>  5 files changed, 399 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
> index 7b29cce60ab2..c8b44dd30d0f 100644
> --- a/drivers/block/zram/Kconfig
> +++ b/drivers/block/zram/Kconfig
> @@ -96,3 +96,12 @@ config ZRAM_MULTI_COMP
>  	  re-compress pages using a potentially slower but more effective
>  	  compression algorithm. Note, that IDLE page recompression
>  	  requires ZRAM_TRACK_ENTRY_ACTIME.
> +
> +config ZRAM_MULTI_PAGES
> +	bool "Enable multiple pages compression and decompression"
> +	depends on ZRAM && ZSMALLOC_MULTI_PAGES
> +	help
> +	  Initially, zram divided large folios into blocks of nr_pages, each sized
> +	  equal to PAGE_SIZE, for compression. This option fine-tunes zram to
> +	  improve compression granularity by dividing large folios into larger
> +	  parts defined by the configuration option ZSMALLOC_MULTI_PAGES_ORDER.
> diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
> index 8237b08c49d8..ff6df838c066 100644
> --- a/drivers/block/zram/zcomp.c
> +++ b/drivers/block/zram/zcomp.c
> @@ -12,7 +12,6 @@
>  #include <linux/cpu.h>
>  #include <linux/crypto.h>
>  #include <linux/vmalloc.h>
> -
>  #include "zcomp.h"
>  
>  static const char * const backends[] = {
> @@ -50,11 +49,16 @@ static void zcomp_strm_free(struct zcomp_strm *zstrm)
>  static int zcomp_strm_init(struct zcomp_strm *zstrm, struct zcomp *comp)
>  {
>  	zstrm->tfm = crypto_alloc_comp(comp->name, 0, 0);
> +	unsigned long page_size = PAGE_SIZE;
> +
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	page_size = ZCOMP_MULTI_PAGES_SIZE;
> +#endif
>  	/*
>  	 * allocate 2 pages. 1 for compressed data, plus 1 extra for the
>  	 * case when compressed size is larger than the original one
>  	 */
> -	zstrm->buffer = vzalloc(2 * PAGE_SIZE);
> +	zstrm->buffer = vzalloc(2 * page_size);
>  	if (IS_ERR_OR_NULL(zstrm->tfm) || !zstrm->buffer) {
>  		zcomp_strm_free(zstrm);
>  		return -ENOMEM;
> @@ -115,8 +119,8 @@ void zcomp_stream_put(struct zcomp *comp)
>  	local_unlock(&comp->stream->lock);
>  }
>  
> -int zcomp_compress(struct zcomp_strm *zstrm,
> -		const void *src, unsigned int *dst_len)
> +int zcomp_compress(struct zcomp_strm *zstrm, const void *src, unsigned int src_len,
> +		   unsigned int *dst_len)
>  {
>  	/*
>  	 * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
> @@ -132,18 +136,17 @@ int zcomp_compress(struct zcomp_strm *zstrm,
>  	 * the dst buffer, zram_drv will take care of the fact that
>  	 * compressed buffer is too big.
>  	 */
> -	*dst_len = PAGE_SIZE * 2;
> +
> +	*dst_len = src_len * 2;
>  
>  	return crypto_comp_compress(zstrm->tfm,
> -			src, PAGE_SIZE,
> +			src, src_len,
>  			zstrm->buffer, dst_len);
>  }
>  
> -int zcomp_decompress(struct zcomp_strm *zstrm,
> -		const void *src, unsigned int src_len, void *dst)
> +int zcomp_decompress(struct zcomp_strm *zstrm, const void *src, unsigned int src_len,
> +		     void *dst, unsigned int dst_len)
>  {
> -	unsigned int dst_len = PAGE_SIZE;
> -
>  	return crypto_comp_decompress(zstrm->tfm,
>  			src, src_len,
>  			dst, &dst_len);
> diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
> index e9fe63da0e9b..6788d1b2c30f 100644
> --- a/drivers/block/zram/zcomp.h
> +++ b/drivers/block/zram/zcomp.h
> @@ -7,6 +7,12 @@
>  #define _ZCOMP_H_
>  #include <linux/local_lock.h>
>  
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +#define ZCOMP_MULTI_PAGES_ORDER	(_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
> +#define ZCOMP_MULTI_PAGES_NR	(1 << ZCOMP_MULTI_PAGES_ORDER)
> +#define ZCOMP_MULTI_PAGES_SIZE	(PAGE_SIZE * ZCOMP_MULTI_PAGES_NR)
> +#endif
> +
>  struct zcomp_strm {
>  	/* The members ->buffer and ->tfm are protected by ->lock. */
>  	local_lock_t lock;
> @@ -34,9 +40,9 @@ struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
>  void zcomp_stream_put(struct zcomp *comp);
>  
>  int zcomp_compress(struct zcomp_strm *zstrm,
> -		const void *src, unsigned int *dst_len);
> +		const void *src, unsigned int src_len, unsigned int *dst_len);
>  
>  int zcomp_decompress(struct zcomp_strm *zstrm,
> -		const void *src, unsigned int src_len, void *dst);
> -
> +		const void *src, unsigned int src_len, void *dst, unsigned int dst_len);
> +bool zcomp_set_max_streams(struct zcomp *comp, int num_strm);
>  #endif /* _ZCOMP_H_ */
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index f0639df6cd18..0d7b9efd4eb4 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -49,7 +49,7 @@ static unsigned int num_devices = 1;
>   * Pages that compress to sizes equals or greater than this are stored
>   * uncompressed in memory.
>   */
> -static size_t huge_class_size;
> +static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
>  
>  static const struct block_device_operations zram_devops;
>  
> @@ -201,11 +201,11 @@ static inline void zram_fill_page(void *ptr, unsigned long len,
>  	memset_l(ptr, value, len / sizeof(unsigned long));
>  }
>  
> -static bool page_same_filled(void *ptr, unsigned long *element)
> +static bool page_same_filled(void *ptr, unsigned long *element, unsigned int page_size)
>  {
>  	unsigned long *page;
>  	unsigned long val;
> -	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
> +	unsigned int pos, last_pos = page_size / sizeof(*page) - 1;
>  
>  	page = (unsigned long *)ptr;
>  	val = page[0];
> @@ -1204,13 +1204,40 @@ static ssize_t debug_stat_show(struct device *dev,
>  	return ret;
>  }
>  
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +static ssize_t multi_pages_debug_stat_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	struct zram *zram = dev_to_zram(dev);
> +	ssize_t ret = 0;
> +
> +	down_read(&zram->init_lock);
> +	ret = scnprintf(buf, PAGE_SIZE,
> +			"zram_bio write/read multi_pages count:%8llu %8llu\n"
> +			"zram_bio failed write/read multi_pages count%8llu %8llu\n"
> +			"zram_bio partial write/read multi_pages count%8llu %8llu\n"
> +			"multi_pages_miss_free %8llu\n",
> +			(u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_count),
> +			(u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_count),
> +			(u64)atomic64_read(&zram->stats.multi_pages_failed_writes),
> +			(u64)atomic64_read(&zram->stats.multi_pages_failed_reads),
> +			(u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_partial_count),
> +			(u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_partial_count),
> +			(u64)atomic64_read(&zram->stats.multi_pages_miss_free));
> +	up_read(&zram->init_lock);
> +
> +	return ret;
> +}
> +#endif
>  static DEVICE_ATTR_RO(io_stat);
>  static DEVICE_ATTR_RO(mm_stat);
>  #ifdef CONFIG_ZRAM_WRITEBACK
>  static DEVICE_ATTR_RO(bd_stat);
>  #endif
>  static DEVICE_ATTR_RO(debug_stat);
> -
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +static DEVICE_ATTR_RO(multi_pages_debug_stat);
> +#endif
>  static void zram_meta_free(struct zram *zram, u64 disksize)
>  {
>  	size_t num_pages = disksize >> PAGE_SHIFT;
> @@ -1227,6 +1254,7 @@ static void zram_meta_free(struct zram *zram, u64 disksize)
>  static bool zram_meta_alloc(struct zram *zram, u64 disksize)
>  {
>  	size_t num_pages;
> +	int i;
>  
>  	num_pages = disksize >> PAGE_SHIFT;
>  	zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
> @@ -1239,8 +1267,11 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
>  		return false;
>  	}
>  
> -	if (!huge_class_size)
> -		huge_class_size = zs_huge_class_size(zram->mem_pool);
> +	for (i = 0; i < ZSMALLOC_TYPE_MAX; i++) {
> +		if (!huge_class_size[i])
> +			huge_class_size[i] = zs_huge_class_size(zram->mem_pool, i);
> +	}
> +
>  	return true;
>  }
>  
> @@ -1306,7 +1337,7 @@ static void zram_free_page(struct zram *zram, size_t index)
>   * Corresponding ZRAM slot should be locked.
>   */
>  static int zram_read_from_zspool(struct zram *zram, struct page *page,
> -				 u32 index)
> +				 u32 index, enum zsmalloc_type zs_type)
>  {
>  	struct zcomp_strm *zstrm;
>  	unsigned long handle;
> @@ -1314,6 +1345,12 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
>  	void *src, *dst;
>  	u32 prio;
>  	int ret;
> +	unsigned long page_size = PAGE_SIZE;
> +
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	if (zs_type == ZSMALLOC_TYPE_MULTI_PAGES)
> +		page_size = ZCOMP_MULTI_PAGES_SIZE;
> +#endif
>  
>  	handle = zram_get_handle(zram, index);
>  	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
> @@ -1322,27 +1359,28 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
>  
>  		value = handle ? zram_get_element(zram, index) : 0;
>  		mem = kmap_local_page(page);
> -		zram_fill_page(mem, PAGE_SIZE, value);
> +		zram_fill_page(mem, page_size, value);
>  		kunmap_local(mem);
>  		return 0;
>  	}
>  
>  	size = zram_get_obj_size(zram, index);
>  
> -	if (size != PAGE_SIZE) {
> +	if (size != page_size) {
>  		prio = zram_get_priority(zram, index);
>  		zstrm = zcomp_stream_get(zram->comps[prio]);
>  	}
>  
>  	src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
> -	if (size == PAGE_SIZE) {
> +	if (size == page_size) {
>  		dst = kmap_local_page(page);
>  		copy_page(dst, src);

	copy_page() should be changed to:

	memcpy(dst, src, page_size);

>  		kunmap_local(dst);
>  		ret = 0;
>  	} else {
>  		dst = kmap_local_page(page);
> -		ret = zcomp_decompress(zstrm, src, size, dst);
> +		ret = zcomp_decompress(zstrm, src, size, dst, page_size);
> +
>  		kunmap_local(dst);
>  		zcomp_stream_put(zram->comps[prio]);
>  	}
> @@ -1358,7 +1396,7 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
>  	zram_slot_lock(zram, index);
>  	if (!zram_test_flag(zram, index, ZRAM_WB)) {
>  		/* Slot should be locked through out the function call */
> -		ret = zram_read_from_zspool(zram, page, index);
> +		ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
>  		zram_slot_unlock(zram, index);
>  	} else {
>  		/*
> @@ -1415,9 +1453,18 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  	struct zcomp_strm *zstrm;
>  	unsigned long element = 0;
>  	enum zram_pageflags flags = 0;
> +	unsigned long page_size = PAGE_SIZE;
> +	int huge_class_idx = ZSMALLOC_TYPE_BASEPAGE;
> +
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	if (folio_size(page_folio(page)) >= ZCOMP_MULTI_PAGES_SIZE) {
> +		page_size = ZCOMP_MULTI_PAGES_SIZE;
> +		huge_class_idx = ZSMALLOC_TYPE_MULTI_PAGES;
> +	}
> +#endif
>  
>  	mem = kmap_local_page(page);
> -	if (page_same_filled(mem, &element)) {
> +	if (page_same_filled(mem, &element, page_size)) {
>  		kunmap_local(mem);
>  		/* Free memory associated with this sector now. */
>  		flags = ZRAM_SAME;
> @@ -1429,7 +1476,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  compress_again:
>  	zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
>  	src = kmap_local_page(page);
> -	ret = zcomp_compress(zstrm, src, &comp_len);
> +	ret = zcomp_compress(zstrm, src, page_size, &comp_len);
>  	kunmap_local(src);
>  
>  	if (unlikely(ret)) {
> @@ -1439,8 +1486,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  		return ret;
>  	}
>  
> -	if (comp_len >= huge_class_size)
> -		comp_len = PAGE_SIZE;
> +	if (comp_len >= huge_class_size[huge_class_idx])
> +		comp_len = page_size;
>  	/*
>  	 * handle allocation has 2 paths:
>  	 * a) fast path is executed with preemption disabled (for
> @@ -1469,7 +1516,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  		if (IS_ERR_VALUE(handle))
>  			return PTR_ERR((void *)handle);
>  
> -		if (comp_len != PAGE_SIZE)
> +		if (comp_len != page_size)
>  			goto compress_again;
>  		/*
>  		 * If the page is not compressible, you need to acquire the
> @@ -1493,10 +1540,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  	dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
>  
>  	src = zstrm->buffer;
> -	if (comp_len == PAGE_SIZE)
> +	if (comp_len == page_size)
>  		src = kmap_local_page(page);
>  	memcpy(dst, src, comp_len);
> -	if (comp_len == PAGE_SIZE)
> +	if (comp_len == page_size)
>  		kunmap_local(src);
>  
>  	zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
> @@ -1510,7 +1557,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  	zram_slot_lock(zram, index);
>  	zram_free_page(zram, index);
>  
> -	if (comp_len == PAGE_SIZE) {
> +	if (comp_len == page_size) {
>  		zram_set_flag(zram, index, ZRAM_HUGE);
>  		atomic64_inc(&zram->stats.huge_pages);
>  		atomic64_inc(&zram->stats.huge_pages_since);
> @@ -1523,6 +1570,15 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  		zram_set_handle(zram, index, handle);
>  		zram_set_obj_size(zram, index, comp_len);
>  	}
> +
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	if (page_size == ZCOMP_MULTI_PAGES_SIZE) {
> +		/* Set multi-pages compression flag for free or overwriting */
> +		for (int i = 0; i < ZCOMP_MULTI_PAGES_NR; i++)
> +			zram_set_flag(zram, index + i, ZRAM_COMP_MULTI_PAGES);
> +	}
> +#endif
> +
>  	zram_slot_unlock(zram, index);
>  
>  	/* Update stats */
> @@ -1592,7 +1648,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
>  	if (comp_len_old < threshold)
>  		return 0;
>  
> -	ret = zram_read_from_zspool(zram, page, index);
> +	ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
>  	if (ret)
>  		return ret;
>  
> @@ -1615,7 +1671,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
>  		num_recomps++;
>  		zstrm = zcomp_stream_get(zram->comps[prio]);
>  		src = kmap_local_page(page);
> -		ret = zcomp_compress(zstrm, src, &comp_len_new);
> +		ret = zcomp_compress(zstrm, src, PAGE_SIZE, &comp_len_new);
>  		kunmap_local(src);
>  
>  		if (ret) {
> @@ -1749,7 +1805,7 @@ static ssize_t recompress_store(struct device *dev,
>  		}
>  	}
>  
> -	if (threshold >= huge_class_size)
> +	if (threshold >= huge_class_size[ZSMALLOC_TYPE_BASEPAGE])
>  		return -EINVAL;
>  
>  	down_read(&zram->init_lock);
> @@ -1864,7 +1920,7 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio)
>  	bio_endio(bio);
>  }
>  
> -static void zram_bio_read(struct zram *zram, struct bio *bio)
> +static void zram_bio_read_page(struct zram *zram, struct bio *bio)
>  {
>  	unsigned long start_time = bio_start_io_acct(bio);
>  	struct bvec_iter iter = bio->bi_iter;
> @@ -1895,7 +1951,7 @@ static void zram_bio_read(struct zram *zram, struct bio *bio)
>  	bio_endio(bio);
>  }
>  
> -static void zram_bio_write(struct zram *zram, struct bio *bio)
> +static void zram_bio_write_page(struct zram *zram, struct bio *bio)
>  {
>  	unsigned long start_time = bio_start_io_acct(bio);
>  	struct bvec_iter iter = bio->bi_iter;
> @@ -1925,6 +1981,250 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
>  	bio_endio(bio);
>  }
>  
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +
> +/*
> + * The index is compress by multi-pages when any index ZRAM_COMP_MULTI_PAGES flag is set.
> + * Return: 0	: compress by page
> + *         > 0	: compress by multi-pages
> + */
> +static inline int __test_multi_pages_comp(struct zram *zram, u32 index)
> +{
> +	int i;
> +	int count = 0;
> +	int head_index = index & ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
> +
> +	for (i = 0; i < ZCOMP_MULTI_PAGES_NR; i++) {
> +		if (zram_test_flag(zram, head_index + i, ZRAM_COMP_MULTI_PAGES))
> +			count++;
> +	}
> +
> +	return count;
> +}
> +
> +static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
> +{
> +	u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
> +
> +	if (bio->bi_io_vec->bv_len >= ZCOMP_MULTI_PAGES_SIZE)
> +		return true;
> +
> +	zram_slot_lock(zram, index);
> +	if (__test_multi_pages_comp(zram, index)) {
> +		zram_slot_unlock(zram, index);
> +		return true;
> +	}
> +	zram_slot_unlock(zram, index);
> +
> +	return false;
> +}
> +
> +static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
> +{
> +	u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
> +
> +	return !!__test_multi_pages_comp(zram, index);
> +}
> +
> +static inline bool is_multi_pages_partial_io(struct bio_vec *bvec)
> +{
> +	return bvec->bv_len != ZCOMP_MULTI_PAGES_SIZE;
> +}
> +
> +static int zram_read_multi_pages(struct zram *zram, struct page *page, u32 index,
> +			  struct bio *parent)
> +{
> +	int ret;
> +
> +	zram_slot_lock(zram, index);
> +	if (!zram_test_flag(zram, index, ZRAM_WB)) {
> +		/* Slot should be locked through out the function call */
> +		ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_MULTI_PAGES);
> +		zram_slot_unlock(zram, index);
> +	} else {
> +		/*
> +		 * The slot should be unlocked before reading from the backing
> +		 * device.
> +		 */
> +		zram_slot_unlock(zram, index);
> +
> +		ret = read_from_bdev(zram, page, zram_get_element(zram, index),
> +				     parent);
> +	}
> +
> +	/* Should NEVER happen. Return bio error if it does. */
> +	if (WARN_ON(ret < 0))
> +		pr_err("Decompression failed! err=%d, page=%u\n", ret, index);
> +
> +	return ret;
> +}
> +/*
> + * Use a temporary buffer to decompress the page, as the decompressor
> + * always expects a full page for the output.
> + */
> +static int zram_bvec_read_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> +				  u32 index, int offset)
> +{
> +	struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> +	int ret;
> +
> +	if (!page)
> +		return -ENOMEM;
> +	ret = zram_read_multi_pages(zram, page, index, NULL);
> +	if (likely(!ret)) {
> +		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
> +		void *dst = kmap_local_page(bvec->bv_page);
> +		void *src = kmap_local_page(page);
> +
> +		memcpy(dst + bvec->bv_offset, src + offset, bvec->bv_len);
> +		kunmap_local(src);
> +		kunmap_local(dst);
> +	}
> +	__free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> +	return ret;
> +}
> +

alloc_pages() might fail, so we don't depend on allocation:

+static int zram_read_partial_from_zspool(struct zram *zram, struct page *page,
+				 u32 index, enum zsmalloc_type zs_type, int offset)
+{
+	struct zcomp_strm *zstrm;
+	unsigned long handle;
+	unsigned int size;
+	void *src, *dst;
+	u32 prio;
+	int ret;
+	unsigned long page_size = PAGE_SIZE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (zs_type == ZSMALLOC_TYPE_MULTI_PAGES)
+		page_size = ZCOMP_MULTI_PAGES_SIZE;
+#endif
+
+	handle = zram_get_handle(zram, index);
+	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
+		unsigned long value;
+		void *mem;
+
+		value = handle ? zram_get_element(zram, index) : 0;
+		mem = kmap_atomic(page);
+		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+		zram_fill_page(mem, PAGE_SIZE, value); //multi_pages partial read
+		kunmap_atomic(mem);
+		return 0;
+	}
+
+	size = zram_get_obj_size(zram, index);
+
+	if (size != page_size) {
+		prio = zram_get_priority(zram, index);
+		zstrm = zcomp_stream_get(zram->comps[prio]);
+	}
+
+	src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+	if (size == page_size) {
+		dst = kmap_atomic(page);
+			atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+			memcpy(dst, src + (offset << PAGE_SHIFT), PAGE_SIZE);	//multi_pages partial read
+		kunmap_atomic(dst);
+		ret = 0;
+	} else {
+		dst = kmap_atomic(page);
+		//use zstrm->buffer to store decompress thp and copy page to dst
+		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+		ret = zcomp_decompress(zstrm, src, size, zstrm->buffer, page_size);
+		memcpy(dst, zstrm->buffer + (offset << PAGE_SHIFT), PAGE_SIZE);  //multi_pages partial read
+		kunmap_atomic(dst);
+		zcomp_stream_put(zram->comps[prio]);
+	}
+	zs_unmap_object(zram->mem_pool, handle);
+	return ret;
+}
+
+/*
+ * Use a temporary buffer to decompress the page, as the decompressor
+ * always expects a full page for the output.
+ */
+static int zram_bvec_read_multi_pages_partial(struct zram *zram, struct page *page, u32 index,
+			  struct bio *parent, int offset)
+{
+	int ret;
+	zram_slot_lock(zram, index);
+	if (!zram_test_flag(zram, index, ZRAM_WB)) {
+		/* Slot should be locked through out the function call */
+		ret = zram_read_partial_from_zspool(zram, page, index, ZSMALLOC_TYPE_MULTI_PAGES, offset);
+		zram_slot_unlock(zram, index);
+	} else {
+		/*
+		 * The slot should be unlocked before reading from the backing
+		 * device.
+		 */
+		zram_slot_unlock(zram, index);
+
+		ret = read_from_bdev(zram, page, zram_get_element(zram, index),
+				     parent);
+	}
+
+	/* Should NEVER happen. Return bio error if it does. */
+	if (WARN_ON(ret < 0))
+		pr_err("Decompression failed! err=%d, page=%u offset=%d\n", ret, index,offset);
+
+	return ret;
+}

> +static int zram_bvec_read_multi_pages(struct zram *zram, struct bio_vec *bvec,
> +			  u32 index, int offset, struct bio *bio)
> +{
> +	if (is_multi_pages_partial_io(bvec))
> +		return zram_bvec_read_multi_pages_partial(zram, bvec, index, offset);

should be:
	return zram_bvec_read_multi_pages_partial(zram, bvec->bv_page, index, bio, offset);

> +	return zram_read_multi_pages(zram, bvec->bv_page, index, bio);
> +}
> +
> +/*
> + * This is a partial IO. Read the full page before writing the changes.
> + */
> +static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
> +				   u32 index, int offset, struct bio *bio)
> +{
> +	struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
> +	int ret;
> +	void *src, *dst;
> +
> +	if (!page)
> +		return -ENOMEM;
> +
> +	ret = zram_read_multi_pages(zram, page, index, bio);
> +	if (!ret) {
> +		src = kmap_local_page(bvec->bv_page);
> +		dst = kmap_local_page(page);
> +		memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);

should be:
	memcpy(dst + (offset << PAGE_SHIFT), src + bvec->bv_offset, bvec->bv_len);

> +		kunmap_local(dst);
> +		kunmap_local(src);
> +
> +		atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
> +		ret = zram_write_page(zram, page, index);
> +	}
> +	__free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
> +	return ret;
> +}
> +
> +static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
> +			   u32 index, int offset, struct bio *bio)
> +{
> +	if (is_multi_pages_partial_io(bvec))
> +		return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
> +	return zram_write_page(zram, bvec->bv_page, index);
> +}
> +
> +
> +static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio)
> +{
> +	unsigned long start_time = bio_start_io_acct(bio);
> +	struct bvec_iter iter = bio->bi_iter;
> +
> +	do {
> +		/* Use head index, and other indexes are used as offset */
> +		u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
> +				~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
> +		u32 offset = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
> +				((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
> +		struct bio_vec *pbv = bio->bi_io_vec;
> +
> +		atomic64_add(1, &zram->stats.zram_bio_read_multi_pages_count);
> +		pbv->bv_len = min_t(u32, pbv->bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
> +
> +		if (zram_bvec_read_multi_pages(zram, pbv, index, offset, bio) < 0) {
> +			atomic64_inc(&zram->stats.multi_pages_failed_reads);
> +			bio->bi_status = BLK_STS_IOERR;
> +			break;
> +		}
> +		flush_dcache_page(pbv->bv_page);
> +
> +		zram_slot_lock(zram, index);
> +		zram_accessed(zram, index);
> +		zram_slot_unlock(zram, index);
> +
> +		bio_advance_iter_single(bio, &iter, pbv->bv_len);
> +	} while (iter.bi_size);
> +
> +	bio_end_io_acct(bio, start_time);
> +	bio_endio(bio);
> +}
> +
> +static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio)
> +{
> +	unsigned long start_time = bio_start_io_acct(bio);
> +	struct bvec_iter iter = bio->bi_iter;
> +
> +	do {
> +		/* Use head index, and other indexes are used as offset */
> +		u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
> +				~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
> +		u32 offset = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
> +				((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
> +		struct bio_vec *pbv = bio->bi_io_vec;
> +
> +		pbv->bv_len = min_t(u32, pbv->bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
> +
> +		atomic64_add(1, &zram->stats.zram_bio_write_multi_pages_count);
> +		if (zram_bvec_write_multi_pages(zram, pbv, index, offset, bio) < 0) {
> +			atomic64_inc(&zram->stats.multi_pages_failed_writes);
> +			bio->bi_status = BLK_STS_IOERR;
> +			break;
> +		}
> +
> +		zram_slot_lock(zram, index);
> +		zram_accessed(zram, index);
> +		zram_slot_unlock(zram, index);
> +
> +		bio_advance_iter_single(bio, &iter, pbv->bv_len);
> +	} while (iter.bi_size);
> +
> +	bio_end_io_acct(bio, start_time);
> +	bio_endio(bio);
> +}
> +#else
> +static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
> +{
> +	return false;
> +}
> +
> +static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
> +{
> +	return false;
> +}
> +static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio) {}
> +static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio) {}
> +#endif
> +
> +static void zram_bio_read(struct zram *zram, struct bio *bio)
> +{
> +	if (test_multi_pages_comp(zram, bio))
> +		zram_bio_read_multi_pages(zram, bio);
> +	else
> +		zram_bio_read_page(zram, bio);
> +}
> +
> +static void zram_bio_write(struct zram *zram, struct bio *bio)
> +{
> +	if (want_multi_pages_comp(zram, bio))
> +		zram_bio_write_multi_pages(zram, bio);
> +	else
> +		zram_bio_write_page(zram, bio);
> +}
> +
>  /*
>   * Handler function for all zram I/O requests.
>   */
> @@ -1962,6 +2262,25 @@ static void zram_slot_free_notify(struct block_device *bdev,
>  		return;
>  	}
>  
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	int comp_count = __test_multi_pages_comp(zram, index);
> +
> +	if (comp_count > 1) {
> +		zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
> +		zram_slot_unlock(zram, index);
> +		return;
> +	} else if (comp_count == 1) {
> +		zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
> +		zram_slot_unlock(zram, index);
> +		/*only need to free head index*/
> +		index &= ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
> +		if (!zram_slot_trylock(zram, index)) {
> +			atomic64_inc(&zram->stats.multi_pages_miss_free);
> +			return;
> +		}
> +	}
> +#endif
> +
>  	zram_free_page(zram, index);
>  	zram_slot_unlock(zram, index);
>  }
> @@ -2158,6 +2477,9 @@ static struct attribute *zram_disk_attrs[] = {
>  #endif
>  	&dev_attr_io_stat.attr,
>  	&dev_attr_mm_stat.attr,
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	&dev_attr_multi_pages_debug_stat.attr,
> +#endif
>  #ifdef CONFIG_ZRAM_WRITEBACK
>  	&dev_attr_bd_stat.attr,
>  #endif
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 37bf29f34d26..8481271b3ceb 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -38,7 +38,14 @@
>   *
>   * We use BUILD_BUG_ON() to make sure that zram pageflags don't overflow.
>   */
> +
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +#define ZRAM_FLAG_SHIFT (CONT_PTE_SHIFT + 1)
> +#else
>  #define ZRAM_FLAG_SHIFT (PAGE_SHIFT + 1)
> +#endif
> +
> +#define ENABLE_HUGEPAGE_ZRAM_DEBUG 1
>  
>  /* Only 2 bits are allowed for comp priority index */
>  #define ZRAM_COMP_PRIORITY_MASK	0x3
> @@ -57,6 +64,10 @@ enum zram_pageflags {
>  	ZRAM_COMP_PRIORITY_BIT1, /* First bit of comp priority index */
>  	ZRAM_COMP_PRIORITY_BIT2, /* Second bit of comp priority index */
>  
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	ZRAM_COMP_MULTI_PAGES,	/* Compressed by multi-pages */
> +#endif
> +
>  	__NR_ZRAM_PAGEFLAGS,
>  };
>  
> @@ -91,6 +102,16 @@ struct zram_stats {
>  	atomic64_t bd_reads;		/* no. of reads from backing device */
>  	atomic64_t bd_writes;		/* no. of writes from backing device */
>  #endif
> +
> +#ifdef CONFIG_ZRAM_MULTI_PAGES
> +	atomic64_t zram_bio_write_multi_pages_count;
> +	atomic64_t zram_bio_read_multi_pages_count;
> +	atomic64_t multi_pages_failed_writes;
> +	atomic64_t multi_pages_failed_reads;
> +	atomic64_t zram_bio_write_multi_pages_partial_count;
> +	atomic64_t zram_bio_read_multi_pages_partial_count;
> +	atomic64_t multi_pages_miss_free;
> +#endif
>  };
>  
>  #ifdef CONFIG_ZRAM_MULTI_COMP
> -- 
> 2.34.1
>

Thanks
Barry



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-10-21 23:28   ` Barry Song
@ 2024-11-06 16:23     ` Usama Arif
  2024-11-07 10:25       ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Usama Arif @ 2024-11-06 16:23 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, axboe, chrisl, corbet, david, hannes, kanchana.p.sridhar,
	kasong, linux-block, linux-mm, minchan, nphamcs, senozhatsky,
	surenb, terrelln, v-songbaohua, wajdi.k.feghali, willy,
	ying.huang, yosryahmed, yuzhao, zhengtangquan, zhouchengming,
	bala.seshasayee, Johannes Weiner

On 22/10/2024 00:28, Barry Song wrote:
>> From: Tangquan Zheng <zhengtangquan@oppo.com>
>>
>> +static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
>> +			   u32 index, int offset, struct bio *bio)
>> +{
>> +	if (is_multi_pages_partial_io(bvec))
>> +		return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
>> +	return zram_write_page(zram, bvec->bv_page, index);
>> +}
>> +

Hi Barry,

I started reviewing this series just to get a better idea if we can do something
similar for zswap. I haven't looked at zram code before so this might be a basic
question:
How would you end up in zram_bvec_write_multi_pages_partial if using zram for swap?

We only swapout whole folios. If ZCOMP_MULTI_PAGES_SIZE=64K, any folio smaller
than 64K will end up in zram_bio_write_page. Folios greater than or equal to 64K
would be dispatched by zram_bio_write_multi_pages to zram_bvec_write_multi_pages
in 64K chunks. So for e.g. 128K folio would end up calling zram_bvec_write_multi_pages
twice.

Or is this for the case when you are using zram not for swap? In that case, I probably
dont need to consider zram_bvec_write_multi_pages_partial write case for zswap.

Thanks,
Usama

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-11-06 16:23     ` Usama Arif
@ 2024-11-07 10:25       ` Barry Song
  2024-11-07 10:31         ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2024-11-07 10:25 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, axboe, chrisl, corbet, david, kanchana.p.sridhar, kasong,
	linux-block, linux-mm, minchan, nphamcs, senozhatsky, surenb,
	terrelln, v-songbaohua, wajdi.k.feghali, willy, ying.huang,
	yosryahmed, yuzhao, zhengtangquan, zhouchengming,
	bala.seshasayee, Johannes Weiner

On Thu, Nov 7, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 22/10/2024 00:28, Barry Song wrote:
> >> From: Tangquan Zheng <zhengtangquan@oppo.com>
> >>
> >> +static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
> >> +                       u32 index, int offset, struct bio *bio)
> >> +{
> >> +    if (is_multi_pages_partial_io(bvec))
> >> +            return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
> >> +    return zram_write_page(zram, bvec->bv_page, index);
> >> +}
> >> +
>
> Hi Barry,
>
> I started reviewing this series just to get a better idea if we can do something
> similar for zswap. I haven't looked at zram code before so this might be a basic
> question:
> How would you end up in zram_bvec_write_multi_pages_partial if using zram for swap?

Hi Usama,

There’s a corner case where, for instance, a 32KiB mTHP is swapped
out. Then, if userspace
performs a MADV_DONTNEED on the 0~16KiB portion of this original mTHP,
it now consists
of 8 swap entries(mTHP has been released and unmapped). With
swap0-swap3 released
due to DONTNEED, they become available for reallocation, and other
folios may be swapped
out to those entries. Then it is a combination of the new smaller
folios with the original 32KiB
mTHP.

>
> We only swapout whole folios. If ZCOMP_MULTI_PAGES_SIZE=64K, any folio smaller
> than 64K will end up in zram_bio_write_page. Folios greater than or equal to 64K
> would be dispatched by zram_bio_write_multi_pages to zram_bvec_write_multi_pages
> in 64K chunks. So for e.g. 128K folio would end up calling zram_bvec_write_multi_pages
> twice.

In v2, I changed the default order to 2, allowing all anonymous mTHP
to benefit from this
feature.

>
> Or is this for the case when you are using zram not for swap? In that case, I probably
> dont need to consider zram_bvec_write_multi_pages_partial write case for zswap.
>
> Thanks,
> Usama

Thanks
barry


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-11-07 10:25       ` Barry Song
@ 2024-11-07 10:31         ` Barry Song
  2024-11-07 11:49           ` Usama Arif
  0 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2024-11-07 10:31 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, axboe, chrisl, corbet, david, kanchana.p.sridhar, kasong,
	linux-block, linux-mm, minchan, nphamcs, senozhatsky, surenb,
	terrelln, v-songbaohua, wajdi.k.feghali, willy, ying.huang,
	yosryahmed, yuzhao, zhengtangquan, zhouchengming,
	bala.seshasayee, Johannes Weiner

On Thu, Nov 7, 2024 at 11:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Nov 7, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >
> >
> >
> > On 22/10/2024 00:28, Barry Song wrote:
> > >> From: Tangquan Zheng <zhengtangquan@oppo.com>
> > >>
> > >> +static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
> > >> +                       u32 index, int offset, struct bio *bio)
> > >> +{
> > >> +    if (is_multi_pages_partial_io(bvec))
> > >> +            return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
> > >> +    return zram_write_page(zram, bvec->bv_page, index);
> > >> +}
> > >> +
> >
> > Hi Barry,
> >
> > I started reviewing this series just to get a better idea if we can do something
> > similar for zswap. I haven't looked at zram code before so this might be a basic
> > question:
> > How would you end up in zram_bvec_write_multi_pages_partial if using zram for swap?
>
> Hi Usama,
>
> There’s a corner case where, for instance, a 32KiB mTHP is swapped
> out. Then, if userspace
> performs a MADV_DONTNEED on the 0~16KiB portion of this original mTHP,
> it now consists
> of 8 swap entries(mTHP has been released and unmapped). With
> swap0-swap3 released
> due to DONTNEED, they become available for reallocation, and other
> folios may be swapped
> out to those entries. Then it is a combination of the new smaller
> folios with the original 32KiB
> mTHP.

Sorry, I forgot to mention that the assumption is ZSMALLOC_MULTI_PAGES_ORDER=3,
so data is compressed in 32KiB blocks.

With Chris' and Kairui's new swap optimization, this should be minor,
as each cluster has
its own order. However, I recall that order-0 can still steal swap
slots from other orders'
clusters when swap space is limited by scanning all slots? Please
correct me if I'm
wrong, Kairui and Chris.

>
> >
> > We only swapout whole folios. If ZCOMP_MULTI_PAGES_SIZE=64K, any folio smaller
> > than 64K will end up in zram_bio_write_page. Folios greater than or equal to 64K
> > would be dispatched by zram_bio_write_multi_pages to zram_bvec_write_multi_pages
> > in 64K chunks. So for e.g. 128K folio would end up calling zram_bvec_write_multi_pages
> > twice.
>
> In v2, I changed the default order to 2, allowing all anonymous mTHP
> to benefit from this
> feature.
>
> >
> > Or is this for the case when you are using zram not for swap? In that case, I probably
> > dont need to consider zram_bvec_write_multi_pages_partial write case for zswap.
> >
> > Thanks,
> > Usama
>

Thanks
barry


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-11-07 10:31         ` Barry Song
@ 2024-11-07 11:49           ` Usama Arif
  2024-11-07 20:53             ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Usama Arif @ 2024-11-07 11:49 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, axboe, chrisl, corbet, david, kanchana.p.sridhar, kasong,
	linux-block, linux-mm, minchan, nphamcs, senozhatsky, surenb,
	terrelln, v-songbaohua, wajdi.k.feghali, willy, ying.huang,
	yosryahmed, yuzhao, zhengtangquan, zhouchengming,
	bala.seshasayee, Johannes Weiner



On 07/11/2024 10:31, Barry Song wrote:
> On Thu, Nov 7, 2024 at 11:25 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Thu, Nov 7, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>
>>>
>>>
>>> On 22/10/2024 00:28, Barry Song wrote:
>>>>> From: Tangquan Zheng <zhengtangquan@oppo.com>
>>>>>
>>>>> +static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
>>>>> +                       u32 index, int offset, struct bio *bio)
>>>>> +{
>>>>> +    if (is_multi_pages_partial_io(bvec))
>>>>> +            return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
>>>>> +    return zram_write_page(zram, bvec->bv_page, index);
>>>>> +}
>>>>> +
>>>
>>> Hi Barry,
>>>
>>> I started reviewing this series just to get a better idea if we can do something
>>> similar for zswap. I haven't looked at zram code before so this might be a basic
>>> question:
>>> How would you end up in zram_bvec_write_multi_pages_partial if using zram for swap?
>>
>> Hi Usama,
>>
>> There’s a corner case where, for instance, a 32KiB mTHP is swapped
>> out. Then, if userspace
>> performs a MADV_DONTNEED on the 0~16KiB portion of this original mTHP,
>> it now consists
>> of 8 swap entries(mTHP has been released and unmapped). With
>> swap0-swap3 released
>> due to DONTNEED, they become available for reallocation, and other
>> folios may be swapped
>> out to those entries. Then it is a combination of the new smaller
>> folios with the original 32KiB
>> mTHP.
> 

Hi Barry,

Thanks for this. So in this example of 32K folio, when swap slots 0-3 are
released zram_slot_free_notify will only clear the ZRAM_COMP_MULTI_PAGES
flag on the 0-3 index and return (without calling zram_free_page on them).

I am assuming that if another folio is now swapped out to those entries,
zram allows to overwrite those pages, eventhough they haven't been freed?

Also, even if its allowed, I still dont think you will end up in
zram_bvec_write_multi_pages_partial when you try to write a 16K or
smaller folio to swap0-3. As want_multi_pages_comp will evaluate to false
as 16K is less than 32K, you will just end up in zram_bio_write_page?

Thanks,
Usama


> Sorry, I forgot to mention that the assumption is ZSMALLOC_MULTI_PAGES_ORDER=3,
> so data is compressed in 32KiB blocks.
> 
> With Chris' and Kairui's new swap optimization, this should be minor,
> as each cluster has
> its own order. However, I recall that order-0 can still steal swap
> slots from other orders'
> clusters when swap space is limited by scanning all slots? Please
> correct me if I'm
> wrong, Kairui and Chris.
> 
>>
>>>
>>> We only swapout whole folios. If ZCOMP_MULTI_PAGES_SIZE=64K, any folio smaller
>>> than 64K will end up in zram_bio_write_page. Folios greater than or equal to 64K
>>> would be dispatched by zram_bio_write_multi_pages to zram_bvec_write_multi_pages
>>> in 64K chunks. So for e.g. 128K folio would end up calling zram_bvec_write_multi_pages
>>> twice.
>>
>> In v2, I changed the default order to 2, allowing all anonymous mTHP
>> to benefit from this
>> feature.
>>
>>>
>>> Or is this for the case when you are using zram not for swap? In that case, I probably
>>> dont need to consider zram_bvec_write_multi_pages_partial write case for zswap.
>>>
>>> Thanks,
>>> Usama
>>
> 
> Thanks
> barry



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages
  2024-11-07 11:49           ` Usama Arif
@ 2024-11-07 20:53             ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2024-11-07 20:53 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, axboe, chrisl, corbet, david, kanchana.p.sridhar, kasong,
	linux-block, linux-mm, minchan, nphamcs, senozhatsky, surenb,
	terrelln, v-songbaohua, wajdi.k.feghali, willy, ying.huang,
	yosryahmed, yuzhao, zhengtangquan, zhouchengming,
	bala.seshasayee, Johannes Weiner

On Fri, Nov 8, 2024 at 12:49 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 07/11/2024 10:31, Barry Song wrote:
> > On Thu, Nov 7, 2024 at 11:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Thu, Nov 7, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> On 22/10/2024 00:28, Barry Song wrote:
> >>>>> From: Tangquan Zheng <zhengtangquan@oppo.com>
> >>>>>
> >>>>> +static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
> >>>>> +                       u32 index, int offset, struct bio *bio)
> >>>>> +{
> >>>>> +    if (is_multi_pages_partial_io(bvec))
> >>>>> +            return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
> >>>>> +    return zram_write_page(zram, bvec->bv_page, index);
> >>>>> +}
> >>>>> +
> >>>
> >>> Hi Barry,
> >>>
> >>> I started reviewing this series just to get a better idea if we can do something
> >>> similar for zswap. I haven't looked at zram code before so this might be a basic
> >>> question:
> >>> How would you end up in zram_bvec_write_multi_pages_partial if using zram for swap?
> >>
> >> Hi Usama,
> >>
> >> There’s a corner case where, for instance, a 32KiB mTHP is swapped
> >> out. Then, if userspace
> >> performs a MADV_DONTNEED on the 0~16KiB portion of this original mTHP,
> >> it now consists
> >> of 8 swap entries(mTHP has been released and unmapped). With
> >> swap0-swap3 released
> >> due to DONTNEED, they become available for reallocation, and other
> >> folios may be swapped
> >> out to those entries. Then it is a combination of the new smaller
> >> folios with the original 32KiB
> >> mTHP.
> >
>
> Hi Barry,
>
> Thanks for this. So in this example of 32K folio, when swap slots 0-3 are
> released zram_slot_free_notify will only clear the ZRAM_COMP_MULTI_PAGES
> flag on the 0-3 index and return (without calling zram_free_page on them).
>
> I am assuming that if another folio is now swapped out to those entries,
> zram allows to overwrite those pages, eventhough they haven't been freed?

Correct. This is a typical case for zRAM. zRAM allows zram_slot_free_notify()
to be skipped entirely (known as miss_free). As long as swap_map[] indicates
that the slots are free, they can be reused.

>
> Also, even if its allowed, I still dont think you will end up in
> zram_bvec_write_multi_pages_partial when you try to write a 16K or
> smaller folio to swap0-3. As want_multi_pages_comp will evaluate to false
> as 16K is less than 32K, you will just end up in zram_bio_write_page?

Until all slots are cleared from ZRAM_COMP_MULTI_PAGES, these entries
remain available for storing small folios. Prior to this, the large
block remains
intact. For instance, if swap0 to swap3 are free and swap4 to swap7
still reference
the old compressed mTHP, writing only to swap0 would modify the large block.

static inline int __test_multi_pages_comp(struct zram *zram, u32 index)
{
        int i;
        int count = 0;

        int head_index = index & ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);

        for (i = 0; i < ZCOMP_MULTI_PAGES_NR; i++) {
                if (zram_test_flag(zram, head_index + i, ZRAM_COMP_MULTI_PAGES))
                        count++;
        }

        return count;
}

a mapping exists between the head index and the large block of zsmalloc. As long
as any entry with the same head index remains, the large block persists.

Another possible option is:
swap4 to swap7 indexes reference the old large block, while swap0 to
swap3 point to
new small blocks compressed from small folios. This approach would
greatly increase
implementation complexity and could also raise zRAM's memory consumption.

With Chris's and Kairui's swap allocation optimizations, hopefully,
this corner case
will remain minimal.

>
> Thanks,
> Usama
>
>
> > Sorry, I forgot to mention that the assumption is ZSMALLOC_MULTI_PAGES_ORDER=3,
> > so data is compressed in 32KiB blocks.
> >
> > With Chris' and Kairui's new swap optimization, this should be minor,
> > as each cluster has
> > its own order. However, I recall that order-0 can still steal swap
> > slots from other orders'
> > clusters when swap space is limited by scanning all slots? Please
> > correct me if I'm
> > wrong, Kairui and Chris.
> >
> >>
> >>>
> >>> We only swapout whole folios. If ZCOMP_MULTI_PAGES_SIZE=64K, any folio smaller
> >>> than 64K will end up in zram_bio_write_page. Folios greater than or equal to 64K
> >>> would be dispatched by zram_bio_write_multi_pages to zram_bvec_write_multi_pages
> >>> in 64K chunks. So for e.g. 128K folio would end up calling zram_bvec_write_multi_pages
> >>> twice.
> >>
> >> In v2, I changed the default order to 2, allowing all anonymous mTHP
> >> to benefit from this
> >> feature.
> >>
> >>>
> >>> Or is this for the case when you are using zram not for swap? In that case, I probably
> >>> dont need to consider zram_bvec_write_multi_pages_partial write case for zswap.
> >>>
> >>> Thanks,
> >>> Usama
> >>
> >
> > Thanks
> > barry
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages
  2024-03-27 21:48 [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
  2024-03-27 21:48 ` [PATCH RFC 1/2] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
  2024-03-27 21:48 ` [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages Barry Song
@ 2024-03-27 22:01 ` Barry Song
  2 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2024-03-27 22:01 UTC (permalink / raw)
  To: akpm, minchan, senozhatsky, linux-block, axboe, linux-mm, Ryan Roberts
  Cc: terrelln, chrisl, david, kasong, yuzhao, yosryahmed, nphamcs,
	willy, hannes, ying.huang, surenb, wajdi.k.feghali,
	kanchana.p.sridhar, corbet, zhouchengming, Barry Song,
	郑堂权(Blues Zheng)

Apologies for the top posting.

+Ryan, I missed adding Ryan at the last moment :-)

On Thu, Mar 28, 2024 at 10:48 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> mTHP is generally considered to potentially waste memory due to fragmentation,
> but it may also serve as a source of memory savings.
> When large folios are compressed at a larger granularity, we observe a remarkable
> decrease in CPU utilization and a significant improvement in compression ratios.
>
> The following data illustrates the time and compressed data for typical anonymous
> pages gathered from Android phones.
>
> granularity   orig_data_size   compr_data_size   time(us)
> 4KiB-zstd      1048576000       246876055        50259962
> 64KiB-zstd     1048576000       199763892        18330605
>
> Due to mTHP's ability to be swapped out without splitting[1] and swapped in as a
> whole[2], it enables compression and decompression to be performed at larger
> granularities.
>
> This patchset enhances zsmalloc and zram by introducing support for dividing large
> folios into multi-pages, typically configured with a 4-order granularity. Here are
> concrete examples:
>
> * If a large folio's size is 32KiB, it will still be compressed and stored at a 4KiB
>   granularity.
> * If a large folio's size is 64KiB, it will be compressed and stored as a single 64KiB
>   block.
> * If a large folio's size is 128KiB, it will be compressed and stored as two 64KiB
>   multi-pages.
>
> Without the patchset, a large folio is always divided into nr_pages 4KiB blocks.
>
> The granularity can be configured using the ZSMALLOC_MULTI_PAGES_ORDER setting.
>
> [1] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
>
> Tangquan Zheng (2):
>   mm: zsmalloc: support objects compressed based on multiple pages
>   zram: support compression at the granularity of multi-pages
>
>  drivers/block/zram/Kconfig    |   9 +
>  drivers/block/zram/zcomp.c    |  23 ++-
>  drivers/block/zram/zcomp.h    |  12 +-
>  drivers/block/zram/zram_drv.c | 372 +++++++++++++++++++++++++++++++---
>  drivers/block/zram/zram_drv.h |  21 ++
>  include/linux/zsmalloc.h      |  10 +-
>  mm/Kconfig                    |  18 ++
>  mm/zsmalloc.c                 | 215 +++++++++++++++-----
>  8 files changed, 586 insertions(+), 94 deletions(-)
>
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-11-07 20:54 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-27 21:48 [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song
2024-03-27 21:48 ` [PATCH RFC 1/2] mm: zsmalloc: support objects compressed based on multiple pages Barry Song
2024-10-21 23:26   ` Barry Song
2024-03-27 21:48 ` [PATCH RFC 2/2] zram: support compression at the granularity of multi-pages Barry Song
2024-04-11  0:40   ` Sergey Senozhatsky
2024-04-11  1:24     ` Barry Song
2024-04-11  1:42   ` Sergey Senozhatsky
2024-04-11  2:03     ` Barry Song
2024-04-11  4:14       ` Sergey Senozhatsky
2024-04-11  7:49         ` Barry Song
2024-04-19  3:41           ` Sergey Senozhatsky
2024-10-21 23:28   ` Barry Song
2024-11-06 16:23     ` Usama Arif
2024-11-07 10:25       ` Barry Song
2024-11-07 10:31         ` Barry Song
2024-11-07 11:49           ` Usama Arif
2024-11-07 20:53             ` Barry Song
2024-03-27 22:01 ` [PATCH RFC 0/2] mTHP-friendly compression in zsmalloc and zram based on multi-pages Barry Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox