* [RFC PATCH 0/2] zsmalloc: size-classes chain-length tunings
@ 2026-01-01 1:38 Sergey Senozhatsky
2026-01-01 1:38 ` [RFC PATCH 1/2] zsmalloc: drop hard limit on the number of size classes Sergey Senozhatsky
2026-01-01 1:38 ` [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics Sergey Senozhatsky
0 siblings, 2 replies; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-01 1:38 UTC (permalink / raw)
To: Andrew Morton, Yosry Ahmed, Nhat Pham
Cc: Minchan Kim, Johannes Weiner, Brian Geffon, linux-kernel,
linux-mm, Sergey Senozhatsky
This is an RFC series that follows up on the 16K PAGE_SIZE handling
discussion [1].
[1] https://lore.kernel.org/linux-mm/fui4gqm6pealaxooz3xv3dnnqxscefyvhw5bhntedwh4tgjvdq@ootmbuoc3dpa
Sergey Senozhatsky (2):
zsmalloc: drop hard limit on the number of size classes
zsmalloc: chain-length configuration should consider other metrics
mm/zsmalloc.c | 48 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 10 deletions(-)
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 1/2] zsmalloc: drop hard limit on the number of size classes
2026-01-01 1:38 [RFC PATCH 0/2] zsmalloc: size-classes chain-length tunings Sergey Senozhatsky
@ 2026-01-01 1:38 ` Sergey Senozhatsky
2026-01-01 1:38 ` [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics Sergey Senozhatsky
1 sibling, 0 replies; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-01 1:38 UTC (permalink / raw)
To: Andrew Morton, Yosry Ahmed, Nhat Pham
Cc: Minchan Kim, Johannes Weiner, Brian Geffon, linux-kernel,
linux-mm, Sergey Senozhatsky
For reasons unknown, zsmalloc limits the number of size-classes
to 256. On PAGE_SIZE 4K systems this works well, as those
256 classes are 4096/256 = 16 bytes (the size-class delta) apart.
However, as PAGE_SIZE grows, e.g. to 16K, the hard limit forces a
much larger size-class delta (e.g. 16384/256 = 64 bytes), leading
to increased internal fragmentation. For example, on a 16K page system,
an object of size 65 bytes is rounded up to the next 64-byte boundary
(128 bytes), wasting nearly 50% of the allocated space.
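The same round-up arithmetic, sketched as a small user-space program
(illustrative only: the 65-byte object and rounding to a plain multiple
of the delta are simplifications, the real class layout also depends on
ZS_MIN_ALLOC_SIZE):
	#include <stdio.h>
	int main(void)
	{
		const unsigned long page_sizes[] = { 4096, 16384, 65536 };
		const unsigned long obj_size = 65;	/* hypothetical object size */
		for (unsigned int i = 0; i < sizeof(page_sizes) / sizeof(page_sizes[0]); i++) {
			unsigned long delta = page_sizes[i] / 256;	/* 256-class hard limit */
			/* round the object size up to the next multiple of delta */
			unsigned long rounded = (obj_size + delta - 1) / delta * delta;
			printf("PAGE_SIZE %6lu: delta %3lu, %lu-byte object stored as %lu bytes (%.0f%% waste)\n",
			       page_sizes[i], delta, obj_size, rounded,
			       100.0 * (rounded - obj_size) / rounded);
		}
		return 0;
	}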
Instead of deriving the size-class delta from PAGE_SIZE and the
hard limit of 256, set ZS_SIZE_CLASS_DELTA to a constant value of
16 bytes. This results in far more than 256 size classes on systems
with PAGE_SIZE larger than 4K. These extra size classes split existing
clusters into smaller ones. For example, using the tool [1] on a
16K PAGE_SIZE system with chain size 8:
BASE (delta 64 bytes)
=====================
Log | Phys | Chain | Objs/Page | TailWaste | MergeWaste
[..]
1072 | 1120 | 8 | 117 | 32 | 5616
1088 | 1120 | 8 | 117 | 32 | 3744
1104 | 1120 | 8 | 117 | 32 | 1872
1120 | 1120 | 8 | 117 | 32 | 0
[..]
PATCHED (delta 16 bytes)
========================
[..]
1072 | 1072 | 4 | 61 | 144 | 0
1088 | 1088 | 1 | 15 | 64 | 0
1104 | 1104 | 6 | 89 | 48 | 0
1120 | 1120 | 8 | 117 | 32 | 0
[..]
In the default configuration (delta 64), size classes 1072 to 1104
are merged into 1120. Size class 1120 holds 117 objects
per zspage, so in the worst case every zspage can lose 5616 bytes
((1120 - 1072) * 117). With delta 16 this cluster doesn't
exist, reducing memory waste.
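The MergeWaste column above can be reproduced with a trivial user-space
sketch (it mirrors the simulator's [1] arithmetic rather than zsmalloc
internals):
	#include <stdio.h>
	int main(void)
	{
		const int phys_class = 1120;		/* the class the others merge into */
		const int objs_per_zspage = 117;	/* from the table above */
		for (int logical = 1072; logical <= 1120; logical += 16)
			printf("logical %4d -> merge waste per zspage: %4d bytes\n",
			       logical, (phys_class - logical) * objs_per_zspage);
		return 0;
	}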
[1] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/simulate_zsmalloc.c
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5bf832f9c05c..5e7501d36161 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -92,7 +92,7 @@
#define HUGE_BITS 1
#define FULLNESS_BITS 4
-#define CLASS_BITS 8
+#define CLASS_BITS 12
#define MAGIC_VAL_BITS 8
#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(CONFIG_ZSMALLOC_CHAIN_SIZE, UL))
@@ -115,8 +115,13 @@
*
* ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
* (reason above)
+ *
+ * We set ZS_SIZE_CLASS_DELTA to 16 bytes to maintain high granularity
+ * even on systems with large PAGE_SIZE (e.g. 16K, 64K). This limits
+ * internal fragmentation. CLASS_BITS is increased to 12 to accommodate the
+ * larger number of size classes on such systems (up to 4096 classes on 64K).
*/
-#define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> CLASS_BITS)
+#define ZS_SIZE_CLASS_DELTA 16
#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
ZS_SIZE_CLASS_DELTA) + 1)
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-01 1:38 [RFC PATCH 0/2] zsmalloc: size-classes chain-length tunings Sergey Senozhatsky
2026-01-01 1:38 ` [RFC PATCH 1/2] zsmalloc: drop hard limit on the number of size classes Sergey Senozhatsky
@ 2026-01-01 1:38 ` Sergey Senozhatsky
2026-01-02 18:29 ` Yosry Ahmed
1 sibling, 1 reply; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-01 1:38 UTC (permalink / raw)
To: Andrew Morton, Yosry Ahmed, Nhat Pham
Cc: Minchan Kim, Johannes Weiner, Brian Geffon, linux-kernel,
linux-mm, Sergey Senozhatsky
This is the first step towards re-thinking the optimization strategy
for chain-size (the number of 0-order physical pages a zspage
chains together for optimal performance) configuration. Currently,
we only consider one metric - "wasted" memory - and try various
chain length configurations in order to find the one with the
least wasted space. However, this strategy doesn't consider
the fact that our optimization space is not single-dimensional.
When we increase the zspage chain length we at the same time increase
the number of spanning objects (objects that span two physical pages).
Such objects slow down read() operations because zsmalloc needs to
kmap both pages and memcpy the object's chunks, which increases
CPU usage and battery drain.
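For illustration, a simplified user-space sketch of the spanning-objects
metric (it ignores ZS_HANDLE_SIZE and other layout details, so the
absolute numbers will differ from what [2] reports):
	#include <stdio.h>
	#define PAGE_SIZE 4096UL
	static unsigned long spanning_objects(unsigned long class_size,
					      unsigned long chain_size)
	{
		unsigned long nr_objs = chain_size * PAGE_SIZE / class_size;
		unsigned long spans = 0;
		for (unsigned long i = 0; i < nr_objs; i++) {
			unsigned long start = i * class_size;
			unsigned long end = start + class_size - 1;
			/* the object crosses a page boundary if start and end fall on different pages */
			if (start / PAGE_SIZE != end / PAGE_SIZE)
				spans++;
		}
		return spans;
	}
	int main(void)
	{
		const unsigned long class_size = 368;	/* hypothetical size class */
		for (unsigned long chain = 1; chain <= 10; chain++)
			printf("class %lu, chain %2lu: %lu objects, %lu spanning\n",
			       class_size, chain, chain * PAGE_SIZE / class_size,
			       spanning_objects(class_size, chain));
		return 0;
	}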
We will most likely need to consider numerous metrics and optimize
in a multi-dimensional space. These can be wired in later; for now
we add a heuristic that increases the zspage chain length only
if it brings substantial memory savings. The threshold values can
be tuned (there is a simple user-space tool [2] to experiment with
those knobs), but what we currently have is already interesting
enough. Where does this bring us? Using a synthetic test [1],
which produces byte-to-byte comparable workloads, on a
4K PAGE_SIZE, chain size 10 system:
BASE
====
zsmalloc_test: num write objects: 339598
zsmalloc_test: pool pages used 175111, total allocated size 698213488
zsmalloc_test: pool memory utilization: 97.3
zsmalloc_test: num read objects: 339598
zsmalloc_test: spanning objects: 110377, total memcpy size: 278318624
PATCHED
=======
zsmalloc_test: num write objects: 339598
zsmalloc_test: pool pages used 175920, total allocated size 698213488
zsmalloc_test: pool memory utilization: 96.8
zsmalloc_test: num read objects: 339598
zsmalloc_test: spanning objects: 103256, total memcpy size: 265378608
At the price of a 0.5% increase in pool memory usage there was a 6.5%
reduction in the number of spanning objects (4.6% fewer copied bytes).
Note, the results are specific to this particular test case. The
savings are not uniformly distributed: according to [2], for some
size classes the number of spanning objects per zspage goes
down from 7 to 0 (e.g. size class 368), for others
from 4 to 2 (e.g. size class 640). So the actual memcpy savings
are data-pattern dependent, as always.
[1] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/0001-zsmalloc-add-zsmalloc_test-module.patch
[2] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/simulate_zsmalloc.c
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 39 +++++++++++++++++++++++++++++++--------
1 file changed, 31 insertions(+), 8 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5e7501d36161..929db7cf6c19 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -2000,22 +2000,45 @@ static int zs_register_shrinker(struct zs_pool *pool)
static int calculate_zspage_chain_size(int class_size)
{
int i, min_waste = INT_MAX;
- int chain_size = 1;
+ int best_chain_size = 1;
if (is_power_of_2(class_size))
- return chain_size;
+ return best_chain_size;
for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
- int waste;
+ int curr_waste = (i * PAGE_SIZE) % class_size;
- waste = (i * PAGE_SIZE) % class_size;
- if (waste < min_waste) {
- min_waste = waste;
- chain_size = i;
+ if (curr_waste == 0)
+ return i;
+
+ /*
+ * Accept the new chain size if:
+ * 1. The current best is wasteful (> 10% of zspage size),
+ * accept anything that is better.
+ * 2. The current best is efficient, accept only significant
+ * (25%) improvement.
+ */
+ if (min_waste * 10 > best_chain_size * PAGE_SIZE) {
+ if (curr_waste < min_waste) {
+ min_waste = curr_waste;
+ best_chain_size = i;
+ }
+ } else {
+ if (curr_waste * 4 < min_waste * 3) {
+ min_waste = curr_waste;
+ best_chain_size = i;
+ }
}
+
+ /*
+ * If the current best chain has low waste (approx < 1.5%
+ * relative to zspage size) then accept it right away.
+ */
+ if (min_waste * 64 <= best_chain_size * PAGE_SIZE)
+ break;
}
- return chain_size;
+ return best_chain_size;
}
/**
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-01 1:38 ` [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics Sergey Senozhatsky
@ 2026-01-02 18:29 ` Yosry Ahmed
2026-01-05 1:42 ` Sergey Senozhatsky
0 siblings, 1 reply; 14+ messages in thread
From: Yosry Ahmed @ 2026-01-02 18:29 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Nhat Pham, Minchan Kim, Johannes Weiner,
Brian Geffon, linux-kernel, linux-mm
On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> This is the first step towards re-thinking optimization strategy
> during chain-size (the number of 0-order physical pages a zspage
> chains for most optimal performance) configuration. Currently,
> we only consider one metric - "wasted" memory - and try various
> chain length configurations in order to find the minimal wasted
> space configuration. However, this strategy doesn't consider
> the fact that our optimization space is not single-dimensional.
> When we increase zspage chain length we at the same increase the
> number of spanning objects (objects that span two physical pages).
> Such objects slow down read() operations because zsmalloc needs to
> kmap both pages and memcpy objects' chunks. This clearly increases
> CPU usage and battery drain.
>
> We, most likely, need to consider numerous metrics and optimize
> in a multi-dimensional space. These can be wired in later on, for
> now we just add some heuristic to increase zspage chain length only
> if there are substantial savings memory usage wise. We can tune
> these threshold values (there is a simple user-space tool [2] to
> experiment with those knobs), but what we currently is already
> interesting enough. Where does this bring us, using a synthetic
> test [1], which produces byte-to-byte comparable workloads, on a
> 4K PAGE_SIZE, chain size 10 system:
>
> BASE
> ====
> zsmalloc_test: num write objects: 339598
> zsmalloc_test: pool pages used 175111, total allocated size 698213488
> zsmalloc_test: pool memory utilization: 97.3
> zsmalloc_test: num read objects: 339598
> zsmalloc_test: spanning objects: 110377, total memcpy size: 278318624
>
> PATCHED
> =======
> zsmalloc_test: num write objects: 339598
> zsmalloc_test: pool pages used 175920, total allocated size 698213488
> zsmalloc_test: pool memory utilization: 96.8
> zsmalloc_test: num read objects: 339598
> zsmalloc_test: spanning objects: 103256, total memcpy size: 265378608
>
> At a price of 0.5% increased pool memory usage there was a 6.5%
> reduction in a number of spanning objects (4.6% less copied bytes).
>
> Note, the results are specific to this particular test case. The
> savings are not uniformly distributed: according to [2] for some
> size classes the reduction in the number of spanning objects
> per-zspage goes down from 7 to 0 (e.g. size class 368), for other
> from 4 to 2 (e.g. size class 640). So the actual memcpy savings
> are data-pattern dependent, as always.
I worry that the heuristics are too hand-wavy, and I wonder if the
memcpy savings actually show up as perf improvements in any real life
workload. Do we have data about this?
I also vaguely recall discussions about other ways to avoid the memcpy
using scatterlists, so I am wondering if this is the right metric to
optimize.
What are the main pain points for PAGE_SIZE > 4K configs? Is it the
compression/decompression time? In my experience this is usually not the
bottleneck; I would imagine the real problem is the internal
fragmentation.
>
> [1] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/0001-zsmalloc-add-zsmalloc_test-module.patch
> [2] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/simulate_zsmalloc.c
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
> mm/zsmalloc.c | 39 +++++++++++++++++++++++++++++++--------
> 1 file changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 5e7501d36161..929db7cf6c19 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -2000,22 +2000,45 @@ static int zs_register_shrinker(struct zs_pool *pool)
> static int calculate_zspage_chain_size(int class_size)
> {
> int i, min_waste = INT_MAX;
> - int chain_size = 1;
> + int best_chain_size = 1;
>
> if (is_power_of_2(class_size))
> - return chain_size;
> + return best_chain_size;
>
> for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
> - int waste;
> + int curr_waste = (i * PAGE_SIZE) % class_size;
>
> - waste = (i * PAGE_SIZE) % class_size;
> - if (waste < min_waste) {
> - min_waste = waste;
> - chain_size = i;
> + if (curr_waste == 0)
> + return i;
> +
> + /*
> + * Accept the new chain size if:
> + * 1. The current best is wasteful (> 10% of zspage size),
> + * accept anything that is better.
> + * 2. The current best is efficient, accept only significant
> + * (25%) improvement.
> + */
> + if (min_waste * 10 > best_chain_size * PAGE_SIZE) {
> + if (curr_waste < min_waste) {
> + min_waste = curr_waste;
> + best_chain_size = i;
> + }
> + } else {
> + if (curr_waste * 4 < min_waste * 3) {
> + min_waste = curr_waste;
> + best_chain_size = i;
> + }
> }
> +
> + /*
> + * If the current best chain has low waste (approx < 1.5%
> + * relative to zspage size) then accept it right away.
> + */
> + if (min_waste * 64 <= best_chain_size * PAGE_SIZE)
> + break;
> }
>
> - return chain_size;
> + return best_chain_size;
> }
>
> /**
> --
> 2.52.0.351.gbe84eed79e-goog
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-02 18:29 ` Yosry Ahmed
@ 2026-01-05 1:42 ` Sergey Senozhatsky
2026-01-05 7:23 ` Sergey Senozhatsky
2026-01-05 15:58 ` Yosry Ahmed
0 siblings, 2 replies; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-05 1:42 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Nhat Pham, Minchan Kim,
Johannes Weiner, Brian Geffon, linux-kernel, linux-mm
On (26/01/02 18:29), Yosry Ahmed wrote:
> On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
[..]
>
> I worry that the heuristics are too hand-wavy
I don't disagree. Am not super excited about the heuristics either.
> and I wonder if the memcpy savings actually show up as perf improvements
> in any real life workload. Do we have data about this?
I don't have real life 16K PAGE_SIZE devices. However, on 16K PAGE_SIZE
systems we have "normal" size-classes up to a very large size, and a normal
class means chaining of 0-order physical pages, and chaining means spanning.
So on 16K the memcpy overhead is expected to be somewhat noticeable.
> I also vaguely recall discussions about other ways to avoid the memcpy
> using scatterlists, so I am wondering if this is the right metric to
> optimize.
As far as I understand, the SG-list based approach will require
implementing split-data handling on the compression algorithms' side,
which is not trivial (especially if the only reason to do that is
zsmalloc).
Alternatively, maybe we can try to vmap spanning objects:
---
mm/zsmalloc.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 6fc216ab8190..4a68c27cb5d4 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -38,6 +38,7 @@
#include <linux/zsmalloc.h>
#include <linux/fs.h>
#include <linux/workqueue.h>
+#include <linux/vmalloc.h>
#include "zpdesc.h"
#define ZSPAGE_MAGIC 0x58
@@ -1097,19 +1098,15 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
addr = kmap_local_zpdesc(zpdesc);
addr += off;
} else {
- size_t sizes[2];
+ struct page *pages[2];
/* this object spans two pages */
- sizes[0] = PAGE_SIZE - off;
- sizes[1] = class->size - sizes[0];
- addr = local_copy;
-
- memcpy_from_page(addr, zpdesc_page(zpdesc),
- off, sizes[0]);
- zpdesc = get_next_zpdesc(zpdesc);
- memcpy_from_page(addr + sizes[0],
- zpdesc_page(zpdesc),
- 0, sizes[1]);
+ pages[0] = zpdesc_page(zpdesc);
+ pages[1] = zpdesc_page(get_next_zpdesc(zpdesc));
+ addr = vm_map_ram(pages, 2, NUMA_NO_NODE);
+ if (!addr)
+ return NULL;
+ addr += off;
}
if (!ZsHugePage(zspage))
@@ -1139,6 +1136,11 @@ void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
off += ZS_HANDLE_SIZE;
handle_mem -= off;
kunmap_local(handle_mem);
+ } else {
+ if (!ZsHugePage(zspage))
+ off += ZS_HANDLE_SIZE;
+ handle_mem -= off;
+ vm_unmap_ram(handle_mem, 2);
}
zspage_read_unlock(zspage);
--
2.52.0.351.gbe84eed79e-goog
> What are the main pain points for PAGE_SIZE > 4K configs? Is it the
> compression/decompression time? In my experience this is usually not the
> bottleneck, I would imagine the real problem would be the internal
> fragmentation.
Right, internal fragmentation can be the main problem.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-05 1:42 ` Sergey Senozhatsky
@ 2026-01-05 7:23 ` Sergey Senozhatsky
2026-01-05 16:01 ` Yosry Ahmed
2026-01-05 15:58 ` Yosry Ahmed
1 sibling, 1 reply; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-05 7:23 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Nhat Pham, Minchan Kim, Johannes Weiner,
Brian Geffon, linux-kernel, linux-mm, Sergey Senozhatsky
On (26/01/05 10:42), Sergey Senozhatsky wrote:
> On (26/01/02 18:29), Yosry Ahmed wrote:
> > On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> [..]
> >
> > I worry that the heuristics are too hand-wavy
>
> I don't disagree. Am not super excited about the heuristics either.
>
> > and I wonder if the memcpy savings actually show up as perf improvements
> > in any real life workload. Do we have data about this?
>
> I don't have real life 16K PAGE_SIZE devices. However, on 16K PAGE_SIZE
> systems we have "normal" size-classes up to a very large size, and normal
> class means chaining of 0-order physical pages, and chaining means spanning.
> So on 16K memcpy overhead is expected to be somewhat noticeable.
By the way, while looking at it, I think we need to "fix" obj_read_begin().
Currently, it uses "off + class->size" to detect spanning objects, which is
incorrect: size classes get merged, so a typical size class can hold a range
of sizes, using padding for smaller objects. So instead of class->size we
need to use the actual compressed object's size, in case the actual written
size was small enough to fit entirely within the first physical page (we
already do that in obj_write()). I'll cook a patch.
Something like this:
---
drivers/block/zram/zram_drv.c | 8 +++++---
include/linux/zsmalloc.h | 2 +-
mm/zsmalloc.c | 4 ++--
mm/zswap.c | 3 ++-
4 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a6587bed6a03..b371ba6bfec2 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2065,7 +2065,7 @@ static int read_incompressible_page(struct zram *zram, struct page *page,
void *src, *dst;
handle = get_slot_handle(zram, index);
- src = zs_obj_read_begin(zram->mem_pool, handle, NULL);
+ src = zs_obj_read_begin(zram->mem_pool, handle, PAGE_SIZE, NULL);
dst = kmap_local_page(page);
copy_page(dst, src);
kunmap_local(dst);
@@ -2087,7 +2087,8 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
prio = get_slot_comp_priority(zram, index);
zstrm = zcomp_stream_get(zram->comps[prio]);
- src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
+ src = zs_obj_read_begin(zram->mem_pool, handle, size,
+ zstrm->local_copy);
dst = kmap_local_page(page);
ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
kunmap_local(dst);
@@ -2114,7 +2115,8 @@ static int read_from_zspool_raw(struct zram *zram, struct page *page, u32 index)
* takes place here, as we read raw compressed data.
*/
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
- src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
+ src = zs_obj_read_begin(zram->mem_pool, handle, size,
+ zstrm->local_copy);
memcpy_to_page(page, 0, src, size);
zs_obj_read_end(zram->mem_pool, handle, src);
zcomp_stream_put(zstrm);
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index f3ccff2d966c..64f65c1f14d6 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -40,7 +40,7 @@ unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
- void *local_copy);
+ size_t mem_len, void *local_copy);
void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
void *handle_mem);
void zs_obj_write(struct zs_pool *pool, unsigned long handle,
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index be385609ef8a..2da60c23cd18 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1070,7 +1070,7 @@ unsigned long zs_get_total_pages(struct zs_pool *pool)
EXPORT_SYMBOL_GPL(zs_get_total_pages);
void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
- void *local_copy)
+ size_t mem_len, void *local_copy)
{
struct zspage *zspage;
struct zpdesc *zpdesc;
@@ -1092,7 +1092,7 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
class = zspage_class(pool, zspage);
off = offset_in_page(class->size * obj_idx);
- if (off + class->size <= PAGE_SIZE) {
+ if (off + mem_len <= PAGE_SIZE) {
/* this object is contained entirely within a page */
addr = kmap_local_zpdesc(zpdesc);
addr += off;
diff --git a/mm/zswap.c b/mm/zswap.c
index de8858ff1521..291352629616 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -937,7 +937,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
u8 *src, *obj;
acomp_ctx = acomp_ctx_get_cpu_lock(pool);
- obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
+ obj = zs_obj_read_begin(pool->zs_pool, entry->handle, entry->length,
+ acomp_ctx->buffer);
/* zswap entries of length PAGE_SIZE are not compressed. */
if (entry->length == PAGE_SIZE) {
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-05 1:42 ` Sergey Senozhatsky
2026-01-05 7:23 ` Sergey Senozhatsky
@ 2026-01-05 15:58 ` Yosry Ahmed
2026-01-06 4:20 ` Sergey Senozhatsky
1 sibling, 1 reply; 14+ messages in thread
From: Yosry Ahmed @ 2026-01-05 15:58 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Nhat Pham, Minchan Kim, Johannes Weiner,
Brian Geffon, linux-kernel, Herbert Xu, linux-mm
On Mon, Jan 05, 2026 at 10:42:51AM +0900, Sergey Senozhatsky wrote:
> On (26/01/02 18:29), Yosry Ahmed wrote:
> > On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> [..]
> >
> > I worry that the heuristics are too hand-wavy
>
> I don't disagree. Am not super excited about the heuristics either.
>
> > and I wonder if the memcpy savings actually show up as perf improvements
> > in any real life workload. Do we have data about this?
>
> I don't have real life 16K PAGE_SIZE devices. However, on 16K PAGE_SIZE
> systems we have "normal" size-classes up to a very large size, and normal
> class means chaining of 0-order physical pages, and chaining means spanning.
> So on 16K memcpy overhead is expected to be somewhat noticeable.
I don't disagree that it could be a problem; I am just against
optimizations without data. It makes it hard to modify these heuristics
later or remove them, since we don't really know what effect they had in
the first place.
We also don't know if the 0.5% increase in memory usage is actually
offset by CPU gains.
>
> > I also vaguely recall discussions about other ways to avoid the memcpy
> > using scatterlists, so I am wondering if this is the right metric to
> > optimize.
>
> As far as I understand SG-list based approach is that it will require
> implementing split-data handling on the compression algorithms side,
> which is not trivial (especially if the only reason to do that is
> zsmalloc).
I am not sure tbh, adding Herbert here. I remember looking at the code
in scomp_acomp_comp_decomp() at some point, and I think it will take
care of non-contiguous SG-lists. Not sure if that's the correct place to
look tho.
>
> Alternatively, we maybe can try to vmap spanning objects:
Using vmap makes sense in theory, but in practice (at least for zswap)
it doesn't help because SG lists do not support vmap addresses. Zswap
will actually treat them the same as highmem and copy them to a buffer
before putting them in an SG list, so we effectively just do the
memcpy() in zswap instead of zsmalloc.
>
> ---
> mm/zsmalloc.c | 24 +++++++++++++-----------
> 1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 6fc216ab8190..4a68c27cb5d4 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -38,6 +38,7 @@
> #include <linux/zsmalloc.h>
> #include <linux/fs.h>
> #include <linux/workqueue.h>
> +#include <linux/vmalloc.h>
> #include "zpdesc.h"
>
> #define ZSPAGE_MAGIC 0x58
> @@ -1097,19 +1098,15 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> addr = kmap_local_zpdesc(zpdesc);
> addr += off;
> } else {
> - size_t sizes[2];
> + struct page *pages[2];
>
> /* this object spans two pages */
> - sizes[0] = PAGE_SIZE - off;
> - sizes[1] = class->size - sizes[0];
> - addr = local_copy;
> -
> - memcpy_from_page(addr, zpdesc_page(zpdesc),
> - off, sizes[0]);
> - zpdesc = get_next_zpdesc(zpdesc);
> - memcpy_from_page(addr + sizes[0],
> - zpdesc_page(zpdesc),
> - 0, sizes[1]);
> + pages[0] = zpdesc_page(zpdesc);
> + pages[1] = zpdesc_page(get_next_zpdesc(zpdesc));
> + addr = vm_map_ram(pages, 2, NUMA_NO_NODE);
> + if (!addr)
> + return NULL;
> + addr += off;
> }
>
> if (!ZsHugePage(zspage))
> @@ -1139,6 +1136,11 @@ void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> off += ZS_HANDLE_SIZE;
> handle_mem -= off;
> kunmap_local(handle_mem);
> + } else {
> + if (!ZsHugePage(zspage))
> + off += ZS_HANDLE_SIZE;
> + handle_mem -= off;
> + vm_unmap_ram(handle_mem, 2);
> }
>
> zspage_read_unlock(zspage);
> --
> 2.52.0.351.gbe84eed79e-goog
>
>
> > What are the main pain points for PAGE_SIZE > 4K configs? Is it the
> > compression/decompression time? In my experience this is usually not the
> > bottleneck, I would imagine the real problem would be the internal
> > fragmentation.
>
> Right, internal fragmentation can be the main problem.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-05 7:23 ` Sergey Senozhatsky
@ 2026-01-05 16:01 ` Yosry Ahmed
2026-01-06 4:10 ` Sergey Senozhatsky
0 siblings, 1 reply; 14+ messages in thread
From: Yosry Ahmed @ 2026-01-05 16:01 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Nhat Pham, Minchan Kim, Johannes Weiner,
Brian Geffon, linux-kernel, linux-mm
On Mon, Jan 05, 2026 at 04:23:39PM +0900, Sergey Senozhatsky wrote:
> On (26/01/05 10:42), Sergey Senozhatsky wrote:
> > On (26/01/02 18:29), Yosry Ahmed wrote:
> > > On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> > [..]
> > >
> > > I worry that the heuristics are too hand-wavy
> >
> > I don't disagree. Am not super excited about the heuristics either.
> >
> > > and I wonder if the memcpy savings actually show up as perf improvements
> > > in any real life workload. Do we have data about this?
> >
> > I don't have real life 16K PAGE_SIZE devices. However, on 16K PAGE_SIZE
> > systems we have "normal" size-classes up to a very large size, and normal
> > class means chaining of 0-order physical pages, and chaining means spanning.
> > So on 16K memcpy overhead is expected to be somewhat noticeable.
>
> By the way, while looking at it, I think we need to "fix" obj_read_begin().
> Currently, it uses "off + class->size" to detect spanning objects, which is
> incorrect: size classes get merged, so a typical size class can hold a range
> of sizes, using padding for smaller objects. So instead of class->size we
> need to use the actual compressed objects size, just in case if actual written
> size was small enough to fit into the first physical page (we do that in
> obj_write()). I'll cook a patch.
We also need to handle zs_obj_read_end() to do the kunmap() call
correctly.
>
> Something like this:
>
> ---
>
> drivers/block/zram/zram_drv.c | 8 +++++---
> include/linux/zsmalloc.h | 2 +-
> mm/zsmalloc.c | 4 ++--
> mm/zswap.c | 3 ++-
> 4 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a6587bed6a03..b371ba6bfec2 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -2065,7 +2065,7 @@ static int read_incompressible_page(struct zram *zram, struct page *page,
> void *src, *dst;
>
> handle = get_slot_handle(zram, index);
> - src = zs_obj_read_begin(zram->mem_pool, handle, NULL);
> + src = zs_obj_read_begin(zram->mem_pool, handle, PAGE_SIZE, NULL);
> dst = kmap_local_page(page);
> copy_page(dst, src);
> kunmap_local(dst);
> @@ -2087,7 +2087,8 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
> prio = get_slot_comp_priority(zram, index);
>
> zstrm = zcomp_stream_get(zram->comps[prio]);
> - src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
> + src = zs_obj_read_begin(zram->mem_pool, handle, size,
> + zstrm->local_copy);
> dst = kmap_local_page(page);
> ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
> kunmap_local(dst);
> @@ -2114,7 +2115,8 @@ static int read_from_zspool_raw(struct zram *zram, struct page *page, u32 index)
> * takes place here, as we read raw compressed data.
> */
> zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
> - src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
> + src = zs_obj_read_begin(zram->mem_pool, handle, size,
> + zstrm->local_copy);
> memcpy_to_page(page, 0, src, size);
> zs_obj_read_end(zram->mem_pool, handle, src);
> zcomp_stream_put(zstrm);
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index f3ccff2d966c..64f65c1f14d6 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -40,7 +40,7 @@ unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
> void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
>
> void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> - void *local_copy);
> + size_t mem_len, void *local_copy);
> void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> void *handle_mem);
> void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index be385609ef8a..2da60c23cd18 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1070,7 +1070,7 @@ unsigned long zs_get_total_pages(struct zs_pool *pool)
> EXPORT_SYMBOL_GPL(zs_get_total_pages);
>
> void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> - void *local_copy)
> + size_t mem_len, void *local_copy)
> {
> struct zspage *zspage;
> struct zpdesc *zpdesc;
> @@ -1092,7 +1092,7 @@ void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> class = zspage_class(pool, zspage);
> off = offset_in_page(class->size * obj_idx);
>
> - if (off + class->size <= PAGE_SIZE) {
> + if (off + mem_len <= PAGE_SIZE) {
> /* this object is contained entirely within a page */
> addr = kmap_local_zpdesc(zpdesc);
> addr += off;
> diff --git a/mm/zswap.c b/mm/zswap.c
> index de8858ff1521..291352629616 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -937,7 +937,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> u8 *src, *obj;
>
> acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> - obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
> + obj = zs_obj_read_begin(pool->zs_pool, entry->handle, entry->length,
> + acomp_ctx->buffer);
>
> /* zswap entries of length PAGE_SIZE are not compressed. */
> if (entry->length == PAGE_SIZE) {
> --
> 2.52.0.351.gbe84eed79e-goog
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-05 16:01 ` Yosry Ahmed
@ 2026-01-06 4:10 ` Sergey Senozhatsky
0 siblings, 0 replies; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-06 4:10 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Nhat Pham, Minchan Kim,
Johannes Weiner, Brian Geffon, linux-kernel, linux-mm
On (26/01/05 16:01), Yosry Ahmed wrote:
> On Mon, Jan 05, 2026 at 04:23:39PM +0900, Sergey Senozhatsky wrote:
> > On (26/01/05 10:42), Sergey Senozhatsky wrote:
> > > On (26/01/02 18:29), Yosry Ahmed wrote:
> > > > On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> > > [..]
> > > >
> > > > I worry that the heuristics are too hand-wavy
> > >
> > > I don't disagree. Am not super excited about the heuristics either.
> > >
> > > > and I wonder if the memcpy savings actually show up as perf improvements
> > > > in any real life workload. Do we have data about this?
> > >
> > > I don't have real life 16K PAGE_SIZE devices. However, on 16K PAGE_SIZE
> > > systems we have "normal" size-classes up to a very large size, and normal
> > > class means chaining of 0-order physical pages, and chaining means spanning.
> > > So on 16K memcpy overhead is expected to be somewhat noticeable.
> >
> > By the way, while looking at it, I think we need to "fix" obj_read_begin().
> > Currently, it uses "off + class->size" to detect spanning objects, which is
> > incorrect: size classes get merged, so a typical size class can hold a range
> > of sizes, using padding for smaller objects. So instead of class->size we
> > need to use the actual compressed objects size, just in case if actual written
> > size was small enough to fit into the first physical page (we do that in
> > obj_write()). I'll cook a patch.
>
> We also need to handle zs_obj_read_end() to do the kunmap() call
> correctly.
Good catch, I realized that only after I started working on the patch.
We also need to account for inlined zs_handle.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-05 15:58 ` Yosry Ahmed
@ 2026-01-06 4:20 ` Sergey Senozhatsky
2026-01-06 4:22 ` Sergey Senozhatsky
2026-01-06 9:47 ` Sergey Senozhatsky
0 siblings, 2 replies; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-06 4:20 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Nhat Pham, Minchan Kim,
Johannes Weiner, Brian Geffon, linux-kernel, Herbert Xu,
linux-mm
On (26/01/05 15:58), Yosry Ahmed wrote:
> On Mon, Jan 05, 2026 at 10:42:51AM +0900, Sergey Senozhatsky wrote:
> > On (26/01/02 18:29), Yosry Ahmed wrote:
> > > On Thu, Jan 01, 2026 at 10:38:14AM +0900, Sergey Senozhatsky wrote:
> > [..]
> > >
> > > I worry that the heuristics are too hand-wavy
> >
> > I don't disagree. Am not super excited about the heuristics either.
> >
> > > and I wonder if the memcpy savings actually show up as perf improvements
> > > in any real life workload. Do we have data about this?
> >
> > I don't have real life 16K PAGE_SIZE devices. However, on 16K PAGE_SIZE
> > systems we have "normal" size-classes up to a very large size, and normal
> > class means chaining of 0-order physical pages, and chaining means spanning.
> > So on 16K memcpy overhead is expected to be somewhat noticeable.
>
> I don't disagree that it could be a problem, I am just against
> optimizations without data. It makes it hard to modify these heuristics
> later or remove them, since we don't really know what effect they had in
> the first place.
>
> We also don't know if the 0.5% increase in memory usage is actually
> offset by CPU gains.
Sure, we are on the same page here.
Another area where we could potentially apply similar heuristics
is the size-classes merge logic: the sheer fact that two size-classes
have a similar number of objects per zspage and pages per zspage does
not necessarily mean that merging them will be beneficial. E.g. the
padding between class->size and the smallest possible object in that
class (when multiplied by the number of objects per zspage) can become
a large enough wasted space.
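Purely as an illustration (all parameters and the threshold below are
made up, this is not an actual proposal for the code):
	#include <stdio.h>
	#include <stdbool.h>
	#define PAGE_SIZE 4096
	/* Hypothetical predicate: reject a merge when per-object padding,
	 * summed over a zspage, exceeds ~3% of the zspage size.
	 */
	static bool merge_is_acceptable(int large_class_size, int small_class_size,
					int objs_per_zspage, int pages_per_zspage)
	{
		int padding_waste = (large_class_size - small_class_size) * objs_per_zspage;
		return padding_waste * 32 <= pages_per_zspage * PAGE_SIZE;
	}
	int main(void)
	{
		/* reuses the 1072 vs 1120 example from patch 1 */
		printf("merge 1072 into 1120: %s\n",
		       merge_is_acceptable(1120, 1072, 117, 8) ? "ok" : "too wasteful");
		return 0;
	}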
But again, heuristics are hard. I'm fine with us dropping that idea
for the time being.
> > > I also vaguely recall discussions about other ways to avoid the memcpy
> > > using scatterlists, so I am wondering if this is the right metric to
> > > optimize.
> >
> > As far as I understand SG-list based approach is that it will require
> > implementing split-data handling on the compression algorithms side,
> > which is not trivial (especially if the only reason to do that is
> > zsmalloc).
>
> I am not sure tbh, adding Herbert here. I remember looking at the code
> in scomp_acomp_comp_decomp() at some point, and I think it will take
> care of non-contiguous SG-lists. Not sure if that's the correct place to
> look tho.
Ah, so it does kmap under the hood. I suppose that can work.
> > Alternatively, we maybe can try to vmap spanning objects:
>
> Using vmap makes sense in theory, but in practice (at least for zswap)
> it doesn't help
OK.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-06 4:20 ` Sergey Senozhatsky
@ 2026-01-06 4:22 ` Sergey Senozhatsky
2026-01-06 5:08 ` Herbert Xu
2026-01-06 9:47 ` Sergey Senozhatsky
1 sibling, 1 reply; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-06 4:22 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Andrew Morton, Nhat Pham, Minchan Kim,
Johannes Weiner, Brian Geffon, linux-kernel, Herbert Xu,
linux-mm
On (26/01/06 13:20), Sergey Senozhatsky wrote:
[..]
> > I am not sure tbh, adding Herbert here. I remember looking at the code
> > in scomp_acomp_comp_decomp() at some point, and I think it will take
> > care of non-contiguous SG-lists. Not sure if that's the correct place to
> > look tho.
>
> Ah, so it does kmap under the hood. I suppose that can work.
I'm hallucinating, sorry. Yeah, let's hear from Herbert what the
direction is here.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-06 4:22 ` Sergey Senozhatsky
@ 2026-01-06 5:08 ` Herbert Xu
2026-01-06 16:24 ` Yosry Ahmed
0 siblings, 1 reply; 14+ messages in thread
From: Herbert Xu @ 2026-01-06 5:08 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Andrew Morton, Nhat Pham, Minchan Kim,
Johannes Weiner, Brian Geffon, linux-kernel, linux-mm
On Tue, Jan 06, 2026 at 01:22:45PM +0900, Sergey Senozhatsky wrote:
> On (26/01/06 13:20), Sergey Senozhatsky wrote:
> [..]
> > > I am not sure tbh, adding Herbert here. I remember looking at the code
> > > in scomp_acomp_comp_decomp() at some point, and I think it will take
> > > care of non-contiguous SG-lists. Not sure if that's the correct place to
> > > look tho.
> >
> > Ah, so it does kmap under the hood. I suppose that can work.
>
> I'm hallucinating, sorry. Yeah, let's hear from Herbert what's
> the direction here.
I have not implemented the underlying SG support yet because
there are no users in the kernel as of now. But if this is
useful for you then we can certainly do this, at least for
LZO which is fairly simple.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-06 4:20 ` Sergey Senozhatsky
2026-01-06 4:22 ` Sergey Senozhatsky
@ 2026-01-06 9:47 ` Sergey Senozhatsky
1 sibling, 0 replies; 14+ messages in thread
From: Sergey Senozhatsky @ 2026-01-06 9:47 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Andrew Morton, Nhat Pham, Minchan Kim,
Johannes Weiner, Brian Geffon, linux-kernel, Herbert Xu,
linux-mm
On (26/01/06 13:20), Sergey Senozhatsky wrote:
> Another area where we potentially could apply similar heuristics
> is size-calsses merge logic: sheer fact that two size-classes have
> similar objects per zspage and pages per zspage does not necessarily
> mean that merging them will be beneficial.
That's nonsense.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
2026-01-06 5:08 ` Herbert Xu
@ 2026-01-06 16:24 ` Yosry Ahmed
0 siblings, 0 replies; 14+ messages in thread
From: Yosry Ahmed @ 2026-01-06 16:24 UTC (permalink / raw)
To: Herbert Xu
Cc: Sergey Senozhatsky, Andrew Morton, Nhat Pham, Minchan Kim,
Johannes Weiner, Brian Geffon, linux-kernel, linux-mm
On Tue, Jan 06, 2026 at 01:08:09PM +0800, Herbert Xu wrote:
> On Tue, Jan 06, 2026 at 01:22:45PM +0900, Sergey Senozhatsky wrote:
> > On (26/01/06 13:20), Sergey Senozhatsky wrote:
> > [..]
> > > > I am not sure tbh, adding Herbert here. I remember looking at the code
> > > > in scomp_acomp_comp_decomp() at some point, and I think it will take
> > > > care of non-contiguous SG-lists. Not sure if that's the correct place to
> > > > look tho.
> > >
> > > Ah, so it does kmap under the hood. I suppose that can work.
> >
> > I'm hallucinating, sorry. Yeah, let's hear from Herbert what's
> > the direction here.
>
> I have not implemented the underlying SG support yet because
> there are no users in the kernel as of now. But if this is
> useful for you then we can certainly do this, at least for
> LZO which is fairly simple.
Just to clarify, IIUC the SG support would mean that zram or zswap can
pass a non-contiguous SG-list to the crypto API, regardless of
compressor support. I assume that the crypto layer will either pass the
SG-list as-is to the compressor if it supports it, or copy it into
scratch space to be contiguous if needed.
So zswap, for example, will get an SG list from zsmalloc and pass it
directly to the crypto API for decompression. Then the effort to add
support to compressors can be done separately.
Did I get this right?
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-01-06 16:24 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-01 1:38 [RFC PATCH 0/2] zsmalloc: size-classes chain-length tunings Sergey Senozhatsky
2026-01-01 1:38 ` [RFC PATCH 1/2] zsmalloc: drop hard limit on the number of size classes Sergey Senozhatsky
2026-01-01 1:38 ` [RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics Sergey Senozhatsky
2026-01-02 18:29 ` Yosry Ahmed
2026-01-05 1:42 ` Sergey Senozhatsky
2026-01-05 7:23 ` Sergey Senozhatsky
2026-01-05 16:01 ` Yosry Ahmed
2026-01-06 4:10 ` Sergey Senozhatsky
2026-01-05 15:58 ` Yosry Ahmed
2026-01-06 4:20 ` Sergey Senozhatsky
2026-01-06 4:22 ` Sergey Senozhatsky
2026-01-06 5:08 ` Herbert Xu
2026-01-06 16:24 ` Yosry Ahmed
2026-01-06 9:47 ` Sergey Senozhatsky