[RFC PATCH 1/2] zsmalloc: drop hard limit on the number of size classes
From: Sergey Senozhatsky @ 2026-01-01 1:38 UTC
To: Andrew Morton, Yosry Ahmed, Nhat Pham
Cc: Minchan Kim, Johannes Weiner, Brian Geffon, linux-kernel,
linux-mm, Sergey Senozhatsky
For reasons unknown, zsmalloc limits the number of size classes
to 256. On 4K PAGE_SIZE systems this works reasonably well, as those
256 classes are 4096/256 = 16 bytes apart (this spacing is known as
the size-class delta). However, as PAGE_SIZE grows, e.g. to 16K, the
hard limit pushes the size-class delta significantly higher (e.g.
16384/256 = 64), leading to increased internal fragmentation. For
example, on a 16K page system an object of 65 bytes is rounded up
to the next 64-byte boundary (128 bytes), wasting nearly 50% of the
allocated space.
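To illustrate the rounding arithmetic, here is a minimal user-space
sketch (an approximation, not the kernel code: it assumes class
boundaries sit at exact multiples of the delta and ignores
ZS_MIN_ALLOC_SIZE alignment details):

#include <stdio.h>

/* Round a request size up to its size-class boundary. */
static unsigned int class_size(unsigned int size, unsigned int delta)
{
	return ((size + delta - 1) / delta) * delta;
}

int main(void)
{
	unsigned int deltas[] = { 64, 16 };
	unsigned int size = 65;

	for (int i = 0; i < 2; i++) {
		unsigned int cs = class_size(size, deltas[i]);

		/* delta 64: 65 -> 128, waste 63 bytes (~49%)
		 * delta 16: 65 ->  80, waste 15 bytes (~19%)
		 */
		printf("delta %2u: %u -> %u, waste %u bytes\n",
		       deltas[i], size, cs, cs - size);
	}
	return 0;
}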
Instead of deriving the size-class delta from PAGE_SIZE and the
hard limit of 256, set ZS_SIZE_CLASS_DELTA to a constant value of
16 bytes. This results in far more than 256 size classes on systems
with PAGE_SIZE larger than 4K. The extra size classes split existing
clusters of merged classes into smaller ones. For example, using the
simulation tool [1] on a 16K PAGE_SIZE system with chain size 8:
BASE (delta 64 bytes)
=====================
 Log | Phys | Chain | Objs/Page | TailWaste | MergeWaste
[..]
1072 | 1120 |     8 |       117 |        32 |       5616
1088 | 1120 |     8 |       117 |        32 |       3744
1104 | 1120 |     8 |       117 |        32 |       1872
1120 | 1120 |     8 |       117 |        32 |          0
[..]

PATCHED (delta 16 bytes)
========================
 Log | Phys | Chain | Objs/Page | TailWaste | MergeWaste
[..]
1072 | 1072 |     4 |        61 |       144 |          0
1088 | 1088 |     1 |        15 |        64 |          0
1104 | 1104 |     6 |        89 |        48 |          0
1120 | 1120 |     8 |       117 |        32 |          0
[..]
In the default configuration (delta 64) size classes 1072 to 1104
are merged into 1120. Size class 1120 holds 117 objects per zspage,
so in the worst case every zspage can lose up to 5616 bytes
((1120 - 1072) * 117). With delta 16 this cluster doesn't exist,
which eliminates that merge waste.
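The table columns above can be reproduced with a few lines of
arithmetic; a simplified sketch of what the simulation tool [1]
computes (16K pages and the BASE delta-64 cluster assumed):

#include <stdio.h>

#define MY_PAGE_SIZE	16384	/* the 16K system from the example */

int main(void)
{
	/* Logical sizes 1072..1104 are merged into physical class
	 * 1120, which uses a chain of 8 pages per zspage.
	 */
	unsigned int phys = 1120, chain = 8;
	unsigned int zspage = chain * MY_PAGE_SIZE;
	unsigned int objs = zspage / phys;		/* 117 */
	unsigned int tail = zspage - objs * phys;	/* 32  */

	for (unsigned int log = 1072; log <= 1120; log += 16)
		printf("%4u | %4u | %u | %3u | %2u | %4u\n",
		       log, phys, chain, objs, tail,
		       (phys - log) * objs);	/* merge waste */
	return 0;
}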
[1] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/simulate_zsmalloc.c
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5bf832f9c05c..5e7501d36161 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -92,7 +92,7 @@
#define HUGE_BITS 1
#define FULLNESS_BITS 4
-#define CLASS_BITS 8
+#define CLASS_BITS 12
#define MAGIC_VAL_BITS 8
#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(CONFIG_ZSMALLOC_CHAIN_SIZE, UL))
@@ -115,8 +115,13 @@
*
* ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
* (reason above)
+ *
+ * We set ZS_SIZE_CLASS_DELTA to 16 bytes to maintain high granularity
+ * even on systems with a large PAGE_SIZE (e.g. 16K, 64K). This reduces
+ * internal fragmentation. CLASS_BITS is increased to 12 to accommodate
+ * the larger number of size classes (up to 4096 classes on 64K).
*/
-#define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> CLASS_BITS)
+#define ZS_SIZE_CLASS_DELTA 16
#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
ZS_SIZE_CLASS_DELTA) + 1)
--
2.52.0.351.gbe84eed79e-goog
[RFC PATCH 2/2] zsmalloc: chain-length configuration should consider other metrics
From: Sergey Senozhatsky @ 2026-01-01 1:38 UTC
To: Andrew Morton, Yosry Ahmed, Nhat Pham
Cc: Minchan Kim, Johannes Weiner, Brian Geffon, linux-kernel,
linux-mm, Sergey Senozhatsky
This is the first step towards re-thinking the optimization strategy
used during chain-size configuration (the chain size is the number
of 0-order physical pages that a zspage chains together for best
performance). Currently we consider only one metric, "wasted" memory,
and try various chain lengths in order to find the configuration with
the minimal wasted space. However, this strategy ignores the fact
that our optimization space is not one-dimensional. When we increase
the zspage chain length we also increase the number of spanning
objects (objects that span two physical pages). Such objects slow
down read() operations, because zsmalloc needs to kmap both pages
and memcpy the object's chunks, which increases CPU usage and
battery drain.
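A minimal user-space sketch of the spanning-object effect (a model,
not the kernel code: it assumes objects are packed back to back from
offset 0, so an object spans iff it crosses a page boundary):

#include <stdio.h>

#define MY_PAGE_SIZE	4096

/* Count objects that straddle a page boundary in one zspage. */
static unsigned int spanning_objects(unsigned int class_size,
				     unsigned int chain)
{
	unsigned int zspage = chain * MY_PAGE_SIZE;
	unsigned int nr = 0;

	for (unsigned int off = 0; off + class_size <= zspage;
	     off += class_size) {
		unsigned int first = off / MY_PAGE_SIZE;
		unsigned int last = (off + class_size - 1) / MY_PAGE_SIZE;

		if (first != last)
			nr++;
	}
	return nr;
}

int main(void)
{
	/* e.g. size class 368: a one-page zspage has no spanning
	 * objects, an 8-page chain has one per page boundary.
	 */
	printf("chain 1: %u spanning\n", spanning_objects(368, 1));
	printf("chain 8: %u spanning\n", spanning_objects(368, 8));
	return 0;
}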
We most likely need to consider numerous metrics and optimize in a
multi-dimensional space. These can be wired in later on; for now we
add a heuristic that increases the zspage chain length only if there
are substantial memory savings. The threshold values can be tuned
(there is a simple user-space tool [2] to experiment with those
knobs), but what we currently have is already interesting enough.
Where does this bring us? Using a synthetic test [1], which produces
byte-to-byte comparable workloads, on a 4K PAGE_SIZE, chain size 10
system:
BASE
====
zsmalloc_test: num write objects: 339598
zsmalloc_test: pool pages used 175111, total allocated size 698213488
zsmalloc_test: pool memory utilization: 97.3
zsmalloc_test: num read objects: 339598
zsmalloc_test: spanning objects: 110377, total memcpy size: 278318624
PATCHED
=======
zsmalloc_test: num write objects: 339598
zsmalloc_test: pool pages used 175920, total allocated size 698213488
zsmalloc_test: pool memory utilization: 96.8
zsmalloc_test: num read objects: 339598
zsmalloc_test: spanning objects: 103256, total memcpy size: 265378608
At the price of a 0.5% increase in pool memory usage we get a 6.5%
reduction in the number of spanning objects (4.6% fewer copied
bytes). Note that the results are specific to this particular test
case. The savings are not uniformly distributed: according to [2],
for some size classes the number of spanning objects per zspage
drops from 7 to 0 (e.g. size class 368), for others from 4 to 2
(e.g. size class 640). So the actual memcpy savings are data-pattern
dependent, as always.
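To see how the thresholds play out for the size classes mentioned
above, here is a user-space sketch of the heuristic (it mirrors the
patched calculate_zspage_chain_size() below; 4K pages and chain size
10 are assumed, and min_waste is seeded with the page size since the
waste is always smaller than the class size):

#include <stdio.h>

#define MY_PAGE_SIZE	4096
#define MAX_CHAIN	10

static int pick_chain(int class_size)
{
	int min_waste = MY_PAGE_SIZE;
	int best = 1;

	for (int i = 1; i <= MAX_CHAIN; i++) {
		int waste = (i * MY_PAGE_SIZE) % class_size;

		if (!waste)
			return i;

		if (min_waste * 10 > best * MY_PAGE_SIZE) {
			/* Current best is wasteful (>10% of zspage):
			 * take any improvement.
			 */
			if (waste < min_waste) {
				min_waste = waste;
				best = i;
			}
		} else if (waste * 4 < min_waste * 3) {
			/* Current best is efficient: require a 25%
			 * improvement.
			 */
			min_waste = waste;
			best = i;
		}

		/* Stop once waste is below ~1.5% of the zspage size. */
		if (min_waste * 64 <= best * MY_PAGE_SIZE)
			break;
	}
	return best;
}

int main(void)
{
	/* Class 368: a minimal-waste search picks chain 8 (7 spanning
	 * objects per zspage), the heuristic picks chain 1 (none).
	 * Class 640: minimal-waste picks chain 5 (4 spanning), the
	 * heuristic picks chain 3 (2 spanning).
	 */
	printf("class 368 -> chain %d\n", pick_chain(368));
	printf("class 640 -> chain %d\n", pick_chain(640));
	return 0;
}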
[1] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/0001-zsmalloc-add-zsmalloc_test-module.patch
[2] https://github.com/sergey-senozhatsky/simulate-zsmalloc/blob/main/simulate_zsmalloc.c
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 39 +++++++++++++++++++++++++++++++--------
1 file changed, 31 insertions(+), 8 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5e7501d36161..929db7cf6c19 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -2000,22 +2000,45 @@ static int zs_register_shrinker(struct zs_pool *pool)
static int calculate_zspage_chain_size(int class_size)
{
int i, min_waste = INT_MAX;
- int chain_size = 1;
+ int best_chain_size = 1;
if (is_power_of_2(class_size))
- return chain_size;
+ return best_chain_size;
for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
- int waste;
+ int curr_waste = (i * PAGE_SIZE) % class_size;
- waste = (i * PAGE_SIZE) % class_size;
- if (waste < min_waste) {
- min_waste = waste;
- chain_size = i;
+ if (curr_waste == 0)
+ return i;
+
+ /*
+ * Accept the new chain size if:
+ * 1. The current best is wasteful (> 10% of zspage size),
+ * accept anything that is better.
+ * 2. The current best is efficient, accept only significant
+ * (25%) improvement.
+ */
+ if (min_waste * 10 > best_chain_size * PAGE_SIZE) {
+ if (curr_waste < min_waste) {
+ min_waste = curr_waste;
+ best_chain_size = i;
+ }
+ } else {
+ if (curr_waste * 4 < min_waste * 3) {
+ min_waste = curr_waste;
+ best_chain_size = i;
+ }
}
+
+ /*
+ * If the current best chain has low waste (approx < 1.5%
+ * relative to zspage size) then accept it right away.
+ */
+ if (min_waste * 64 <= best_chain_size * PAGE_SIZE)
+ break;
}
- return chain_size;
+ return best_chain_size;
}
/**
--
2.52.0.351.gbe84eed79e-goog