* [PATCH v5 00/18] zsmalloc/zram: there be preemption
@ 2025-02-12 6:26 Sergey Senozhatsky
2025-02-12 6:26 ` [PATCH v5 01/18] zram: sleepable entry locking Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Currently zram runs compression and decompression in non-preemptible
sections, e.g.
zcomp_stream_get() // grabs CPU local lock
zcomp_compress()
or
zram_slot_lock() // grabs entry spin-lock
zcomp_stream_get() // grabs CPU local lock
zs_map_object() // grabs rwlock and CPU local lock
zcomp_decompress()
This is potentially troublesome for a number of reasons.
For instance, this makes it impossible to use async and/or H/W
compression algorithms, which may need to wait for operation completion
or resource availability. It also restricts what compression algorithms
can do internally; for example, zstd can allocate internal state memory
for C/D dictionaries:
do_fsync()
do_writepages()
zram_bio_write()
zram_write_page() // become non-preemptible
zcomp_compress()
zstd_compress()
ZSTD_compress_usingCDict()
ZSTD_compressBegin_usingCDict_internal()
ZSTD_resetCCtx_usingCDict()
ZSTD_resetCCtx_internal()
zstd_custom_alloc() // memory allocation
Not to mention that the system can be configured to maximize the
compression ratio at the cost of CPU/HW time (e.g. lz4hc or deflate
with a very high compression level), so zram can stay in a
non-preemptible section (even under spin-lock and/or rwlock) for an
extended period of time. Aside from compression algorithms, this also
restricts what zram itself can do. One particular example is zsmalloc
handle allocation in zram_write_page(), which has an optimistic
allocation path (with direct reclaim disallowed) and a pessimistic
fallback path that forces zram to compress the page one more time.
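For reference, the current two-stage pattern in zram_write_page() looks
roughly like this (a condensed, illustrative sketch; see patch 05,
which removes it, for the real code):

    compress_again:
        zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
        /* ... compress the page into zstrm->buffer ... */

        /* optimistic path: no direct reclaim, we cannot sleep here */
        if (IS_ERR_VALUE(handle))
                handle = zs_malloc(zram->mem_pool, comp_len,
                                   __GFP_KSWAPD_RECLAIM | __GFP_NOWARN |
                                   __GFP_HIGHMEM | __GFP_MOVABLE);
        if (IS_ERR_VALUE(handle)) {
                /*
                 * pessimistic path: drop the stream, allow direct
                 * reclaim, then re-do the compression
                 */
                zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
                handle = zs_malloc(zram->mem_pool, comp_len,
                                   GFP_NOIO | __GFP_HIGHMEM |
                                   __GFP_MOVABLE);
                if (IS_ERR_VALUE(handle))
                        return PTR_ERR((void *)handle);
                goto compress_again;
        }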
This series changes zram so that it no longer directly imposes
atomicity restrictions on compression algorithms (or on itself), which
makes zram write() fully preemptible; zram read(), sadly, is not always
preemptible yet. There are still indirect atomicity restrictions
imposed by zsmalloc. One notable example is the object mapping API,
which returns with:
a) the local CPU lock held
b) the zspage rwlock held
First, zsmalloc is converted to a sleepable RW-"lock" (an atomic_t,
in fact) for zspage migration protection. Second, a new handle mapping
API is introduced which doesn't use per-CPU buffers (and hence no local
CPU lock) and does fewer memcpy() calls, but requires users to provide
a pointer to a temporary buffer for object copy-in (when needed).
Third, zram is converted to the new zsmalloc mapping API, and thus
zram read() becomes preemptible.
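For a rough feel of the intended usage, the new read-side mapping looks
approximately like this (an illustrative sketch; the function names
zs_obj_read_begin()/zs_obj_read_end() and the buffer handling are
placeholders, see the zsmalloc patches later in the series for the
actual API):

    void *obj;

    /*
     * local_copy is a temporary buffer the caller provides; zsmalloc
     * only uses it when the object spans two physical pages
     */
    obj = zs_obj_read_begin(pool, handle, local_copy);
    ret = zcomp_decompress(comp, zstrm, obj, obj_len, dst);
    zs_obj_read_end(pool, handle, obj);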
v4 -> v5:
- switched to preemptible per-CPU comp streams (Yosry)
- switched to preemptible bit-locks for zram entry locking (Andrew)
- added lockdep annotations to new zsmalloc/zram locks (Hillf, Yosry)
- perf measurements
- reworked re-compression loop (a bunch of minor fixes)
- fixed potential physical page leaks on writeback/recompression error
paths
- documented new locking rules
Sergey Senozhatsky (18):
zram: sleepable entry locking
zram: permit preemption with active compression stream
zram: remove crypto include
zram: remove max_comp_streams device attr
zram: remove two-staged handle allocation
zram: remove writestall zram_stats member
zram: limit max recompress prio to num_active_comps
zram: filter out recomp targets based on priority
zram: rework recompression loop
zsmalloc: factor out pool locking helpers
zsmalloc: factor out size-class locking helpers
zsmalloc: make zspage lock preemptible
zsmalloc: introduce new object mapping API
zram: switch to new zsmalloc object mapping API
zram: permit reclaim in zstd custom allocator
zram: do not leak page on recompress_store error path
zram: do not leak page on writeback_store error path
zram: add might_sleep to zcomp API
Documentation/ABI/testing/sysfs-block-zram | 8 -
Documentation/admin-guide/blockdev/zram.rst | 36 +-
drivers/block/zram/backend_zstd.c | 11 +-
drivers/block/zram/zcomp.c | 43 +-
drivers/block/zram/zcomp.h | 8 +-
drivers/block/zram/zram_drv.c | 286 +++++++------
drivers/block/zram/zram_drv.h | 22 +-
include/linux/zsmalloc.h | 8 +
mm/zsmalloc.c | 420 +++++++++++++++-----
9 files changed, 536 insertions(+), 306 deletions(-)
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 01/18] zram: sleepable entry locking
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
@ 2025-02-12 6:26 ` Sergey Senozhatsky
2025-02-13 0:08 ` Andrew Morton
2025-02-12 6:27 ` [PATCH v5 02/18] zram: permit preemption with active compression stream Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Concurrent modifications of meta table entries are currently handled
by a per-entry spin-lock. This has a number of shortcomings.
First, it imposes atomicity requirements on compression backends.
zram can call both zcomp_compress() and zcomp_decompress() under the
entry spin-lock, which implies that we can use only compression
algorithms that don't schedule/sleep/wait during compression and
decompression. This, for instance, makes it impossible to use some
of the ASYNC compression algorithm implementations (H/W compression,
etc.).
Second, this can potentially trigger watchdogs. For example,
entry re-compression with secondary algorithms is performed
under the entry spin-lock. Given that we chain secondary
compression algorithms and that some of them can be configured
for the best compression ratio (and worst compression speed),
zram can stay under the spin-lock for quite some time.
Having a per-entry mutex (or, for instance, a rw-semaphore)
would significantly increase the sizeof() of each entry and hence
of the meta table. Therefore entry locking returns to bit locking,
as before, but this time in a preempt-rt friendly way, because it
waits on the bit instead of spinning on it. Lock owners are also
now permitted to schedule, which is a first step on the path to
making zram non-atomic.
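As a usage-level illustration (a sketch, not part of the patch), the
new lock permits patterns like:

    /*
     * the lock owner is now allowed to block, e.g. on a mutex or on
     * a memory allocation, while the entry stays locked
     */
    zram_slot_lock(zram, index);    /* may sleep waiting for the bit */
    ret = zram_read_from_zspool(zram, page, index);
    zram_slot_unlock(zram, index);

    /* atomic context (e.g. slot-free notification) must use try-lock */
    if (!zram_slot_try_lock(zram, index)) {
            atomic64_inc(&zram->stats.miss_free);
            return;
    }
    zram_free_page(zram, index);
    zram_slot_unlock(zram, index);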
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 65 ++++++++++++++++++++++++++++-------
drivers/block/zram/zram_drv.h | 20 +++++++----
2 files changed, 67 insertions(+), 18 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9f5020b077c5..3708436f1d1f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -58,19 +58,57 @@ static void zram_free_page(struct zram *zram, size_t index);
static int zram_read_from_zspool(struct zram *zram, struct page *page,
u32 index);
-static int zram_slot_trylock(struct zram *zram, u32 index)
+static void zram_slot_lock_init(struct zram *zram, u32 index)
{
- return spin_trylock(&zram->table[index].lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
+ &zram->table_lockdep_key, 0);
+#endif
+}
+
+/*
+ * entry locking rules:
+ *
+ * 1) Lock is exclusive
+ *
+ * 2) lock() function can sleep waiting for the lock
+ *
+ * 3) Lock owner can sleep
+ *
+ * 4) Use TRY lock variant when in atomic context
+ * - must check return value and handle locking failures
+ */
+static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
+{
+ unsigned long *lock = &zram->table[index].flags;
+
+ if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ mutex_acquire(&zram->table[index].lockdep_map, 0, 1, _RET_IP_);
+#endif
+ return true;
+ }
+ return false;
}
static void zram_slot_lock(struct zram *zram, u32 index)
{
- spin_lock(&zram->table[index].lock);
+ unsigned long *lock = &zram->table[index].flags;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+#endif
+ wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
}
static void zram_slot_unlock(struct zram *zram, u32 index)
{
- spin_unlock(&zram->table[index].lock);
+ unsigned long *lock = &zram->table[index].flags;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
+#endif
+ clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
}
static inline bool init_done(struct zram *zram)
@@ -93,7 +131,6 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
zram->table[index].handle = handle;
}
-/* flag operations require table entry bit_spin_lock() being held */
static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
@@ -1473,15 +1510,11 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
huge_class_size = zs_huge_class_size(zram->mem_pool);
for (index = 0; index < num_pages; index++)
- spin_lock_init(&zram->table[index].lock);
+ zram_slot_lock_init(zram, index);
+
return true;
}
-/*
- * To protect concurrent access to the same index entry,
- * caller should hold this table index entry's bit_spinlock to
- * indicate this index entry is accessing.
- */
static void zram_free_page(struct zram *zram, size_t index)
{
unsigned long handle;
@@ -2321,7 +2354,7 @@ static void zram_slot_free_notify(struct block_device *bdev,
zram = bdev->bd_disk->private_data;
atomic64_inc(&zram->stats.notify_free);
- if (!zram_slot_trylock(zram, index)) {
+ if (!zram_slot_try_lock(zram, index)) {
atomic64_inc(&zram->stats.miss_free);
return;
}
@@ -2625,6 +2658,10 @@ static int zram_add(void)
if (ret)
goto out_cleanup_disk;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ lockdep_register_key(&zram->table_lockdep_key);
+#endif
+
zram_debugfs_register(zram);
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
@@ -2681,6 +2718,10 @@ static int zram_remove(struct zram *zram)
*/
zram_reset_device(zram);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ lockdep_unregister_key(&zram->table_lockdep_key);
+#endif
+
put_disk(zram->disk);
kfree(zram);
return 0;
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index db78d7c01b9a..63b933059cb6 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -28,7 +28,6 @@
#define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
-
/*
* ZRAM is mainly used for memory efficiency so we want to keep memory
* footprint small and thus squeeze size and zram pageflags into a flags
@@ -46,6 +45,7 @@
/* Flags for zram pages (table[page_no].flags) */
enum zram_pageflags {
ZRAM_SAME = ZRAM_FLAG_SHIFT, /* Page consists the same element */
+ ZRAM_ENTRY_LOCK, /* entry access lock bit */
ZRAM_WB, /* page is stored on backing_device */
ZRAM_PP_SLOT, /* Selected for post-processing */
ZRAM_HUGE, /* Incompressible page */
@@ -58,13 +58,18 @@ enum zram_pageflags {
__NR_ZRAM_PAGEFLAGS,
};
-/*-- Data structures */
-
-/* Allocated for each disk page */
+/*
+ * Allocated for each disk page. We use bit-lock (ZRAM_ENTRY_LOCK bit
+ * of flags) to save memory. There can be plenty of entries and standard
+ * locking primitives (e.g. mutex) will significantly increase sizeof()
+ * of each entry and hence of the meta table.
+ */
struct zram_table_entry {
unsigned long handle;
- unsigned int flags;
- spinlock_t lock;
+ unsigned long flags;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map lockdep_map;
+#endif
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
ktime_t ac_time;
#endif
@@ -137,5 +142,8 @@ struct zram {
struct dentry *debugfs_dir;
#endif
atomic_t pp_in_progress;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lock_class_key table_lockdep_key;
+#endif
};
#endif
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 02/18] zram: permit preemption with active compression stream
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:26 ` [PATCH v5 01/18] zram: sleepable entry locking Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 16:01 ` Yosry Ahmed
2025-02-12 6:27 ` [PATCH v5 03/18] zram: remove crypto include Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Currently, per-CPU stream access is done from a non-preemptible
(atomic) section, which imposes the same atomicity requirements on
compression backends as the entry spin-lock does, and makes it
impossible to use algorithms that can schedule/wait/sleep during
compression and decompression.
Switch to a preemptible per-CPU model, similar to the one used
in zswap. Instead of a per-CPU local lock, each stream carries
a mutex which is held for the entire time zram uses the stream
for compression or decompression, so that the CPU-dead event waits
for zram to stop using a particular per-CPU stream and release it.
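For illustration, the resulting usage pattern (condensed from
zram_write_page()) becomes:

    zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
    /* the stream ->lock mutex is held from this point on, may sleep */
    mem = kmap_local_page(page);
    ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
                         mem, &comp_len);
    kunmap_local(mem);
    ...
    /* note that put() now takes the stream, not the backend */
    zcomp_stream_put(zstrm);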
Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 36 +++++++++++++++++++++++++----------
drivers/block/zram/zcomp.h | 6 +++---
drivers/block/zram/zram_drv.c | 20 +++++++++----------
3 files changed, 39 insertions(+), 23 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index bb514403e305..e83dd9a80a81 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -7,6 +7,7 @@
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/cpu.h>
+#include <linux/cpuhotplug.h>
#include <linux/crypto.h>
#include <linux/vmalloc.h>
@@ -54,6 +55,7 @@ static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
{
int ret;
+ mutex_init(&zstrm->lock);
ret = comp->ops->create_ctx(comp->params, &zstrm->ctx);
if (ret)
return ret;
@@ -109,13 +111,29 @@ ssize_t zcomp_available_show(const char *comp, char *buf)
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
{
- local_lock(&comp->stream->lock);
- return this_cpu_ptr(comp->stream);
+ for (;;) {
+ struct zcomp_strm *zstrm = raw_cpu_ptr(comp->stream);
+
+ /*
+ * Inspired by zswap
+ *
+ * stream is returned with ->mutex locked which prevents
+ * cpu_dead() from releasing this stream under us, however
+ * there is still a race window between raw_cpu_ptr() and
+ * mutex_lock(), during which we could have been migrated
+ * to a CPU that has already destroyed its stream. If so
+ * then unlock and re-try on the current CPU.
+ */
+ mutex_lock(&zstrm->lock);
+ if (likely(zstrm->buffer))
+ return zstrm;
+ mutex_unlock(&zstrm->lock);
+ }
}
-void zcomp_stream_put(struct zcomp *comp)
+void zcomp_stream_put(struct zcomp_strm *zstrm)
{
- local_unlock(&comp->stream->lock);
+ mutex_unlock(&zstrm->lock);
}
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
@@ -151,12 +169,9 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
{
struct zcomp *comp = hlist_entry(node, struct zcomp, node);
- struct zcomp_strm *zstrm;
+ struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
int ret;
- zstrm = per_cpu_ptr(comp->stream, cpu);
- local_lock_init(&zstrm->lock);
-
ret = zcomp_strm_init(comp, zstrm);
if (ret)
pr_err("Can't allocate a compression stream\n");
@@ -166,10 +181,11 @@ int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
{
struct zcomp *comp = hlist_entry(node, struct zcomp, node);
- struct zcomp_strm *zstrm;
+ struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
- zstrm = per_cpu_ptr(comp->stream, cpu);
+ mutex_lock(&zstrm->lock);
zcomp_strm_free(comp, zstrm);
+ mutex_unlock(&zstrm->lock);
return 0;
}
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index ad5762813842..23b8236b9090 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -3,7 +3,7 @@
#ifndef _ZCOMP_H_
#define _ZCOMP_H_
-#include <linux/local_lock.h>
+#include <linux/mutex.h>
#define ZCOMP_PARAM_NO_LEVEL INT_MIN
@@ -31,7 +31,7 @@ struct zcomp_ctx {
};
struct zcomp_strm {
- local_lock_t lock;
+ struct mutex lock;
/* compression buffer */
void *buffer;
struct zcomp_ctx ctx;
@@ -77,7 +77,7 @@ struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params);
void zcomp_destroy(struct zcomp *comp);
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
-void zcomp_stream_put(struct zcomp *comp);
+void zcomp_stream_put(struct zcomp_strm *zstrm);
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
const void *src, unsigned int *dst_len);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 3708436f1d1f..43f460a45e3e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1608,7 +1608,7 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
kunmap_local(dst);
zs_unmap_object(zram->mem_pool, handle);
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zstrm);
return ret;
}
@@ -1769,14 +1769,14 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
kunmap_local(mem);
if (unlikely(ret)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zstrm);
pr_err("Compression failed! err=%d\n", ret);
zs_free(zram->mem_pool, handle);
return ret;
}
if (comp_len >= huge_class_size) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zstrm);
return write_incompressible_page(zram, page, index);
}
@@ -1800,7 +1800,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zstrm);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
@@ -1812,7 +1812,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
}
if (!zram_can_store_page(zram)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zstrm);
zs_free(zram->mem_pool, handle);
return -ENOMEM;
}
@@ -1820,7 +1820,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len);
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zstrm);
zs_unmap_object(zram->mem_pool, handle);
zram_slot_lock(zram, index);
@@ -1979,7 +1979,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
kunmap_local(src);
if (ret) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zstrm);
return ret;
}
@@ -1989,7 +1989,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
/* Continue until we make progress */
if (class_index_new >= class_index_old ||
(threshold && comp_len_new >= threshold)) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zstrm);
continue;
}
@@ -2047,13 +2047,13 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle_new)) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zstrm);
return PTR_ERR((void *)handle_new);
}
dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len_new);
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zstrm);
zs_unmap_object(zram->mem_pool, handle_new);
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 03/18] zram: remove crypto include
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:26 ` [PATCH v5 01/18] zram: sleepable entry locking Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 02/18] zram: permit preemption with active compression stream Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 16:13 ` Yosry Ahmed
2025-02-12 6:27 ` [PATCH v5 04/18] zram: remove max_comp_streams device attr Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Remove a leftover crypto header include and switch the algorithm name
length check from CRYPTO_MAX_ALG_NAME to a zram-local
ZRAM_MAX_ALGO_NAME_SZ limit.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 1 -
drivers/block/zram/zram_drv.c | 4 +++-
drivers/block/zram/zram_drv.h | 1 -
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index e83dd9a80a81..c393243eeb5c 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -8,7 +8,6 @@
#include <linux/sched.h>
#include <linux/cpu.h>
#include <linux/cpuhotplug.h>
-#include <linux/crypto.h>
#include <linux/vmalloc.h>
#include "zcomp.h"
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 43f460a45e3e..12fb260e3355 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+#define ZRAM_MAX_ALGO_NAME_SZ 64
+
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
@@ -1149,7 +1151,7 @@ static int __comp_algorithm_store(struct zram *zram, u32 prio, const char *buf)
size_t sz;
sz = strlen(buf);
- if (sz >= CRYPTO_MAX_ALG_NAME)
+ if (sz >= ZRAM_MAX_ALGO_NAME_SZ)
return -E2BIG;
compressor = kstrdup(buf, GFP_KERNEL);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 63b933059cb6..97c98fa07954 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -17,7 +17,6 @@
#include <linux/rwsem.h>
#include <linux/zsmalloc.h>
-#include <linux/crypto.h>
#include "zcomp.h"
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 04/18] zram: remove max_comp_streams device attr
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 03/18] zram: remove crypto include Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 05/18] zram: remove two-staged handle allocation Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
The max_comp_streams device attribute has been defunct since
May 2016, when zram switched to per-CPU compression streams;
remove it.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
Documentation/ABI/testing/sysfs-block-zram | 8 -----
Documentation/admin-guide/blockdev/zram.rst | 36 ++++++---------------
drivers/block/zram/zram_drv.c | 23 -------------
3 files changed, 10 insertions(+), 57 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 1ef69e0271f9..36c57de0a10a 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -22,14 +22,6 @@ Description:
device. The reset operation frees all the memory associated
with this device.
-What: /sys/block/zram<id>/max_comp_streams
-Date: February 2014
-Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
-Description:
- The max_comp_streams file is read-write and specifies the
- number of backend's zcomp_strm compression streams (number of
- concurrent compress operations).
-
What: /sys/block/zram<id>/comp_algorithm
Date: February 2014
Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 1576fb93f06c..9bdb30901a93 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -54,7 +54,7 @@ The list of possible return codes:
If you use 'echo', the returned value is set by the 'echo' utility,
and, in general case, something like::
- echo 3 > /sys/block/zram0/max_comp_streams
+ echo foo > /sys/block/zram0/comp_algorithm
if [ $? -ne 0 ]; then
handle_error
fi
@@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3}
num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1.
-2) Set max number of compression streams
-========================================
-
-Regardless of the value passed to this attribute, ZRAM will always
-allocate multiple compression streams - one per online CPU - thus
-allowing several concurrent compression operations. The number of
-allocated compression streams goes down when some of the CPUs
-become offline. There is no single-compression-stream mode anymore,
-unless you are running a UP system or have only 1 CPU online.
-
-To find out how many streams are currently available::
-
- cat /sys/block/zram0/max_comp_streams
-
-3) Select compression algorithm
+2) Select compression algorithm
===============================
Using comp_algorithm device attribute one can see available and
@@ -107,7 +93,7 @@ Examples::
For the time being, the `comp_algorithm` content shows only compression
algorithms that are supported by zram.
-4) Set compression algorithm parameters: Optional
+3) Set compression algorithm parameters: Optional
=================================================
Compression algorithms may support specific parameters which can be
@@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some
algorithms), for other algorithms `level` is acceleration level (the higher
the value the lower the compression ratio).
-5) Set Disksize
+4) Set Disksize
===============
Set disk size by writing the value to sysfs node 'disksize'.
@@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
-6) Set memory limit: Optional
+5) Set memory limit: Optional
=============================
Set memory limit by writing the value to sysfs node 'mem_limit'.
@@ -177,7 +163,7 @@ Examples::
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
-7) Activate
+6) Activate
===========
::
@@ -188,7 +174,7 @@ Examples::
mkfs.ext4 /dev/zram1
mount /dev/zram1 /tmp
-8) Add/remove zram devices
+7) Add/remove zram devices
==========================
zram provides a control interface, which enables dynamic (on-demand) device
@@ -208,7 +194,7 @@ execute::
echo X > /sys/class/zram-control/hot_remove
-9) Stats
+8) Stats
========
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
@@ -228,8 +214,6 @@ mem_limit WO specifies the maximum amount of memory ZRAM can
writeback_limit WO specifies the maximum amount of write IO zram
can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature
-max_comp_streams RW the number of possible concurrent compress
- operations
comp_algorithm RW show and change the compression algorithm
algorithm_params WO setup compression algorithm parameters
compact WO trigger memory compaction
@@ -310,7 +294,7 @@ a single line of text and contains the following stats separated by whitespace:
Unit: 4K bytes
============== =============================================================
-10) Deactivate
+9) Deactivate
==============
::
@@ -318,7 +302,7 @@ a single line of text and contains the following stats separated by whitespace:
swapoff /dev/zram0
umount /dev/zram1
-11) Reset
+10) Reset
=========
Write any positive value to 'reset' sysfs node::
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 12fb260e3355..e0e64b2610d6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1104,27 +1104,6 @@ static void zram_debugfs_register(struct zram *zram) {};
static void zram_debugfs_unregister(struct zram *zram) {};
#endif
-/*
- * We switched to per-cpu streams and this attr is not needed anymore.
- * However, we will keep it around for some time, because:
- * a) we may revert per-cpu streams in the future
- * b) it's visible to user space and we need to follow our 2 years
- * retirement rule; but we already have a number of 'soon to be
- * altered' attrs, so max_comp_streams need to wait for the next
- * layoff cycle.
- */
-static ssize_t max_comp_streams_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- return scnprintf(buf, PAGE_SIZE, "%d\n", num_online_cpus());
-}
-
-static ssize_t max_comp_streams_store(struct device *dev,
- struct device_attribute *attr, const char *buf, size_t len)
-{
- return len;
-}
-
static void comp_algorithm_set(struct zram *zram, u32 prio, const char *alg)
{
/* Do not free statically defined compression algorithms */
@@ -2541,7 +2520,6 @@ static DEVICE_ATTR_WO(reset);
static DEVICE_ATTR_WO(mem_limit);
static DEVICE_ATTR_WO(mem_used_max);
static DEVICE_ATTR_WO(idle);
-static DEVICE_ATTR_RW(max_comp_streams);
static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RW(backing_dev);
@@ -2563,7 +2541,6 @@ static struct attribute *zram_disk_attrs[] = {
&dev_attr_mem_limit.attr,
&dev_attr_mem_used_max.attr,
&dev_attr_idle.attr,
- &dev_attr_max_comp_streams.attr,
&dev_attr_comp_algorithm.attr,
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_backing_dev.attr,
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 05/18] zram: remove two-staged handle allocation
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 04/18] zram: remove max_comp_streams device attr Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 06/18] zram: remove writestall zram_stats member Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Previously zram write() was atomic, which required us to pass
__GFP_KSWAPD_RECLAIM to the zsmalloc handle allocation on a fast
path and attempt a slow-path allocation (with re-compression)
when the fast path failed.
Since write() is not atomic anymore, we can permit direct reclaim
during allocation, remove the fast allocation path and also drop
the re-compression path (which should reduce CPU/battery usage).
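The write path then boils down to a single allocation attempt (a
sketch; the actual change is in the diff below):

    handle = zs_malloc(zram->mem_pool, comp_len,
                       GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
    if (IS_ERR_VALUE(handle)) {
            zcomp_stream_put(zstrm);
            return PTR_ERR((void *)handle);
    }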
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 38 ++++++-----------------------------
1 file changed, 6 insertions(+), 32 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e0e64b2610d6..6384c61c03bf 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1724,11 +1724,11 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
static int zram_write_page(struct zram *zram, struct page *page, u32 index)
{
int ret = 0;
- unsigned long handle = -ENOMEM;
- unsigned int comp_len = 0;
+ unsigned long handle;
+ unsigned int comp_len;
void *dst, *mem;
struct zcomp_strm *zstrm;
- unsigned long element = 0;
+ unsigned long element;
bool same_filled;
/* First, free memory allocated to this slot (if any) */
@@ -1742,7 +1742,6 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
if (same_filled)
return write_same_filled_page(zram, element, index);
-compress_again:
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
mem = kmap_local_page(page);
ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
@@ -1752,7 +1751,6 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
if (unlikely(ret)) {
zcomp_stream_put(zstrm);
pr_err("Compression failed! err=%d\n", ret);
- zs_free(zram->mem_pool, handle);
return ret;
}
@@ -1761,35 +1759,11 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
return write_incompressible_page(zram, page, index);
}
- /*
- * handle allocation has 2 paths:
- * a) fast path is executed with preemption disabled (for
- * per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
- * since we can't sleep;
- * b) slow path enables preemption and attempts to allocate
- * the page with __GFP_DIRECT_RECLAIM bit set. we have to
- * put per-cpu compression stream and, thus, to re-do
- * the compression once handle is allocated.
- *
- * if we have a 'non-null' handle here then we are coming
- * from the slow path and handle has already been allocated.
- */
- if (IS_ERR_VALUE(handle))
- handle = zs_malloc(zram->mem_pool, comp_len,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
+ handle = zs_malloc(zram->mem_pool, comp_len,
+ GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
if (IS_ERR_VALUE(handle)) {
zcomp_stream_put(zstrm);
- atomic64_inc(&zram->stats.writestall);
- handle = zs_malloc(zram->mem_pool, comp_len,
- GFP_NOIO | __GFP_HIGHMEM |
- __GFP_MOVABLE);
- if (IS_ERR_VALUE(handle))
- return PTR_ERR((void *)handle);
-
- goto compress_again;
+ return PTR_ERR((void *)handle);
}
if (!zram_can_store_page(zram)) {
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 06/18] zram: remove writestall zram_stats member
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 05/18] zram: remove two-staged handle allocation Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 07/18] zram: limit max recompress prio to num_active_comps Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
There is no zsmalloc handle allocation slow path anymore, so a
writestall is no longer possible. Remove the counter from
zram_stats.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 3 +--
drivers/block/zram/zram_drv.h | 1 -
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6384c61c03bf..7e2694079760 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1438,9 +1438,8 @@ static ssize_t debug_stat_show(struct device *dev,
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
- "version: %d\n%8llu %8llu\n",
+ "version: %d\n0 %8llu\n",
version,
- (u64)atomic64_read(&zram->stats.writestall),
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 97c98fa07954..b9528a62521e 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -84,7 +84,6 @@ struct zram_stats {
atomic64_t huge_pages_since; /* no. of huge pages since zram set up */
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
- atomic64_t writestall; /* no. of write slow paths */
atomic64_t miss_free; /* no. of missed free */
#ifdef CONFIG_ZRAM_WRITEBACK
atomic64_t bd_count; /* no. of pages in backing device */
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 07/18] zram: limit max recompress prio to num_active_comps
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 06/18] zram: remove writestall zram_stats member Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 08/18] zram: filter out recomp targets based on priority Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Use the actual number of algorithms zram was configured with
instead of the theoretical limit of ZRAM_MAX_COMPS.
Also make sure that the min prio is not above the max prio.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7e2694079760..27148f3e5ae9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2026,16 +2026,19 @@ static ssize_t recompress_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
- u32 prio = ZRAM_SECONDARY_COMP, prio_max = ZRAM_MAX_COMPS;
struct zram *zram = dev_to_zram(dev);
char *args, *param, *val, *algo = NULL;
u64 num_recomp_pages = ULLONG_MAX;
struct zram_pp_ctl *ctl = NULL;
struct zram_pp_slot *pps;
u32 mode = 0, threshold = 0;
+ u32 prio, prio_max;
struct page *page;
ssize_t ret;
+ prio = ZRAM_SECONDARY_COMP;
+ prio_max = zram->num_active_comps;
+
args = skip_spaces(buf);
while (*args) {
args = next_arg(args, ¶m, &val);
@@ -2088,7 +2091,7 @@ static ssize_t recompress_store(struct device *dev,
if (prio == ZRAM_PRIMARY_COMP)
prio = ZRAM_SECONDARY_COMP;
- prio_max = min(prio + 1, ZRAM_MAX_COMPS);
+ prio_max = prio + 1;
continue;
}
}
@@ -2116,7 +2119,7 @@ static ssize_t recompress_store(struct device *dev,
continue;
if (!strcmp(zram->comp_algs[prio], algo)) {
- prio_max = min(prio + 1, ZRAM_MAX_COMPS);
+ prio_max = prio + 1;
found = true;
break;
}
@@ -2128,6 +2131,12 @@ static ssize_t recompress_store(struct device *dev,
}
}
+ prio_max = min(prio_max, (u32)zram->num_active_comps);
+ if (prio >= prio_max) {
+ ret = -EINVAL;
+ goto release_init_lock;
+ }
+
page = alloc_page(GFP_KERNEL);
if (!page) {
ret = -ENOMEM;
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 08/18] zram: filter out recomp targets based on priority
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 07/18] zram: limit max recompress prio to num_active_comps Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 09/18] zram: rework recompression loop Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Do not select for post-processing slots that are already
compressed with the same or a higher priority compression
algorithm.
This should save some memory, as previously we would still
put those entries into the corresponding post-processing
buckets and filter them out later in recompress_slot().
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 27148f3e5ae9..31bdf5e0ff74 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1822,7 +1822,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
#define RECOMPRESS_IDLE (1 << 0)
#define RECOMPRESS_HUGE (1 << 1)
-static int scan_slots_for_recompress(struct zram *zram, u32 mode,
+static int scan_slots_for_recompress(struct zram *zram, u32 mode, u32 prio_max,
struct zram_pp_ctl *ctl)
{
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
@@ -1854,6 +1854,10 @@ static int scan_slots_for_recompress(struct zram *zram, u32 mode,
zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
goto next;
+ /* Already compressed with same or higher priority */
+ if (zram_get_priority(zram, index) + 1 >= prio_max)
+ goto next;
+
pps->index = index;
place_pp_slot(zram, ctl, pps);
pps = NULL;
@@ -1910,6 +1914,16 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
zram_clear_flag(zram, index, ZRAM_IDLE);
class_index_old = zs_lookup_class_index(zram->mem_pool, comp_len_old);
+
+ prio = max(prio, zram_get_priority(zram, index) + 1);
+ /*
+ * Recompression slots scan should not select slots that are
+ * already compressed with a higher priority algorithm, but
+ * just in case
+ */
+ if (prio >= prio_max)
+ return 0;
+
/*
* Iterate the secondary comp algorithms list (in order of priority)
* and try to recompress the page.
@@ -1918,13 +1932,6 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (!zram->comps[prio])
continue;
- /*
- * Skip if the object is already re-compressed with a higher
- * priority algorithm (or same algorithm).
- */
- if (prio <= zram_get_priority(zram, index))
- continue;
-
num_recomps++;
zstrm = zcomp_stream_get(zram->comps[prio]);
src = kmap_local_page(page);
@@ -2149,7 +2156,7 @@ static ssize_t recompress_store(struct device *dev,
goto release_init_lock;
}
- scan_slots_for_recompress(zram, mode, ctl);
+ scan_slots_for_recompress(zram, mode, prio_max, ctl);
ret = len;
while ((pps = select_pp_slot(ctl))) {
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 09/18] zram: rework recompression loop
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 08/18] zram: filter out recomp targets based on priority Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 10/18] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
This reworks recompression loop handling:
- set a rule that stream-put NULLs the stream pointer
  If the loop returns with a non-NULL stream then it's a
  successful recompression, otherwise the stream should
  always be NULL.
- do not count the number of recompressions
  Mark the object as incompressible as soon as the algorithm
  with the highest priority fails to compress that object.
- count compression errors as resource usage
  Even if compression has failed, we still need to bump the
  num_recomp_pages counter.
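Schematically, the reworked loop follows this convention (a condensed
sketch; 'no_progress' is a placeholder for the class-index/threshold
checks in the real code):

    for (; prio < prio_max; prio++) {
            if (!zram->comps[prio])
                    continue;

            zstrm = zcomp_stream_get(zram->comps[prio]);
            ret = zcomp_compress(zram->comps[prio], zstrm,
                                 src, &comp_len_new);
            if (ret) {
                    zcomp_stream_put(zstrm);
                    zstrm = NULL;   /* error: no stream is held */
                    break;
            }
            if (no_progress) {
                    zcomp_stream_put(zstrm);
                    zstrm = NULL;   /* try the next priority */
                    continue;
            }
            break;                  /* success: zstrm stays non-NULL */
    }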
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 53 +++++++++++++----------------------
1 file changed, 19 insertions(+), 34 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 31bdf5e0ff74..7c4c296181a8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1887,9 +1887,8 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
unsigned int comp_len_new;
unsigned int class_index_old;
unsigned int class_index_new;
- u32 num_recomps = 0;
void *src, *dst;
- int ret;
+ int ret = 0;
handle_old = zram_get_handle(zram, index);
if (!handle_old)
@@ -1932,7 +1931,6 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (!zram->comps[prio])
continue;
- num_recomps++;
zstrm = zcomp_stream_get(zram->comps[prio]);
src = kmap_local_page(page);
ret = zcomp_compress(zram->comps[prio], zstrm,
@@ -1941,7 +1939,8 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (ret) {
zcomp_stream_put(zstrm);
- return ret;
+ zstrm = NULL;
+ break;
}
class_index_new = zs_lookup_class_index(zram->mem_pool,
@@ -1951,6 +1950,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (class_index_new >= class_index_old ||
(threshold && comp_len_new >= threshold)) {
zcomp_stream_put(zstrm);
+ zstrm = NULL;
continue;
}
@@ -1958,14 +1958,6 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
break;
}
- /*
- * We did not try to recompress, e.g. when we have only one
- * secondary algorithm and the page is already recompressed
- * using that algorithm
- */
- if (!zstrm)
- return 0;
-
/*
* Decrement the limit (if set) on pages we can recompress, even
* when current recompression was unsuccessful or did not compress
@@ -1975,38 +1967,31 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (*num_recomp_pages)
*num_recomp_pages -= 1;
- if (class_index_new >= class_index_old) {
+ /* Compression error */
+ if (ret)
+ return ret;
+
+ if (!zstrm) {
/*
* Secondary algorithms failed to re-compress the page
- * in a way that would save memory, mark the object as
- * incompressible so that we will not try to compress
- * it again.
+ * in a way that would save memory.
*
- * We need to make sure that all secondary algorithms have
- * failed, so we test if the number of recompressions matches
- * the number of active secondary algorithms.
+ * Mark the object incompressible if the max-priority
+ * algorithm couldn't re-compress it.
*/
- if (num_recomps == zram->num_active_comps - 1)
- zram_set_flag(zram, index, ZRAM_INCOMPRESSIBLE);
+ if (prio < zram->num_active_comps)
+ return 0;
+ zram_set_flag(zram, index, ZRAM_INCOMPRESSIBLE);
return 0;
}
- /* Successful recompression but above threshold */
- if (threshold && comp_len_new >= threshold)
- return 0;
-
/*
- * No direct reclaim (slow path) for handle allocation and no
- * re-compression attempt (unlike in zram_write_bvec()) since
- * we already have stored that object in zsmalloc. If we cannot
- * alloc memory for recompressed object then we bail out and
- * simply keep the old (existing) object in zsmalloc.
+ * We are holding per-CPU stream mutex and entry lock so better
+ * avoid direct reclaim. Allocation error is not fatal since
+ * we still have the old object in the mem_pool.
*/
handle_new = zs_malloc(zram->mem_pool, comp_len_new,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
+ GFP_NOWAIT | __GFP_HIGHMEM | __GFP_MOVABLE);
if (IS_ERR_VALUE(handle_new)) {
zcomp_stream_put(zstrm);
return PTR_ERR((void *)handle_new);
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 10/18] zsmalloc: factor out pool locking helpers
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 09/18] zram: rework recompression loop Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 16:18 ` Yosry Ahmed
2025-02-12 6:27 ` [PATCH v5 11/18] zsmalloc: factor out size-class " Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
We currently have a mix of migrate_{read,write}_lock() helpers
that lock zspages, while zs_pool's ->migrate_lock is open-coded
at every call site. Factor out pool migrate locking into helpers;
the zspage migration locking API will be renamed later to reduce
confusion.
It's worth mentioning that zsmalloc locks synchronize not only
migration, but also compaction.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 63 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 44 insertions(+), 19 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 6d0e47f7ae33..47c638df47c5 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -18,7 +18,7 @@
/*
* lock ordering:
* page_lock
- * pool->migrate_lock
+ * pool->lock
* class->lock
* zspage->lock
*/
@@ -224,10 +224,35 @@ struct zs_pool {
struct work_struct free_work;
#endif
/* protect page/zspage migration */
- rwlock_t migrate_lock;
+ rwlock_t lock;
atomic_t compaction_in_progress;
};
+static void pool_write_unlock(struct zs_pool *pool)
+{
+ write_unlock(&pool->lock);
+}
+
+static void pool_write_lock(struct zs_pool *pool)
+{
+ write_lock(&pool->lock);
+}
+
+static void pool_read_unlock(struct zs_pool *pool)
+{
+ read_unlock(&pool->lock);
+}
+
+static void pool_read_lock(struct zs_pool *pool)
+{
+ read_lock(&pool->lock);
+}
+
+static bool pool_lock_is_contended(struct zs_pool *pool)
+{
+ return rwlock_is_contended(&pool->lock);
+}
+
static inline void zpdesc_set_first(struct zpdesc *zpdesc)
{
SetPagePrivate(zpdesc_page(zpdesc));
@@ -1206,7 +1231,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
BUG_ON(in_interrupt());
/* It guarantees it can get zspage from handle safely */
- read_lock(&pool->migrate_lock);
+ pool_read_lock(pool);
obj = handle_to_obj(handle);
obj_to_location(obj, &zpdesc, &obj_idx);
zspage = get_zspage(zpdesc);
@@ -1218,7 +1243,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
* which is smaller granularity.
*/
migrate_read_lock(zspage);
- read_unlock(&pool->migrate_lock);
+ pool_read_unlock(pool);
class = zspage_class(pool, zspage);
off = offset_in_page(class->size * obj_idx);
@@ -1450,16 +1475,16 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
return;
/*
- * The pool->migrate_lock protects the race with zpage's migration
+ * The pool->lock protects the race with zpage's migration
* so it's safe to get the page from handle.
*/
- read_lock(&pool->migrate_lock);
+ pool_read_lock(pool);
obj = handle_to_obj(handle);
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
spin_lock(&class->lock);
- read_unlock(&pool->migrate_lock);
+ pool_read_unlock(pool);
class_stat_sub(class, ZS_OBJS_INUSE, 1);
obj_free(class->size, obj);
@@ -1793,10 +1818,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
pool = zspage->pool;
/*
- * The pool migrate_lock protects the race between zpage migration
+ * The pool lock protects the race between zpage migration
* and zs_free.
*/
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
class = zspage_class(pool, zspage);
/*
@@ -1833,7 +1858,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* Since we complete the data copy and set up new zspage structure,
* it's okay to release migration_lock.
*/
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
spin_unlock(&class->lock);
migrate_write_unlock(zspage);
@@ -1956,7 +1981,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* protect the race between zpage migration and zs_free
* as well as zpage allocation/free
*/
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
spin_lock(&class->lock);
while (zs_can_compact(class)) {
int fg;
@@ -1983,14 +2008,14 @@ static unsigned long __zs_compact(struct zs_pool *pool,
src_zspage = NULL;
if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
- || rwlock_is_contended(&pool->migrate_lock)) {
+ || pool_lock_is_contended(pool)) {
putback_zspage(class, dst_zspage);
dst_zspage = NULL;
spin_unlock(&class->lock);
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
cond_resched();
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
spin_lock(&class->lock);
}
}
@@ -2002,7 +2027,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
spin_unlock(&class->lock);
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
return pages_freed;
}
@@ -2014,10 +2039,10 @@ unsigned long zs_compact(struct zs_pool *pool)
unsigned long pages_freed = 0;
/*
- * Pool compaction is performed under pool->migrate_lock so it is basically
+ * Pool compaction is performed under pool->lock so it is basically
* single-threaded. Having more than one thread in __zs_compact()
- * will increase pool->migrate_lock contention, which will impact other
- * zsmalloc operations that need pool->migrate_lock.
+ * will increase pool->lock contention, which will impact other
+ * zsmalloc operations that need pool->lock.
*/
if (atomic_xchg(&pool->compaction_in_progress, 1))
return 0;
@@ -2139,7 +2164,7 @@ struct zs_pool *zs_create_pool(const char *name)
return NULL;
init_deferred_free(pool);
- rwlock_init(&pool->migrate_lock);
+ rwlock_init(&pool->lock);
atomic_set(&pool->compaction_in_progress, 0);
pool->name = kstrdup(name, GFP_KERNEL);
--
2.48.1.502.g6dc24dfdaf-goog
* [PATCH v5 11/18] zsmalloc: factor out size-class locking helpers
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 10/18] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 12/18] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Move open-coded size-class locking to dedicated helpers.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
mm/zsmalloc.c | 47 ++++++++++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 47c638df47c5..c82c24b8e6a4 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -253,6 +253,16 @@ static bool pool_lock_is_contended(struct zs_pool *pool)
return rwlock_is_contended(&pool->lock);
}
+static void size_class_lock(struct size_class *class)
+{
+ spin_lock(&class->lock);
+}
+
+static void size_class_unlock(struct size_class *class)
+{
+ spin_unlock(&class->lock);
+}
+
static inline void zpdesc_set_first(struct zpdesc *zpdesc)
{
SetPagePrivate(zpdesc_page(zpdesc));
@@ -613,8 +623,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
if (class->index != i)
continue;
- spin_lock(&class->lock);
-
+ size_class_lock(class);
seq_printf(s, " %5u %5u ", i, class->size);
for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
inuse_totals[fg] += class_stat_read(class, fg);
@@ -624,7 +633,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
obj_allocated = class_stat_read(class, ZS_OBJS_ALLOCATED);
obj_used = class_stat_read(class, ZS_OBJS_INUSE);
freeable = zs_can_compact(class);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
objs_per_zspage = class->objs_per_zspage;
pages_used = obj_allocated / objs_per_zspage *
@@ -1399,7 +1408,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
class = pool->size_class[get_size_class_index(size)];
/* class->lock effectively protects the zpage migration */
- spin_lock(&class->lock);
+ size_class_lock(class);
zspage = find_get_zspage(class);
if (likely(zspage)) {
obj_malloc(pool, zspage, handle);
@@ -1410,7 +1419,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
goto out;
}
- spin_unlock(&class->lock);
+ size_class_unlock(class);
zspage = alloc_zspage(pool, class, gfp);
if (!zspage) {
@@ -1418,7 +1427,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
return (unsigned long)ERR_PTR(-ENOMEM);
}
- spin_lock(&class->lock);
+ size_class_lock(class);
obj_malloc(pool, zspage, handle);
newfg = get_fullness_group(class, zspage);
insert_zspage(class, zspage, newfg);
@@ -1429,7 +1438,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
/* We completely set up zspage so mark them as movable */
SetZsPageMovable(pool, zspage);
out:
- spin_unlock(&class->lock);
+ size_class_unlock(class);
return handle;
}
@@ -1483,7 +1492,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
- spin_lock(&class->lock);
+ size_class_lock(class);
pool_read_unlock(pool);
class_stat_sub(class, ZS_OBJS_INUSE, 1);
@@ -1493,7 +1502,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
if (fullness == ZS_INUSE_RATIO_0)
free_zspage(pool, class, zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
cache_free_handle(pool, handle);
}
EXPORT_SYMBOL_GPL(zs_free);
@@ -1827,7 +1836,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
/*
* the class lock protects zpage alloc/free in the zspage.
*/
- spin_lock(&class->lock);
+ size_class_lock(class);
/* the migrate_write_lock protects zpage access via zs_map_object */
migrate_write_lock(zspage);
@@ -1859,7 +1868,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* it's okay to release migration_lock.
*/
pool_write_unlock(pool);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
migrate_write_unlock(zspage);
zpdesc_get(newzpdesc);
@@ -1903,10 +1912,10 @@ static void async_free_zspage(struct work_struct *work)
if (class->index != i)
continue;
- spin_lock(&class->lock);
+ size_class_lock(class);
list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0],
&free_pages);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
}
list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
@@ -1914,10 +1923,10 @@ static void async_free_zspage(struct work_struct *work)
lock_zspage(zspage);
class = zspage_class(pool, zspage);
- spin_lock(&class->lock);
+ size_class_lock(class);
class_stat_sub(class, ZS_INUSE_RATIO_0, 1);
__free_zspage(pool, class, zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
}
};
@@ -1982,7 +1991,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* as well as zpage allocation/free
*/
pool_write_lock(pool);
- spin_lock(&class->lock);
+ size_class_lock(class);
while (zs_can_compact(class)) {
int fg;
@@ -2012,11 +2021,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
dst_zspage = NULL;
- spin_unlock(&class->lock);
+ size_class_unlock(class);
pool_write_unlock(pool);
cond_resched();
pool_write_lock(pool);
- spin_lock(&class->lock);
+ size_class_lock(class);
}
}
@@ -2026,7 +2035,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
if (dst_zspage)
putback_zspage(class, dst_zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
pool_write_unlock(pool);
return pages_freed;
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (10 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 11/18] zsmalloc: factor out size-class " Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 17:14 ` Yosry Ahmed
2025-02-13 11:32 ` Hillf Danton
2025-02-12 6:27 ` [PATCH v5 13/18] zsmalloc: introduce new object mapping API Sergey Senozhatsky
` (6 subsequent siblings)
18 siblings, 2 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Switch over from rwlock_t to an atomic_t variable that takes a negative
value when the page is under migration, or positive values when the
page is used by zsmalloc users (object map, etc.). Using an rwsem per
zspage is a little too memory heavy; a simple atomic_t should suffice.
The zspage lock is a leaf lock for zs_map_object(), where it is
read-acquired. Since this lock now permits preemption, extra care needs
to be taken when it is write-acquired - all writers grab it in atomic
context, so they cannot spin and wait for a (potentially preempted)
reader to unlock the zspage. There are only two writers at this moment -
migration and compaction. In both cases we use the write-try-lock and
bail out if the zspage is read-locked. Writers, on the other hand,
never get preempted, so readers can spin waiting for the writer to
unlock the zspage.
With this we can implement preemptible object mapping.
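For illustration, below is a compact userspace C11 analogue of the
counter protocol described above (purely illustrative: the names are
made up and sched_yield() stands in for cpu_relax(); the actual kernel
implementation, with lockdep annotations, is in the diff that follows):
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <sched.h>
    #define PAGE_UNLOCKED	 0
    #define PAGE_WRLOCKED	-1
    /* Readers may sleep while holding the lock; they only spin on a writer. */
    static void page_read_lock(atomic_int *lock)
    {
    	int old = atomic_load_explicit(lock, memory_order_relaxed);
    	for (;;) {
    		if (old == PAGE_WRLOCKED) {
    			sched_yield();	/* cpu_relax() stand-in */
    			old = atomic_load_explicit(lock, memory_order_relaxed);
    			continue;
    		}
    		/* on failure 'old' is refreshed with the current value */
    		if (atomic_compare_exchange_weak_explicit(lock, &old, old + 1,
    							  memory_order_acquire,
    							  memory_order_relaxed))
    			return;
    	}
    }
    static void page_read_unlock(atomic_int *lock)
    {
    	atomic_fetch_sub_explicit(lock, 1, memory_order_release);
    }
    /* Writers never spin: try-lock only, and only when there are no readers. */
    static bool page_try_write_lock(atomic_int *lock)
    {
    	int expected = PAGE_UNLOCKED;
    	return atomic_compare_exchange_strong_explicit(lock, &expected,
    						       PAGE_WRLOCKED,
    						       memory_order_acquire,
    						       memory_order_relaxed);
    }
    static void page_write_unlock(atomic_int *lock)
    {
    	atomic_store_explicit(lock, PAGE_UNLOCKED, memory_order_release);
    }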
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
---
mm/zsmalloc.c | 183 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 128 insertions(+), 55 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c82c24b8e6a4..80261bb78cf8 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -226,6 +226,9 @@ struct zs_pool {
/* protect page/zspage migration */
rwlock_t lock;
atomic_t compaction_in_progress;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lock_class_key lockdep_key;
+#endif
};
static void pool_write_unlock(struct zs_pool *pool)
@@ -292,6 +295,9 @@ static inline void free_zpdesc(struct zpdesc *zpdesc)
__free_page(page);
}
+#define ZS_PAGE_UNLOCKED 0
+#define ZS_PAGE_WRLOCKED -1
+
struct zspage {
struct {
unsigned int huge:HUGE_BITS;
@@ -304,7 +310,11 @@ struct zspage {
struct zpdesc *first_zpdesc;
struct list_head list; /* fullness list */
struct zs_pool *pool;
- rwlock_t lock;
+ atomic_t lock;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map lockdep_map;
+#endif
};
struct mapping_area {
@@ -314,6 +324,88 @@ struct mapping_area {
enum zs_mapmode vm_mm; /* mapping mode */
};
+static void zspage_lock_init(struct zspage *zspage)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ lockdep_init_map(&zspage->lockdep_map, "zsmalloc-page",
+ &zspage->pool->lockdep_key, 0);
+#endif
+
+ atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
+}
+
+/*
+ * zspage locking rules:
+ *
+ * 1) writer-lock is exclusive
+ *
+ * 2) writer-lock owner cannot sleep
+ *
+ * 3) writer-lock owner cannot spin waiting for the lock
+ * - caller (e.g. compaction and migration) must check return value and
+ * handle locking failures
+ * - there is only TRY variant of writer-lock function
+ *
+ * 4) reader-lock owners (multiple) can sleep
+ *
+ * 5) reader-lock owners can spin waiting for the lock, in any context
+ * - existing readers (even preempted ones) don't block new readers
+ * - writer-lock owners never sleep, always unlock at some point
+ */
+static void zspage_read_lock(struct zspage *zspage)
+{
+ atomic_t *lock = &zspage->lock;
+ int old = atomic_read_acquire(lock);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
+#endif
+
+ do {
+ if (old == ZS_PAGE_WRLOCKED) {
+ cpu_relax();
+ old = atomic_read_acquire(lock);
+ continue;
+ }
+ } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
+}
+
+static void zspage_read_unlock(struct zspage *zspage)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_release(&zspage->lockdep_map, _RET_IP_);
+#endif
+ atomic_dec_return_release(&zspage->lock);
+}
+
+static __must_check bool zspage_try_write_lock(struct zspage *zspage)
+{
+ atomic_t *lock = &zspage->lock;
+ int old = ZS_PAGE_UNLOCKED;
+
+ WARN_ON_ONCE(preemptible());
+
+ preempt_disable();
+ if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire(&zspage->lockdep_map, 0, 1, _RET_IP_);
+#endif
+ return true;
+ }
+
+ preempt_enable();
+ return false;
+}
+
+static void zspage_write_unlock(struct zspage *zspage)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_release(&zspage->lockdep_map, _RET_IP_);
+#endif
+ atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
+ preempt_enable();
+}
+
/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
static void SetZsHugePage(struct zspage *zspage)
{
@@ -325,12 +417,6 @@ static bool ZsHugePage(struct zspage *zspage)
return zspage->huge;
}
-static void migrate_lock_init(struct zspage *zspage);
-static void migrate_read_lock(struct zspage *zspage);
-static void migrate_read_unlock(struct zspage *zspage);
-static void migrate_write_lock(struct zspage *zspage);
-static void migrate_write_unlock(struct zspage *zspage);
-
#ifdef CONFIG_COMPACTION
static void kick_deferred_free(struct zs_pool *pool);
static void init_deferred_free(struct zs_pool *pool);
@@ -1026,7 +1112,9 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
return NULL;
zspage->magic = ZSPAGE_MAGIC;
- migrate_lock_init(zspage);
+ zspage->pool = pool;
+ zspage->class = class->index;
+ zspage_lock_init(zspage);
for (i = 0; i < class->pages_per_zspage; i++) {
struct zpdesc *zpdesc;
@@ -1049,8 +1137,6 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
create_page_chain(class, zspage, zpdescs);
init_zspage(class, zspage);
- zspage->pool = pool;
- zspage->class = class->index;
return zspage;
}
@@ -1251,7 +1337,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
* zs_unmap_object API so delegate the locking from class to zspage
* which is smaller granularity.
*/
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
pool_read_unlock(pool);
class = zspage_class(pool, zspage);
@@ -1311,7 +1397,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
}
local_unlock(&zs_map_area.lock);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
@@ -1705,18 +1791,18 @@ static void lock_zspage(struct zspage *zspage)
/*
* Pages we haven't locked yet can be migrated off the list while we're
* trying to lock them, so we need to be careful and only attempt to
- * lock each page under migrate_read_lock(). Otherwise, the page we lock
+ * lock each page under zspage_read_lock(). Otherwise, the page we lock
* may no longer belong to the zspage. This means that we may wait for
* the wrong page to unlock, so we must take a reference to the page
- * prior to waiting for it to unlock outside migrate_read_lock().
+ * prior to waiting for it to unlock outside zspage_read_lock().
*/
while (1) {
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
zpdesc = get_first_zpdesc(zspage);
if (zpdesc_trylock(zpdesc))
break;
zpdesc_get(zpdesc);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
zpdesc_wait_locked(zpdesc);
zpdesc_put(zpdesc);
}
@@ -1727,41 +1813,16 @@ static void lock_zspage(struct zspage *zspage)
curr_zpdesc = zpdesc;
} else {
zpdesc_get(zpdesc);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
zpdesc_wait_locked(zpdesc);
zpdesc_put(zpdesc);
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
}
}
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
}
#endif /* CONFIG_COMPACTION */
-static void migrate_lock_init(struct zspage *zspage)
-{
- rwlock_init(&zspage->lock);
-}
-
-static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock)
-{
- read_lock(&zspage->lock);
-}
-
-static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
-{
- read_unlock(&zspage->lock);
-}
-
-static void migrate_write_lock(struct zspage *zspage)
-{
- write_lock(&zspage->lock);
-}
-
-static void migrate_write_unlock(struct zspage *zspage)
-{
- write_unlock(&zspage->lock);
-}
-
#ifdef CONFIG_COMPACTION
static const struct movable_operations zsmalloc_mops;
@@ -1803,7 +1864,7 @@ static bool zs_page_isolate(struct page *page, isolate_mode_t mode)
}
static int zs_page_migrate(struct page *newpage, struct page *page,
- enum migrate_mode mode)
+ enum migrate_mode mode)
{
struct zs_pool *pool;
struct size_class *class;
@@ -1819,15 +1880,12 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
VM_BUG_ON_PAGE(!zpdesc_is_isolated(zpdesc), zpdesc_page(zpdesc));
- /* We're committed, tell the world that this is a Zsmalloc page. */
- __zpdesc_set_zsmalloc(newzpdesc);
-
/* The page is locked, so this pointer must remain valid */
zspage = get_zspage(zpdesc);
pool = zspage->pool;
/*
- * The pool lock protects the race between zpage migration
+ * The pool->lock protects the race between zpage migration
* and zs_free.
*/
pool_write_lock(pool);
@@ -1837,8 +1895,15 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* the class lock protects zpage alloc/free in the zspage.
*/
size_class_lock(class);
- /* the migrate_write_lock protects zpage access via zs_map_object */
- migrate_write_lock(zspage);
+ /* the zspage write_lock protects zpage access via zs_map_object */
+ if (!zspage_try_write_lock(zspage)) {
+ size_class_unlock(class);
+ pool_write_unlock(pool);
+ return -EINVAL;
+ }
+
+ /* We're committed, tell the world that this is a Zsmalloc page. */
+ __zpdesc_set_zsmalloc(newzpdesc);
offset = get_first_obj_offset(zpdesc);
s_addr = kmap_local_zpdesc(zpdesc);
@@ -1869,7 +1934,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
*/
pool_write_unlock(pool);
size_class_unlock(class);
- migrate_write_unlock(zspage);
+ zspage_write_unlock(zspage);
zpdesc_get(newzpdesc);
if (zpdesc_zone(newzpdesc) != zpdesc_zone(zpdesc)) {
@@ -2005,9 +2070,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
if (!src_zspage)
break;
- migrate_write_lock(src_zspage);
+ if (!zspage_try_write_lock(src_zspage))
+ break;
+
migrate_zspage(pool, src_zspage, dst_zspage);
- migrate_write_unlock(src_zspage);
+ zspage_write_unlock(src_zspage);
fg = putback_zspage(class, src_zspage);
if (fg == ZS_INUSE_RATIO_0) {
@@ -2267,7 +2334,9 @@ struct zs_pool *zs_create_pool(const char *name)
* trigger compaction manually. Thus, ignore return code.
*/
zs_register_shrinker(pool);
-
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ lockdep_register_key(&pool->lockdep_key);
+#endif
return pool;
err:
@@ -2304,6 +2373,10 @@ void zs_destroy_pool(struct zs_pool *pool)
kfree(class);
}
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ lockdep_unregister_key(&pool->lockdep_key);
+#endif
+
destroy_cache(pool);
kfree(pool->name);
kfree(pool);
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 13/18] zsmalloc: introduce new object mapping API
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (11 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 12/18] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 14/18] zram: switch to new zsmalloc " Sergey Senozhatsky
` (5 subsequent siblings)
18 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
The current object mapping API is a little cumbersome. First, it is
inconsistent: sometimes it returns with page-faults disabled and
sometimes with page-faults enabled. Second, and most importantly, it
enforces atomicity restrictions on its users. zs_map_object() has to
return a linear object address, which is not always possible because
some objects span multiple physical (non-contiguous) pages. For such
objects zsmalloc uses a per-CPU buffer into which the object's data is
copied before a pointer to that per-CPU buffer is returned to the
caller. This leads to one final issue - an extra memcpy(). Since the
caller gets a pointer to the per-CPU buffer, it can memcpy() data only
into that buffer, and during zs_unmap_object() zsmalloc will memcpy()
from that per-CPU buffer to the physical pages the object in question
spans.
The new API splits functions by access mode (a short caller-side sketch
follows the list):
- zs_obj_read_begin(handle, local_copy)
Returns a pointer to the handle's memory. For objects that span two
physical pages, the local_copy buffer is used to store the object's
data before the address is returned to the caller. Otherwise the
object's page is kmap_local mapped directly.
- zs_obj_read_end(handle, buf)
Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
- zs_obj_write(handle, buf, len)
Copies len bytes from the compression buffer to the handle's memory
(taking care of objects that span two pages). This does not need any
additional (e.g. per-CPU) buffers and writes the data directly to
zsmalloc pool pages.
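A hypothetical caller-side sketch (not part of this patch; the helper
name and buffers are illustrative, while the zs_obj_*() signatures are
the ones added to zsmalloc.h below):
    static int copy_object_data(struct zs_pool *pool, unsigned long handle,
    			    void *buf, size_t len)
    {
    	void *local_copy, *src;
    	/* backing store for objects that span two physical pages */
    	local_copy = kvmalloc(PAGE_SIZE, GFP_KERNEL);
    	if (!local_copy)
    		return -ENOMEM;
    	/* returns either a kmap_local address or local_copy */
    	src = zs_obj_read_begin(pool, handle, local_copy);
    	memcpy(buf, src, len);
    	zs_obj_read_end(pool, handle, src);
    	/* the write path copies straight into pool pages, no temp buffer */
    	zs_obj_write(pool, handle, buf, len);
    	kvfree(local_copy);
    	return 0;
    }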
In terms of performance, on a synthetic and completely reproducible
test that allocates a fixed number of fixed-size objects and then
iterates over those objects, first mapping them in RO and then in RW
mode:
OLD API
=======
10 runs
369,205,778 instructions # 0.80 insn per cycle
40,467,926 branches # 113.732 M/sec
369,002,122 instructions # 0.62 insn per cycle
40,426,145 branches # 189.361 M/sec
369,036,706 instructions # 0.63 insn per cycle
40,430,860 branches # 204.105 M/sec
[..]
NEW API
=======
10 runs
265,799,293 instructions # 0.51 insn per cycle
29,834,567 branches # 170.281 M/sec
265,765,970 instructions # 0.55 insn per cycle
29,829,019 branches # 161.602 M/sec
265,764,702 instructions # 0.51 insn per cycle
29,828,015 branches # 189.677 M/sec
[..]
Difference at 95.0% confidence
-1.03219e+08 +/- 55308.7
-27.9705% +/- 0.0149878%
(Student's t, pooled s = 58864.4)
The old API will stay around until the remaining users switch to the
new one. After that we'll also remove the zsmalloc per-CPU buffer and
its CPU hotplug handling.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
include/linux/zsmalloc.h | 8 +++
mm/zsmalloc.c | 129 +++++++++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+)
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..7d70983cf398 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -58,4 +58,12 @@ unsigned long zs_compact(struct zs_pool *pool);
unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+ void *local_copy);
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem);
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem, size_t mem_len);
+
#endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 80261bb78cf8..e40268f3b655 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1401,6 +1401,135 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+ void *local_copy)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+ void *addr;
+
+ WARN_ON(in_interrupt());
+
+ /* Guarantee we can get zspage from handle safely */
+ pool_read_lock(pool);
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+
+ /* Make sure migration doesn't move any pages in this zspage */
+ zspage_read_lock(zspage);
+ pool_read_unlock(pool);
+
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ /* this object is contained entirely within a page */
+ addr = kmap_local_zpdesc(zpdesc);
+ addr += off;
+ } else {
+ size_t sizes[2];
+
+ /* this object spans two pages */
+ sizes[0] = PAGE_SIZE - off;
+ sizes[1] = class->size - sizes[0];
+ addr = local_copy;
+
+ memcpy_from_page(addr, zpdesc_page(zpdesc),
+ off, sizes[0]);
+ zpdesc = get_next_zpdesc(zpdesc);
+ memcpy_from_page(addr + sizes[0],
+ zpdesc_page(zpdesc),
+ 0, sizes[1]);
+ }
+
+ if (!ZsHugePage(zspage))
+ addr += ZS_HANDLE_SIZE;
+
+ return addr;
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_begin);
+
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ if (!ZsHugePage(zspage))
+ off += ZS_HANDLE_SIZE;
+ handle_mem -= off;
+ kunmap_local(handle_mem);
+ }
+
+ zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_end);
+
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem, size_t mem_len)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+
+ WARN_ON(in_interrupt());
+
+ /* Guarantee we can get zspage from handle safely */
+ pool_read_lock(pool);
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+
+ /* Make sure migration doesn't move any pages in this zspage */
+ zspage_read_lock(zspage);
+ pool_read_unlock(pool);
+
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ /* this object is contained entirely within a page */
+ void *dst = kmap_local_zpdesc(zpdesc);
+
+ if (!ZsHugePage(zspage))
+ off += ZS_HANDLE_SIZE;
+ memcpy(dst + off, handle_mem, mem_len);
+ kunmap_local(dst);
+ } else {
+ /* this object spans two pages */
+ size_t sizes[2];
+
+ off += ZS_HANDLE_SIZE;
+ sizes[0] = PAGE_SIZE - off;
+ sizes[1] = mem_len - sizes[0];
+
+ memcpy_to_page(zpdesc_page(zpdesc), off,
+ handle_mem, sizes[0]);
+ zpdesc = get_next_zpdesc(zpdesc);
+ memcpy_to_page(zpdesc_page(zpdesc), 0,
+ handle_mem + sizes[0], sizes[1]);
+ }
+
+ zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_write);
+
/**
* zs_huge_class_size() - Returns the size (in bytes) of the first huge
* zsmalloc &size_class.
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 14/18] zram: switch to new zsmalloc object mapping API
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (12 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 13/18] zsmalloc: introduce new object mapping API Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 15/18] zram: permit reclaim in zstd custom allocator Sergey Senozhatsky
` (4 subsequent siblings)
18 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Use the new read/write zsmalloc object API. For cases when an RO-mapped
object spans two physical pages (which requires a temp buffer),
compression streams now carry one extra physical page around.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 4 +++-
drivers/block/zram/zcomp.h | 2 ++
drivers/block/zram/zram_drv.c | 28 ++++++++++------------------
3 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index c393243eeb5c..61a9c3ed6f7a 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -46,6 +46,7 @@ static const struct zcomp_ops *backends[] = {
static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm)
{
comp->ops->destroy_ctx(&zstrm->ctx);
+ vfree(zstrm->local_copy);
vfree(zstrm->buffer);
zstrm->buffer = NULL;
}
@@ -59,12 +60,13 @@ static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
if (ret)
return ret;
+ zstrm->local_copy = vzalloc(PAGE_SIZE);
/*
* allocate 2 pages. 1 for compressed data, plus 1 extra for the
* case when compressed size is larger than the original one
*/
zstrm->buffer = vzalloc(2 * PAGE_SIZE);
- if (!zstrm->buffer) {
+ if (!zstrm->buffer || !zstrm->local_copy) {
zcomp_strm_free(comp, zstrm);
return -ENOMEM;
}
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index 23b8236b9090..25339ed1e07e 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -34,6 +34,8 @@ struct zcomp_strm {
struct mutex lock;
/* compression buffer */
void *buffer;
+ /* local copy of handle memory */
+ void *local_copy;
struct zcomp_ctx ctx;
};
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7c4c296181a8..c6310077c221 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1561,11 +1561,11 @@ static int read_incompressible_page(struct zram *zram, struct page *page,
void *src, *dst;
handle = zram_get_handle(zram, index);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_obj_read_begin(zram->mem_pool, handle, NULL);
dst = kmap_local_page(page);
copy_page(dst, src);
kunmap_local(dst);
- zs_unmap_object(zram->mem_pool, handle);
+ zs_obj_read_end(zram->mem_pool, handle, src);
return 0;
}
@@ -1583,11 +1583,11 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
prio = zram_get_priority(zram, index);
zstrm = zcomp_stream_get(zram->comps[prio]);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
dst = kmap_local_page(page);
ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
kunmap_local(dst);
- zs_unmap_object(zram->mem_pool, handle);
+ zs_obj_read_end(zram->mem_pool, handle, src);
zcomp_stream_put(zstrm);
return ret;
@@ -1683,7 +1683,7 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
u32 index)
{
unsigned long handle;
- void *src, *dst;
+ void *src;
/*
* This function is called from preemptible context so we don't need
@@ -1700,11 +1700,9 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
return -ENOMEM;
}
- dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
src = kmap_local_page(page);
- memcpy(dst, src, PAGE_SIZE);
+ zs_obj_write(zram->mem_pool, handle, src, PAGE_SIZE);
kunmap_local(src);
- zs_unmap_object(zram->mem_pool, handle);
zram_slot_lock(zram, index);
zram_set_flag(zram, index, ZRAM_HUGE);
@@ -1725,7 +1723,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
int ret = 0;
unsigned long handle;
unsigned int comp_len;
- void *dst, *mem;
+ void *mem;
struct zcomp_strm *zstrm;
unsigned long element;
bool same_filled;
@@ -1771,11 +1769,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
return -ENOMEM;
}
- dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
-
- memcpy(dst, zstrm->buffer, comp_len);
+ zs_obj_write(zram->mem_pool, handle, zstrm->buffer, comp_len);
zcomp_stream_put(zstrm);
- zs_unmap_object(zram->mem_pool, handle);
zram_slot_lock(zram, index);
zram_set_handle(zram, index, handle);
@@ -1887,7 +1882,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
unsigned int comp_len_new;
unsigned int class_index_old;
unsigned int class_index_new;
- void *src, *dst;
+ void *src;
int ret = 0;
handle_old = zram_get_handle(zram, index);
@@ -1997,12 +1992,9 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
return PTR_ERR((void *)handle_new);
}
- dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
- memcpy(dst, zstrm->buffer, comp_len_new);
+ zs_obj_write(zram->mem_pool, handle_new, zstrm->buffer, comp_len_new);
zcomp_stream_put(zstrm);
- zs_unmap_object(zram->mem_pool, handle_new);
-
zram_free_page(zram, index);
zram_set_handle(zram, index, handle_new);
zram_set_obj_size(zram, index, comp_len_new);
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 15/18] zram: permit reclaim in zstd custom allocator
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (13 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 14/18] zram: switch to new zsmalloc " Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 16/18] zram: do not leak page on recompress_store error path Sergey Senozhatsky
` (3 subsequent siblings)
18 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
When configured with pre-trained compression/decompression dictionary
support, zstd requires a custom memory allocator, which it calls
internally from its compression/decompression routines. Previously that
meant allocation from atomic context (either under the entry spin-lock,
under the per-CPU local lock, or both). Now, with non-atomic zram
read()/write(), those limitations are relaxed and we can allow direct
and indirect reclaim.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/backend_zstd.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c
index 1184c0036f44..53431251ea62 100644
--- a/drivers/block/zram/backend_zstd.c
+++ b/drivers/block/zram/backend_zstd.c
@@ -24,19 +24,14 @@ struct zstd_params {
/*
* For C/D dictionaries we need to provide zstd with zstd_custom_mem,
* which zstd uses internally to allocate/free memory when needed.
- *
- * This means that allocator.customAlloc() can be called from zcomp_compress()
- * under local-lock (per-CPU compression stream), in which case we must use
- * GFP_ATOMIC.
- *
- * Another complication here is that we can be configured as a swap device.
*/
static void *zstd_custom_alloc(void *opaque, size_t size)
{
- if (!preemptible())
+ /* Technically this should not happen */
+ if (WARN_ON_ONCE(!preemptible()))
return kvzalloc(size, GFP_ATOMIC);
- return kvzalloc(size, __GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
+ return kvzalloc(size, GFP_NOIO | __GFP_NOWARN);
}
static void zstd_custom_free(void *opaque, void *address)
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 16/18] zram: do not leak page on recompress_store error path
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (14 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 15/18] zram: permit reclaim in zstd custom allocator Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 17/18] zram: do not leak page on writeback_store " Sergey Senozhatsky
` (2 subsequent siblings)
18 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Ensure the page used for local object data is freed on the error path.
Fixes: 3f909a60cec1 ("zram: rework recompress target selection strategy")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index c6310077c221..f4644c29f74e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2017,7 +2017,7 @@ static ssize_t recompress_store(struct device *dev,
struct zram_pp_slot *pps;
u32 mode = 0, threshold = 0;
u32 prio, prio_max;
- struct page *page;
+ struct page *page = NULL;
ssize_t ret;
prio = ZRAM_SECONDARY_COMP;
@@ -2161,9 +2161,9 @@ static ssize_t recompress_store(struct device *dev,
cond_resched();
}
- __free_page(page);
-
release_init_lock:
+ if (page)
+ __free_page(page);
release_pp_ctl(zram, ctl);
atomic_set(&zram->pp_in_progress, 0);
up_read(&zram->init_lock);
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 17/18] zram: do not leak page on writeback_store error path
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (15 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 16/18] zram: do not leak page on recompress_store error path Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 18/18] zram: add might_sleep to zcomp API Sergey Senozhatsky
2025-02-13 0:09 ` [PATCH v5 00/18] zsmalloc/zram: there be preemption Andrew Morton
18 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Ensure the page used for local object data is freed on the error path.
Fixes: 330edc2bc059 ("zram: rework writeback target selection strategy")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f4644c29f74e..10239aea5ce0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -787,7 +787,7 @@ static ssize_t writeback_store(struct device *dev,
unsigned long index = 0;
struct bio bio;
struct bio_vec bio_vec;
- struct page *page;
+ struct page *page = NULL;
ssize_t ret = len;
int mode, err;
unsigned long blk_idx = 0;
@@ -929,8 +929,10 @@ static ssize_t writeback_store(struct device *dev,
if (blk_idx)
free_block_bdev(zram, blk_idx);
- __free_page(page);
+
release_init_lock:
+ if (page)
+ __free_page(page);
release_pp_ctl(zram, ctl);
atomic_set(&zram->pp_in_progress, 0);
up_read(&zram->init_lock);
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 18/18] zram: add might_sleep to zcomp API
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (16 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 17/18] zram: do not leak page on writeback_store " Sergey Senozhatsky
@ 2025-02-12 6:27 ` Sergey Senozhatsky
2025-02-13 0:09 ` [PATCH v5 00/18] zsmalloc/zram: there be preemption Andrew Morton
18 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 6:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
Explicitly state that zcomp compress/decompress must be called from a
non-atomic context.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 61a9c3ed6f7a..217a77e09dc7 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -148,6 +148,7 @@ int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
};
int ret;
+ might_sleep();
ret = comp->ops->compress(comp->params, &zstrm->ctx, &req);
if (!ret)
*dst_len = req.dst_len;
@@ -164,6 +165,7 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
.dst_len = PAGE_SIZE,
};
+ might_sleep();
return comp->ops->decompress(comp->params, &zstrm->ctx, &req);
}
--
2.48.1.502.g6dc24dfdaf-goog
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 02/18] zram: permit preemption with active compression stream
2025-02-12 6:27 ` [PATCH v5 02/18] zram: permit preemption with active compression stream Sergey Senozhatsky
@ 2025-02-12 16:01 ` Yosry Ahmed
2025-02-13 1:04 ` Sergey Senozhatsky
0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2025-02-12 16:01 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Kairui Song, Minchan Kim, linux-mm, linux-kernel
On Wed, Feb 12, 2025 at 03:27:00PM +0900, Sergey Senozhatsky wrote:
> Currently, per-CPU stream access is done from a non-preemptible
> (atomic) section, which imposes the same atomicity requirements on
> compression backends as entry spin-lock, and makes it impossible
> to use algorithms that can schedule/wait/sleep during compression
> and decompression.
>
> Switch to preemptible per-CPU model, similar to the one used
> in zswap. Instead of a per-CPU local lock, each stream carries
> a mutex which is locked throughout entire time zram uses it
> for compression or decompression, so that cpu-dead event waits
> for zram to stop using a particular per-CPU stream and release
> it.
>
> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
> drivers/block/zram/zcomp.c | 36 +++++++++++++++++++++++++----------
> drivers/block/zram/zcomp.h | 6 +++---
> drivers/block/zram/zram_drv.c | 20 +++++++++----------
> 3 files changed, 39 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
> index bb514403e305..e83dd9a80a81 100644
> --- a/drivers/block/zram/zcomp.c
> +++ b/drivers/block/zram/zcomp.c
> @@ -7,6 +7,7 @@
> #include <linux/wait.h>
> #include <linux/sched.h>
> #include <linux/cpu.h>
> +#include <linux/cpuhotplug.h>
What code changes prompt this?
> #include <linux/crypto.h>
> #include <linux/vmalloc.h>
>
> @@ -54,6 +55,7 @@ static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
> {
> int ret;
>
> + mutex_init(&zstrm->lock);
I don't think we can initialize the mutex in the hotplug callback. I
think the following scenario is possible:
CPU #1 CPU #2
zcomp_stream_get()
zstrm = raw_cpu_ptr()
/* task migrated to CPU 2 */
CPU goes offline
zcomp_cpu_dead()
mutex_lock()
..
mutex_unlock()
/* migrated task continues */
zcomp_stream_get()
mutex_lock()
CPU goes online
mutex_init()
mutex_unlock() /* problem */
In this case we'll end up initializing the mutex on CPU #1 while CPU #2
has it locked. When we unlock it on CPU #2 we will corrupt it AFAICT.
This is why I moved the mutex initialization out of the hotplug callback
in zswap. I suspect to do something similar for zram we'd need to do it
in zcomp_init()?
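For example, something along these lines (a hypothetical sketch; the
helper name is made up and the per-CPU streams are assumed to be
already allocated):
    static void zcomp_init_stream_locks(struct zcomp *comp)
    {
    	int cpu;
    	/* init once, for every possible CPU, outside the hotplug callbacks */
    	for_each_possible_cpu(cpu) {
    		struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
    		mutex_init(&zstrm->lock);
    	}
    }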
> ret = comp->ops->create_ctx(comp->params, &zstrm->ctx);
> if (ret)
> return ret;
> @@ -109,13 +111,29 @@ ssize_t zcomp_available_show(const char *comp, char *buf)
>
> struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
> {
> - local_lock(&comp->stream->lock);
> - return this_cpu_ptr(comp->stream);
> + for (;;) {
> + struct zcomp_strm *zstrm = raw_cpu_ptr(comp->stream);
> +
> + /*
> + * Inspired by zswap
> + *
> + * stream is returned with ->mutex locked which prevents
> + * cpu_dead() from releasing this stream under us, however
> + * there is still a race window between raw_cpu_ptr() and
> + * mutex_lock(), during which we could have been migrated
> + * to a CPU that has already destroyed its stream. If so
"we could have been migrated from** a CPU that has already destroyed its
stream"? Right?
> + * then unlock and re-try on the current CPU.
> + */
> + mutex_lock(&zstrm->lock);
> + if (likely(zstrm->buffer))
> + return zstrm;
> + mutex_unlock(&zstrm->lock);
> + }
> }
>
> -void zcomp_stream_put(struct zcomp *comp)
> +void zcomp_stream_put(struct zcomp_strm *zstrm)
> {
> - local_unlock(&comp->stream->lock);
> + mutex_unlock(&zstrm->lock);
> }
>
> int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
> @@ -151,12 +169,9 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
> int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
> {
> struct zcomp *comp = hlist_entry(node, struct zcomp, node);
> - struct zcomp_strm *zstrm;
> + struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
> int ret;
>
> - zstrm = per_cpu_ptr(comp->stream, cpu);
> - local_lock_init(&zstrm->lock);
> -
> ret = zcomp_strm_init(comp, zstrm);
> if (ret)
> pr_err("Can't allocate a compression stream\n");
> @@ -166,10 +181,11 @@ int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
> int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
> {
> struct zcomp *comp = hlist_entry(node, struct zcomp, node);
> - struct zcomp_strm *zstrm;
> + struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
>
> - zstrm = per_cpu_ptr(comp->stream, cpu);
> + mutex_lock(&zstrm->lock);
> zcomp_strm_free(comp, zstrm);
> + mutex_unlock(&zstrm->lock);
> return 0;
> }
>
> diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
> index ad5762813842..23b8236b9090 100644
> --- a/drivers/block/zram/zcomp.h
> +++ b/drivers/block/zram/zcomp.h
> @@ -3,7 +3,7 @@
> #ifndef _ZCOMP_H_
> #define _ZCOMP_H_
>
> -#include <linux/local_lock.h>
> +#include <linux/mutex.h>
>
> #define ZCOMP_PARAM_NO_LEVEL INT_MIN
>
> @@ -31,7 +31,7 @@ struct zcomp_ctx {
> };
>
> struct zcomp_strm {
> - local_lock_t lock;
> + struct mutex lock;
> /* compression buffer */
> void *buffer;
> struct zcomp_ctx ctx;
> @@ -77,7 +77,7 @@ struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params);
> void zcomp_destroy(struct zcomp *comp);
>
> struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
> -void zcomp_stream_put(struct zcomp *comp);
> +void zcomp_stream_put(struct zcomp_strm *zstrm);
>
> int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
> const void *src, unsigned int *dst_len);
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 3708436f1d1f..43f460a45e3e 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1608,7 +1608,7 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
> ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
> kunmap_local(dst);
> zs_unmap_object(zram->mem_pool, handle);
> - zcomp_stream_put(zram->comps[prio]);
> + zcomp_stream_put(zstrm);
>
> return ret;
> }
> @@ -1769,14 +1769,14 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
> kunmap_local(mem);
>
> if (unlikely(ret)) {
> - zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
> + zcomp_stream_put(zstrm);
> pr_err("Compression failed! err=%d\n", ret);
> zs_free(zram->mem_pool, handle);
> return ret;
> }
>
> if (comp_len >= huge_class_size) {
> - zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
> + zcomp_stream_put(zstrm);
> return write_incompressible_page(zram, page, index);
> }
>
> @@ -1800,7 +1800,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
> __GFP_HIGHMEM |
> __GFP_MOVABLE);
> if (IS_ERR_VALUE(handle)) {
> - zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
> + zcomp_stream_put(zstrm);
> atomic64_inc(&zram->stats.writestall);
> handle = zs_malloc(zram->mem_pool, comp_len,
> GFP_NOIO | __GFP_HIGHMEM |
> @@ -1812,7 +1812,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
> }
>
> if (!zram_can_store_page(zram)) {
> - zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
> + zcomp_stream_put(zstrm);
> zs_free(zram->mem_pool, handle);
> return -ENOMEM;
> }
> @@ -1820,7 +1820,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
> dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
>
> memcpy(dst, zstrm->buffer, comp_len);
> - zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
> + zcomp_stream_put(zstrm);
> zs_unmap_object(zram->mem_pool, handle);
>
> zram_slot_lock(zram, index);
> @@ -1979,7 +1979,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> kunmap_local(src);
>
> if (ret) {
> - zcomp_stream_put(zram->comps[prio]);
> + zcomp_stream_put(zstrm);
> return ret;
> }
>
> @@ -1989,7 +1989,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> /* Continue until we make progress */
> if (class_index_new >= class_index_old ||
> (threshold && comp_len_new >= threshold)) {
> - zcomp_stream_put(zram->comps[prio]);
> + zcomp_stream_put(zstrm);
> continue;
> }
>
> @@ -2047,13 +2047,13 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> __GFP_HIGHMEM |
> __GFP_MOVABLE);
> if (IS_ERR_VALUE(handle_new)) {
> - zcomp_stream_put(zram->comps[prio]);
> + zcomp_stream_put(zstrm);
> return PTR_ERR((void *)handle_new);
> }
>
> dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
> memcpy(dst, zstrm->buffer, comp_len_new);
> - zcomp_stream_put(zram->comps[prio]);
> + zcomp_stream_put(zstrm);
>
> zs_unmap_object(zram->mem_pool, handle_new);
>
> --
> 2.48.1.502.g6dc24dfdaf-goog
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 03/18] zram: remove crypto include
2025-02-12 6:27 ` [PATCH v5 03/18] zram: remove crypto include Sergey Senozhatsky
@ 2025-02-12 16:13 ` Yosry Ahmed
2025-02-13 0:53 ` Sergey Senozhatsky
0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2025-02-12 16:13 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Kairui Song, Minchan Kim, linux-mm, linux-kernel
On Wed, Feb 12, 2025 at 03:27:01PM +0900, Sergey Senozhatsky wrote:
> Remove a leftover crypto header include.
The subject and log are not very descriptive imo. We stop using
CRYPTO_MAX_ALG_NAME and define our own limit in zram, and removing the
include is just an artifact of that.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
> drivers/block/zram/zcomp.c | 1 -
> drivers/block/zram/zram_drv.c | 4 +++-
> drivers/block/zram/zram_drv.h | 1 -
> 3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
> index e83dd9a80a81..c393243eeb5c 100644
> --- a/drivers/block/zram/zcomp.c
> +++ b/drivers/block/zram/zcomp.c
> @@ -8,7 +8,6 @@
> #include <linux/sched.h>
> #include <linux/cpu.h>
> #include <linux/cpuhotplug.h>
> -#include <linux/crypto.h>
> #include <linux/vmalloc.h>
>
> #include "zcomp.h"
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 43f460a45e3e..12fb260e3355 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
> static int zram_major;
> static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>
> +#define ZRAM_MAX_ALGO_NAME_SZ 64
> +
> /* Module params (documentation at end) */
> static unsigned int num_devices = 1;
> /*
> @@ -1149,7 +1151,7 @@ static int __comp_algorithm_store(struct zram *zram, u32 prio, const char *buf)
> size_t sz;
>
> sz = strlen(buf);
> - if (sz >= CRYPTO_MAX_ALG_NAME)
> + if (sz >= ZRAM_MAX_ALGO_NAME_SZ)
> return -E2BIG;
>
> compressor = kstrdup(buf, GFP_KERNEL);
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 63b933059cb6..97c98fa07954 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -17,7 +17,6 @@
>
> #include <linux/rwsem.h>
> #include <linux/zsmalloc.h>
> -#include <linux/crypto.h>
>
> #include "zcomp.h"
>
> --
> 2.48.1.502.g6dc24dfdaf-goog
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 10/18] zsmalloc: factor out pool locking helpers
2025-02-12 6:27 ` [PATCH v5 10/18] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
@ 2025-02-12 16:18 ` Yosry Ahmed
2025-02-12 16:19 ` Yosry Ahmed
2025-02-13 0:57 ` Sergey Senozhatsky
0 siblings, 2 replies; 40+ messages in thread
From: Yosry Ahmed @ 2025-02-12 16:18 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Kairui Song, Minchan Kim, linux-mm, linux-kernel
On Wed, Feb 12, 2025 at 03:27:08PM +0900, Sergey Senozhatsky wrote:
> We currently have a mix of migrate_{read,write}_lock() helpers
> that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> access to which is opene-coded. Factor out pool migrate locking
> into helpers, zspage migration locking API will be renamed to
> reduce confusion.
>
> It's worth mentioning that zsmalloc locks sync not only migration,
> but also compaction.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
FWIW I don't see a lot of value in the helpers (renaming the lock is
useful tho). We open-code other locks like the class lock anyway, and
the helpers obscure the underlying lock type without adding much value
in terms of readability/conciseness.
> ---
> mm/zsmalloc.c | 63 +++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 44 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 6d0e47f7ae33..47c638df47c5 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -18,7 +18,7 @@
> /*
> * lock ordering:
> * page_lock
> - * pool->migrate_lock
> + * pool->lock
> * class->lock
> * zspage->lock
> */
> @@ -224,10 +224,35 @@ struct zs_pool {
> struct work_struct free_work;
> #endif
> /* protect page/zspage migration */
> - rwlock_t migrate_lock;
> + rwlock_t lock;
> atomic_t compaction_in_progress;
> };
>
> +static void pool_write_unlock(struct zs_pool *pool)
> +{
> + write_unlock(&pool->lock);
> +}
> +
> +static void pool_write_lock(struct zs_pool *pool)
> +{
> + write_lock(&pool->lock);
> +}
> +
> +static void pool_read_unlock(struct zs_pool *pool)
> +{
> + read_unlock(&pool->lock);
> +}
> +
> +static void pool_read_lock(struct zs_pool *pool)
> +{
> + read_lock(&pool->lock);
> +}
> +
> +static bool pool_lock_is_contended(struct zs_pool *pool)
> +{
> + return rwlock_is_contended(&pool->lock);
> +}
> +
> static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> {
> SetPagePrivate(zpdesc_page(zpdesc));
> @@ -1206,7 +1231,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> BUG_ON(in_interrupt());
>
> /* It guarantees it can get zspage from handle safely */
> - read_lock(&pool->migrate_lock);
> + pool_read_lock(pool);
> obj = handle_to_obj(handle);
> obj_to_location(obj, &zpdesc, &obj_idx);
> zspage = get_zspage(zpdesc);
> @@ -1218,7 +1243,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> * which is smaller granularity.
> */
> migrate_read_lock(zspage);
> - read_unlock(&pool->migrate_lock);
> + pool_read_unlock(pool);
>
> class = zspage_class(pool, zspage);
> off = offset_in_page(class->size * obj_idx);
> @@ -1450,16 +1475,16 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> return;
>
> /*
> - * The pool->migrate_lock protects the race with zpage's migration
> + * The pool->lock protects the race with zpage's migration
> * so it's safe to get the page from handle.
> */
> - read_lock(&pool->migrate_lock);
> + pool_read_lock(pool);
> obj = handle_to_obj(handle);
> obj_to_zpdesc(obj, &f_zpdesc);
> zspage = get_zspage(f_zpdesc);
> class = zspage_class(pool, zspage);
> spin_lock(&class->lock);
> - read_unlock(&pool->migrate_lock);
> + pool_read_unlock(pool);
>
> class_stat_sub(class, ZS_OBJS_INUSE, 1);
> obj_free(class->size, obj);
> @@ -1793,10 +1818,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> pool = zspage->pool;
>
> /*
> - * The pool migrate_lock protects the race between zpage migration
> + * The pool lock protects the race between zpage migration
> * and zs_free.
> */
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> class = zspage_class(pool, zspage);
>
> /*
> @@ -1833,7 +1858,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> * Since we complete the data copy and set up new zspage structure,
> * it's okay to release migration_lock.
> */
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
> spin_unlock(&class->lock);
> migrate_write_unlock(zspage);
>
> @@ -1956,7 +1981,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * protect the race between zpage migration and zs_free
> * as well as zpage allocation/free
> */
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> spin_lock(&class->lock);
> while (zs_can_compact(class)) {
> int fg;
> @@ -1983,14 +2008,14 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> src_zspage = NULL;
>
> if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> - || rwlock_is_contended(&pool->migrate_lock)) {
> + || pool_lock_is_contended(pool)) {
> putback_zspage(class, dst_zspage);
> dst_zspage = NULL;
>
> spin_unlock(&class->lock);
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
> cond_resched();
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> spin_lock(&class->lock);
> }
> }
> @@ -2002,7 +2027,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> putback_zspage(class, dst_zspage);
>
> spin_unlock(&class->lock);
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
>
> return pages_freed;
> }
> @@ -2014,10 +2039,10 @@ unsigned long zs_compact(struct zs_pool *pool)
> unsigned long pages_freed = 0;
>
> /*
> - * Pool compaction is performed under pool->migrate_lock so it is basically
> + * Pool compaction is performed under pool->lock so it is basically
> * single-threaded. Having more than one thread in __zs_compact()
> - * will increase pool->migrate_lock contention, which will impact other
> - * zsmalloc operations that need pool->migrate_lock.
> + * will increase pool->lock contention, which will impact other
> + * zsmalloc operations that need pool->lock.
> */
> if (atomic_xchg(&pool->compaction_in_progress, 1))
> return 0;
> @@ -2139,7 +2164,7 @@ struct zs_pool *zs_create_pool(const char *name)
> return NULL;
>
> init_deferred_free(pool);
> - rwlock_init(&pool->migrate_lock);
> + rwlock_init(&pool->lock);
> atomic_set(&pool->compaction_in_progress, 0);
>
> pool->name = kstrdup(name, GFP_KERNEL);
> --
> 2.48.1.502.g6dc24dfdaf-goog
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 10/18] zsmalloc: factor out pool locking helpers
2025-02-12 16:18 ` Yosry Ahmed
@ 2025-02-12 16:19 ` Yosry Ahmed
2025-02-13 0:57 ` Sergey Senozhatsky
1 sibling, 0 replies; 40+ messages in thread
From: Yosry Ahmed @ 2025-02-12 16:19 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Kairui Song, Minchan Kim, linux-mm, linux-kernel
On Wed, Feb 12, 2025 at 04:18:14PM +0000, Yosry Ahmed wrote:
> On Wed, Feb 12, 2025 at 03:27:08PM +0900, Sergey Senozhatsky wrote:
> > We currently have a mix of migrate_{read,write}_lock() helpers
> > that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> > access to which is opene-coded. Factor out pool migrate locking
> > into helpers, zspage migration locking API will be renamed to
> > reduce confusion.
> >
> > It's worth mentioning that zsmalloc locks sync not only migration,
> > but also compaction.
> >
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
>
> FWIW I don't see a lot of value in the helpers (renaming the lock is
> useful tho). We open-code other locks like the class lock anyway, and
> the helpers obscure the underlying lock type without adding much value
> in terms of readability/conciseness.
We use helpers for the class lock in the following change, but my point
stands for that too.
>
> > ---
> > mm/zsmalloc.c | 63 +++++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 44 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 6d0e47f7ae33..47c638df47c5 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -18,7 +18,7 @@
> > /*
> > * lock ordering:
> > * page_lock
> > - * pool->migrate_lock
> > + * pool->lock
> > * class->lock
> > * zspage->lock
> > */
> > @@ -224,10 +224,35 @@ struct zs_pool {
> > struct work_struct free_work;
> > #endif
> > /* protect page/zspage migration */
> > - rwlock_t migrate_lock;
> > + rwlock_t lock;
> > atomic_t compaction_in_progress;
> > };
> >
> > +static void pool_write_unlock(struct zs_pool *pool)
> > +{
> > + write_unlock(&pool->lock);
> > +}
> > +
> > +static void pool_write_lock(struct zs_pool *pool)
> > +{
> > + write_lock(&pool->lock);
> > +}
> > +
> > +static void pool_read_unlock(struct zs_pool *pool)
> > +{
> > + read_unlock(&pool->lock);
> > +}
> > +
> > +static void pool_read_lock(struct zs_pool *pool)
> > +{
> > + read_lock(&pool->lock);
> > +}
> > +
> > +static bool pool_lock_is_contended(struct zs_pool *pool)
> > +{
> > + return rwlock_is_contended(&pool->lock);
> > +}
> > +
> > static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> > {
> > SetPagePrivate(zpdesc_page(zpdesc));
> > @@ -1206,7 +1231,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > BUG_ON(in_interrupt());
> >
> > /* It guarantees it can get zspage from handle safely */
> > - read_lock(&pool->migrate_lock);
> > + pool_read_lock(pool);
> > obj = handle_to_obj(handle);
> > obj_to_location(obj, &zpdesc, &obj_idx);
> > zspage = get_zspage(zpdesc);
> > @@ -1218,7 +1243,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > * which is smaller granularity.
> > */
> > migrate_read_lock(zspage);
> > - read_unlock(&pool->migrate_lock);
> > + pool_read_unlock(pool);
> >
> > class = zspage_class(pool, zspage);
> > off = offset_in_page(class->size * obj_idx);
> > @@ -1450,16 +1475,16 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> > return;
> >
> > /*
> > - * The pool->migrate_lock protects the race with zpage's migration
> > + * The pool->lock protects the race with zpage's migration
> > * so it's safe to get the page from handle.
> > */
> > - read_lock(&pool->migrate_lock);
> > + pool_read_lock(pool);
> > obj = handle_to_obj(handle);
> > obj_to_zpdesc(obj, &f_zpdesc);
> > zspage = get_zspage(f_zpdesc);
> > class = zspage_class(pool, zspage);
> > spin_lock(&class->lock);
> > - read_unlock(&pool->migrate_lock);
> > + pool_read_unlock(pool);
> >
> > class_stat_sub(class, ZS_OBJS_INUSE, 1);
> > obj_free(class->size, obj);
> > @@ -1793,10 +1818,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> > pool = zspage->pool;
> >
> > /*
> > - * The pool migrate_lock protects the race between zpage migration
> > + * The pool lock protects the race between zpage migration
> > * and zs_free.
> > */
> > - write_lock(&pool->migrate_lock);
> > + pool_write_lock(pool);
> > class = zspage_class(pool, zspage);
> >
> > /*
> > @@ -1833,7 +1858,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> > * Since we complete the data copy and set up new zspage structure,
> > * it's okay to release migration_lock.
> > */
> > - write_unlock(&pool->migrate_lock);
> > + pool_write_unlock(pool);
> > spin_unlock(&class->lock);
> > migrate_write_unlock(zspage);
> >
> > @@ -1956,7 +1981,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> > * protect the race between zpage migration and zs_free
> > * as well as zpage allocation/free
> > */
> > - write_lock(&pool->migrate_lock);
> > + pool_write_lock(pool);
> > spin_lock(&class->lock);
> > while (zs_can_compact(class)) {
> > int fg;
> > @@ -1983,14 +2008,14 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> > src_zspage = NULL;
> >
> > if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> > - || rwlock_is_contended(&pool->migrate_lock)) {
> > + || pool_lock_is_contended(pool)) {
> > putback_zspage(class, dst_zspage);
> > dst_zspage = NULL;
> >
> > spin_unlock(&class->lock);
> > - write_unlock(&pool->migrate_lock);
> > + pool_write_unlock(pool);
> > cond_resched();
> > - write_lock(&pool->migrate_lock);
> > + pool_write_lock(pool);
> > spin_lock(&class->lock);
> > }
> > }
> > @@ -2002,7 +2027,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> > putback_zspage(class, dst_zspage);
> >
> > spin_unlock(&class->lock);
> > - write_unlock(&pool->migrate_lock);
> > + pool_write_unlock(pool);
> >
> > return pages_freed;
> > }
> > @@ -2014,10 +2039,10 @@ unsigned long zs_compact(struct zs_pool *pool)
> > unsigned long pages_freed = 0;
> >
> > /*
> > - * Pool compaction is performed under pool->migrate_lock so it is basically
> > + * Pool compaction is performed under pool->lock so it is basically
> > * single-threaded. Having more than one thread in __zs_compact()
> > - * will increase pool->migrate_lock contention, which will impact other
> > - * zsmalloc operations that need pool->migrate_lock.
> > + * will increase pool->lock contention, which will impact other
> > + * zsmalloc operations that need pool->lock.
> > */
> > if (atomic_xchg(&pool->compaction_in_progress, 1))
> > return 0;
> > @@ -2139,7 +2164,7 @@ struct zs_pool *zs_create_pool(const char *name)
> > return NULL;
> >
> > init_deferred_free(pool);
> > - rwlock_init(&pool->migrate_lock);
> > + rwlock_init(&pool->lock);
> > atomic_set(&pool->compaction_in_progress, 0);
> >
> > pool->name = kstrdup(name, GFP_KERNEL);
> > --
> > 2.48.1.502.g6dc24dfdaf-goog
> >
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
2025-02-12 6:27 ` [PATCH v5 12/18] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
@ 2025-02-12 17:14 ` Yosry Ahmed
2025-02-13 1:20 ` Sergey Senozhatsky
2025-02-13 11:32 ` Hillf Danton
1 sibling, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2025-02-12 17:14 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Kairui Song, Minchan Kim, linux-mm, linux-kernel
On Wed, Feb 12, 2025 at 03:27:10PM +0900, Sergey Senozhatsky wrote:
> Switch over from rwlock_t to an atomic_t variable that takes a negative
> value when the page is under migration, or positive values when the
> page is used by zsmalloc users (object map, etc.). Using a rwsem
> per-zspage is a little too memory heavy, a simple atomic_t should
> suffice.
We should also explain that rwsem cannot be used due to the locking
context (we need to hold it in an atomic context). Basically what you
explained to me before :)
>
> zspage lock is a leaf lock for zs_map_object(), where it's read-acquired.
> Since this lock now permits preemption extra care needs to be taken when
> it is write-acquired - all writers grab it in atomic context, so they
> cannot spin and wait for (potentially preempted) reader to unlock zspage.
> There are only two writers at this moment - migration and compaction. In
> both cases we use write-try-lock and bail out if zspage is read locked.
> Writers, on the other hand, never get preempted, so readers can spin
> waiting for the writer to unlock zspage.
The details are important, but I think we want to concisely state the
problem statement either before or after. Basically we want a lock that
we *never* sleep while acquiring but *can* sleep while holding in read
mode. This allows holding the lock from any context, but also being
preemptible if the context allows it.
>
> With this we can implement a preemptible object mapping.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> mm/zsmalloc.c | 183 +++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 128 insertions(+), 55 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index c82c24b8e6a4..80261bb78cf8 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -226,6 +226,9 @@ struct zs_pool {
> /* protect page/zspage migration */
> rwlock_t lock;
> atomic_t compaction_in_progress;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + struct lock_class_key lockdep_key;
> +#endif
> };
>
> static void pool_write_unlock(struct zs_pool *pool)
> @@ -292,6 +295,9 @@ static inline void free_zpdesc(struct zpdesc *zpdesc)
> __free_page(page);
> }
>
> +#define ZS_PAGE_UNLOCKED 0
> +#define ZS_PAGE_WRLOCKED -1
> +
> struct zspage {
> struct {
> unsigned int huge:HUGE_BITS;
> @@ -304,7 +310,11 @@ struct zspage {
> struct zpdesc *first_zpdesc;
> struct list_head list; /* fullness list */
> struct zs_pool *pool;
> - rwlock_t lock;
> + atomic_t lock;
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + struct lockdep_map lockdep_map;
> +#endif
> };
>
> struct mapping_area {
> @@ -314,6 +324,88 @@ struct mapping_area {
> enum zs_mapmode vm_mm; /* mapping mode */
> };
>
> +static void zspage_lock_init(struct zspage *zspage)
> +{
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + lockdep_init_map(&zspage->lockdep_map, "zsmalloc-page",
> + &zspage->pool->lockdep_key, 0);
> +#endif
> +
> + atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
> +}
> +
> +/*
> + * zspage locking rules:
Also here we need to state our key rule:
Never sleep while acquiring, preemptible while holding (if possible). The
following rules are basically how we make sure we keep this true.
> + *
> + * 1) writer-lock is exclusive
> + *
> + * 2) writer-lock owner cannot sleep
> + *
> + * 3) writer-lock owner cannot spin waiting for the lock
> + * - caller (e.g. compaction and migration) must check return value and
> + * handle locking failures
> + * - there is only TRY variant of writer-lock function
> + *
> + * 4) reader-lock owners (multiple) can sleep
> + *
> + * 5) reader-lock owners can spin waiting for the lock, in any context
> + * - existing readers (even preempted ones) don't block new readers
> + * - writer-lock owners never sleep, always unlock at some point
May I suggest something more concise and to the point?
/*
* The zspage lock can be held from atomic contexts, but it needs to remain
* preemptible when held for reading because it remains held outside of those
* atomic contexts, otherwise we unnecessarily lose preemptibility.
*
* To achieve this, the following rules are enforced on readers and writers:
*
* - Writers are blocked by both writers and readers, while readers are only
* blocked by writers (i.e. normal rwlock semantics).
*
* - Writers are always atomic (to allow readers to spin waiting for them).
*
* - Writers always use trylock (as the lock may be held by sleeping readers).
*
* - Readers may spin on the lock (as they can only wait for atomic writers).
*
* - Readers may sleep while holding the lock (as writes only use trylock).
*/
> + */
> +static void zspage_read_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old = atomic_read_acquire(lock);
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> +#endif
> +
> + do {
> + if (old == ZS_PAGE_WRLOCKED) {
> + cpu_relax();
> + old = atomic_read_acquire(lock);
> + continue;
> + }
> + } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
> +}
> +
> +static void zspage_read_unlock(struct zspage *zspage)
> +{
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_release(&zspage->lockdep_map, _RET_IP_);
> +#endif
> + atomic_dec_return_release(&zspage->lock);
> +}
> +
> +static __must_check bool zspage_try_write_lock(struct zspage *zspage)
I believe zspage_write_trylock() would be closer to the normal rwlock
naming.
> +{
> + atomic_t *lock = &zspage->lock;
> + int old = ZS_PAGE_UNLOCKED;
> +
> + WARN_ON_ONCE(preemptible());
Hmm I know I may have been the one suggesting this, but do we actually
need it? We disable preemption explicitly anyway before holding the
lock.
> +
> + preempt_disable();
> + if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_acquire(&zspage->lockdep_map, 0, 1, _RET_IP_);
> +#endif
> + return true;
> + }
> +
> + preempt_enable();
> + return false;
> +}
> +
> +static void zspage_write_unlock(struct zspage *zspage)
> +{
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_release(&zspage->lockdep_map, _RET_IP_);
> +#endif
> + atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
> + preempt_enable();
> +}
> +
> /* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> static void SetZsHugePage(struct zspage *zspage)
> {
> @@ -325,12 +417,6 @@ static bool ZsHugePage(struct zspage *zspage)
> return zspage->huge;
> }
>
> -static void migrate_lock_init(struct zspage *zspage);
> -static void migrate_read_lock(struct zspage *zspage);
> -static void migrate_read_unlock(struct zspage *zspage);
> -static void migrate_write_lock(struct zspage *zspage);
> -static void migrate_write_unlock(struct zspage *zspage);
> -
> #ifdef CONFIG_COMPACTION
> static void kick_deferred_free(struct zs_pool *pool);
> static void init_deferred_free(struct zs_pool *pool);
> @@ -1026,7 +1112,9 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> return NULL;
>
> zspage->magic = ZSPAGE_MAGIC;
> - migrate_lock_init(zspage);
> + zspage->pool = pool;
> + zspage->class = class->index;
> + zspage_lock_init(zspage);
>
> for (i = 0; i < class->pages_per_zspage; i++) {
> struct zpdesc *zpdesc;
> @@ -1049,8 +1137,6 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>
> create_page_chain(class, zspage, zpdescs);
> init_zspage(class, zspage);
> - zspage->pool = pool;
> - zspage->class = class->index;
>
> return zspage;
> }
> @@ -1251,7 +1337,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> * zs_unmap_object API so delegate the locking from class to zspage
> * which is smaller granularity.
> */
> - migrate_read_lock(zspage);
> + zspage_read_lock(zspage);
> pool_read_unlock(pool);
>
> class = zspage_class(pool, zspage);
> @@ -1311,7 +1397,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
> }
> local_unlock(&zs_map_area.lock);
>
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> }
> EXPORT_SYMBOL_GPL(zs_unmap_object);
>
> @@ -1705,18 +1791,18 @@ static void lock_zspage(struct zspage *zspage)
> /*
> * Pages we haven't locked yet can be migrated off the list while we're
> * trying to lock them, so we need to be careful and only attempt to
> - * lock each page under migrate_read_lock(). Otherwise, the page we lock
> + * lock each page under zspage_read_lock(). Otherwise, the page we lock
> * may no longer belong to the zspage. This means that we may wait for
> * the wrong page to unlock, so we must take a reference to the page
> - * prior to waiting for it to unlock outside migrate_read_lock().
> + * prior to waiting for it to unlock outside zspage_read_lock().
> */
> while (1) {
> - migrate_read_lock(zspage);
> + zspage_read_lock(zspage);
> zpdesc = get_first_zpdesc(zspage);
> if (zpdesc_trylock(zpdesc))
> break;
> zpdesc_get(zpdesc);
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> zpdesc_wait_locked(zpdesc);
> zpdesc_put(zpdesc);
> }
> @@ -1727,41 +1813,16 @@ static void lock_zspage(struct zspage *zspage)
> curr_zpdesc = zpdesc;
> } else {
> zpdesc_get(zpdesc);
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> zpdesc_wait_locked(zpdesc);
> zpdesc_put(zpdesc);
> - migrate_read_lock(zspage);
> + zspage_read_lock(zspage);
> }
> }
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> }
> #endif /* CONFIG_COMPACTION */
>
> -static void migrate_lock_init(struct zspage *zspage)
> -{
> - rwlock_init(&zspage->lock);
> -}
> -
> -static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock)
> -{
> - read_lock(&zspage->lock);
> -}
> -
> -static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
> -{
> - read_unlock(&zspage->lock);
> -}
> -
> -static void migrate_write_lock(struct zspage *zspage)
> -{
> - write_lock(&zspage->lock);
> -}
> -
> -static void migrate_write_unlock(struct zspage *zspage)
> -{
> - write_unlock(&zspage->lock);
> -}
> -
> #ifdef CONFIG_COMPACTION
>
> static const struct movable_operations zsmalloc_mops;
> @@ -1803,7 +1864,7 @@ static bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> }
>
> static int zs_page_migrate(struct page *newpage, struct page *page,
> - enum migrate_mode mode)
> + enum migrate_mode mode)
> {
> struct zs_pool *pool;
> struct size_class *class;
> @@ -1819,15 +1880,12 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
>
> VM_BUG_ON_PAGE(!zpdesc_is_isolated(zpdesc), zpdesc_page(zpdesc));
>
> - /* We're committed, tell the world that this is a Zsmalloc page. */
> - __zpdesc_set_zsmalloc(newzpdesc);
> -
> /* The page is locked, so this pointer must remain valid */
> zspage = get_zspage(zpdesc);
> pool = zspage->pool;
>
> /*
> - * The pool lock protects the race between zpage migration
> + * The pool->lock protects the race between zpage migration
> * and zs_free.
> */
> pool_write_lock(pool);
> @@ -1837,8 +1895,15 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> * the class lock protects zpage alloc/free in the zspage.
> */
> size_class_lock(class);
> - /* the migrate_write_lock protects zpage access via zs_map_object */
> - migrate_write_lock(zspage);
> + /* the zspage write_lock protects zpage access via zs_map_object */
> + if (!zspage_try_write_lock(zspage)) {
> + size_class_unlock(class);
> + pool_write_unlock(pool);
> + return -EINVAL;
> + }
> +
> + /* We're committed, tell the world that this is a Zsmalloc page. */
> + __zpdesc_set_zsmalloc(newzpdesc);
We used to do this earlier on, before any locks are held. Why is it
moved here?
>
> offset = get_first_obj_offset(zpdesc);
> s_addr = kmap_local_zpdesc(zpdesc);
> @@ -1869,7 +1934,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> */
> pool_write_unlock(pool);
> size_class_unlock(class);
> - migrate_write_unlock(zspage);
> + zspage_write_unlock(zspage);
>
> zpdesc_get(newzpdesc);
> if (zpdesc_zone(newzpdesc) != zpdesc_zone(zpdesc)) {
> @@ -2005,9 +2070,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> if (!src_zspage)
> break;
>
> - migrate_write_lock(src_zspage);
> + if (!zspage_try_write_lock(src_zspage))
> + break;
> +
> migrate_zspage(pool, src_zspage, dst_zspage);
> - migrate_write_unlock(src_zspage);
> + zspage_write_unlock(src_zspage);
>
> fg = putback_zspage(class, src_zspage);
> if (fg == ZS_INUSE_RATIO_0) {
> @@ -2267,7 +2334,9 @@ struct zs_pool *zs_create_pool(const char *name)
> * trigger compaction manually. Thus, ignore return code.
> */
> zs_register_shrinker(pool);
> -
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + lockdep_register_key(&pool->lockdep_key);
> +#endif
> return pool;
>
> err:
> @@ -2304,6 +2373,10 @@ void zs_destroy_pool(struct zs_pool *pool)
> kfree(class);
> }
>
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + lockdep_unregister_key(&pool->lockdep_key);
> +#endif
> +
> destroy_cache(pool);
> kfree(pool->name);
> kfree(pool);
> --
> 2.48.1.502.g6dc24dfdaf-goog
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 01/18] zram: sleepable entry locking
2025-02-12 6:26 ` [PATCH v5 01/18] zram: sleepable entry locking Sergey Senozhatsky
@ 2025-02-13 0:08 ` Andrew Morton
2025-02-13 0:52 ` Sergey Senozhatsky
0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2025-02-13 0:08 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel
On Wed, 12 Feb 2025 15:26:59 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> Concurrent modifications of meta table entries are now handled
> by per-entry spin-lock. This has a number of shortcomings.
>
> First, this imposes atomic requirements on compression backends.
> zram can call both zcomp_compress() and zcomp_decompress() under
> entry spin-lock, which implies that we can use only compression
> algorithms that don't schedule/sleep/wait during compression and
> decompression. This, for instance, makes it impossible to use
> some of the ASYNC compression algorithms (H/W compression, etc.)
> implementations.
>
> Second, this can potentially trigger watchdogs. For example,
> entry re-compression with secondary algorithms is performed
> under entry spin-lock. Given that we chain secondary
> compression algorithms and that some of them can be configured
> for best compression ratio (and worst compression speed) zram
> can stay under spin-lock for quite some time.
>
> Having a per-entry mutex (or, for instance, a rw-semaphore)
> significantly increases sizeof() of each entry and hence the
> meta table. Therefore entry locking returns back to bit
> locking, as before, however, this time also preempt-rt friendly,
> because it waits-on-bit instead of spinning-on-bit. Lock owners
> are also now permitted to schedule, which is a first step on the
> path of making zram non-atomic.
>
> ...
>
> -static int zram_slot_trylock(struct zram *zram, u32 index)
> +static void zram_slot_lock_init(struct zram *zram, u32 index)
> {
> - return spin_trylock(&zram->table[index].lock);
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
> + &zram->table_lockdep_key, 0);
> +#endif
> +}
> +
>
> ...
>
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + lockdep_register_key(&zram->table_lockdep_key);
> +#endif
> +
Please check whether all the ifdefs are needed - some of these things
have CONFIG_LOCKDEP=n stubs.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 00/18] zsmalloc/zram: there be preemption
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (17 preceding siblings ...)
2025-02-12 6:27 ` [PATCH v5 18/18] zram: add might_sleep to zcomp API Sergey Senozhatsky
@ 2025-02-13 0:09 ` Andrew Morton
2025-02-13 0:51 ` Sergey Senozhatsky
18 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2025-02-13 0:09 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel
On Wed, 12 Feb 2025 15:26:58 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> First, zsmalloc is converted to use sleepable RW-"lock" (it's atomic_t
> in fact) for zspage migration protection.
This sentence is stale?
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 00/18] zsmalloc/zram: there be preemption
2025-02-13 0:09 ` [PATCH v5 00/18] zsmalloc/zram: there be preemption Andrew Morton
@ 2025-02-13 0:51 ` Sergey Senozhatsky
0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 0:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Sergey Senozhatsky, Yosry Ahmed, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/12 16:09), Andrew Morton wrote:
> On Wed, 12 Feb 2025 15:26:58 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> > First, zsmalloc is converted to use sleepable RW-"lock" (it's atomic_t
> > in fact) for zspage migration protection.
>
> This sentence is stale?
I'd say it is accurate, zspage has 'atomic_t lock' which is a reader-writer
type of lock (permitting scheduling for readers and forbidding scheduling
for writers).
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 01/18] zram: sleepable entry locking
2025-02-13 0:08 ` Andrew Morton
@ 2025-02-13 0:52 ` Sergey Senozhatsky
2025-02-13 1:42 ` Sergey Senozhatsky
0 siblings, 1 reply; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 0:52 UTC (permalink / raw)
To: Andrew Morton
Cc: Sergey Senozhatsky, Yosry Ahmed, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/12 16:08), Andrew Morton wrote:
> > Concurrent modifications of meta table entries are now handled
> > by per-entry spin-lock. This has a number of shortcomings.
> >
> > First, this imposes atomic requirements on compression backends.
> > zram can call both zcomp_compress() and zcomp_decompress() under
> > entry spin-lock, which implies that we can use only compression
> > algorithms that don't schedule/sleep/wait during compression and
> > decompression. This, for instance, makes it impossible to use
> > some of the ASYNC compression algorithms (H/W compression, etc.)
> > implementations.
> >
> > Second, this can potentially trigger watchdogs. For example,
> > entry re-compression with secondary algorithms is performed
> > under entry spin-lock. Given that we chain secondary
> > compression algorithms and that some of them can be configured
> > for best compression ratio (and worst compression speed) zram
> > can stay under spin-lock for quite some time.
> >
> > Having a per-entry mutex (or, for instance, a rw-semaphore)
> > significantly increases sizeof() of each entry and hence the
> > meta table. Therefore entry locking returns back to bit
> > locking, as before, however, this time also preempt-rt friendly,
> > because it waits-on-bit instead of spinning-on-bit. Lock owners
> > are also now permitted to schedule, which is a first step on the
> > path of making zram non-atomic.
> >
> > ...
> >
> > -static int zram_slot_trylock(struct zram *zram, u32 index)
> > +static void zram_slot_lock_init(struct zram *zram, u32 index)
> > {
> > - return spin_trylock(&zram->table[index].lock);
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
> > + &zram->table_lockdep_key, 0);
> > +#endif
> > +}
> > +
> >
> > ...
> >
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + lockdep_register_key(&zram->table_lockdep_key);
> > +#endif
> > +
>
> Please check whether all the ifdefs are needed - some of these things
> have CONFIG_LOCKDEP=n stubs.
Will do.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 03/18] zram: remove crypto include
2025-02-12 16:13 ` Yosry Ahmed
@ 2025-02-13 0:53 ` Sergey Senozhatsky
0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 0:53 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/12 16:13), Yosry Ahmed wrote:
> On Wed, Feb 12, 2025 at 03:27:01PM +0900, Sergey Senozhatsky wrote:
> > Remove a leftover crypto header include.
>
> The subject and log are not very descriptive imo. We stop using
> CRYPTO_MAX_ALG_NAME and define our own limit in zram, and removing the
> include is just an artifact of that.
Ack.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 10/18] zsmalloc: factor out pool locking helpers
2025-02-12 16:18 ` Yosry Ahmed
2025-02-12 16:19 ` Yosry Ahmed
@ 2025-02-13 0:57 ` Sergey Senozhatsky
2025-02-13 1:12 ` Yosry Ahmed
1 sibling, 1 reply; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 0:57 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/12 16:18), Yosry Ahmed wrote:
> On Wed, Feb 12, 2025 at 03:27:08PM +0900, Sergey Senozhatsky wrote:
> > We currently have a mix of migrate_{read,write}_lock() helpers
> > that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> > access to which is open-coded. Factor out pool migrate locking
> > into helpers, zspage migration locking API will be renamed to
> > reduce confusion.
> >
> > It's worth mentioning that zsmalloc locks sync not only migration,
> > but also compaction.
> >
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
>
> FWIW I don't see a lot of value in the helpers (renaming the lock is
> useful tho).
I want to hide the details, keep them in one place and at some
point *in the future* have the same "locking rules" as for zspage
lock. Also *possibly* throwing a couple of lockdep assertions.
So I'd prefer to abstract all of these.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 02/18] zram: permit preemption with active compression stream
2025-02-12 16:01 ` Yosry Ahmed
@ 2025-02-13 1:04 ` Sergey Senozhatsky
0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 1:04 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/12 16:01), Yosry Ahmed wrote:
> On Wed, Feb 12, 2025 at 03:27:00PM +0900, Sergey Senozhatsky wrote:
> > Currently, per-CPU stream access is done from a non-preemptible
> > (atomic) section, which imposes the same atomicity requirements on
> > compression backends as entry spin-lock, and makes it impossible
> > to use algorithms that can schedule/wait/sleep during compression
> > and decompression.
> >
> > Switch to preemptible per-CPU model, similar to the one used
> > in zswap. Instead of a per-CPU local lock, each stream carries
> > a mutex which is locked throughout entire time zram uses it
> > for compression or decompression, so that cpu-dead event waits
> > for zram to stop using a particular per-CPU stream and release
> > it.
> >
> > Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> > ---
> > drivers/block/zram/zcomp.c | 36 +++++++++++++++++++++++++----------
> > drivers/block/zram/zcomp.h | 6 +++---
> > drivers/block/zram/zram_drv.c | 20 +++++++++----------
> > 3 files changed, 39 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
> > index bb514403e305..e83dd9a80a81 100644
> > --- a/drivers/block/zram/zcomp.c
> > +++ b/drivers/block/zram/zcomp.c
> > @@ -7,6 +7,7 @@
> > #include <linux/wait.h>
> > #include <linux/sched.h>
> > #include <linux/cpu.h>
> > +#include <linux/cpuhotplug.h>
>
> What code changes prompt this?
Just a missing header include. We use cpuhotplug.
I actually think I wanted to replace cpu.h with it.
> > #include <linux/crypto.h>
> > #include <linux/vmalloc.h>
> >
> > @@ -54,6 +55,7 @@ static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
> > {
> > int ret;
> >
> > + mutex_init(&zstrm->lock);
>
> I don't think we can initialize the mutex in the hotplug callback. I
> think the following scenario is possible:
>
> CPU #1 CPU #2
> zcomp_stream_get()
> zstrm = raw_cpu_ptr()
> /* task migrated to CPU 2 */
>
> CPU goes offline
> zcomp_cpu_dead()
> mutex_lock()
> ..
> mutex_unlock()
> /* migrated task continues */
> zcomp_stream_get()
> mutex_lock()
> CPU goes online
> mutex_init()
> mutex_unlock() /* problem */
>
> In this case we'll end up initializing the mutex on CPU #1 while CPU #2
> has it locked. When we unlocked it on CPU #2 we will corrupt it AFAICT.
>
> This is why I moved the mutex initialization out of the hotplug callback
> in zswap. I suspect that to do something similar for zram we'd need
> to do it in zcomp_init()?
Yeah, I think you are right. Let me take a look.
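Something along these lines, perhaps - just a rough sketch, the function
name and error handling here are illustrative and not the actual patch:

static int zcomp_init_streams_sketch(struct zcomp *comp)
{
	int cpu;

	comp->stream = alloc_percpu(struct zcomp_strm);
	if (!comp->stream)
		return -ENOMEM;

	/*
	 * Tie mutex lifetime to the zcomp object rather than to CPU
	 * hotplug state, so that onlining a CPU can never re-init a
	 * mutex which a migrated task still holds.
	 */
	for_each_possible_cpu(cpu)
		mutex_init(&per_cpu_ptr(comp->stream, cpu)->lock);

	return 0;
}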
> > ret = comp->ops->create_ctx(comp->params, &zstrm->ctx);
> > if (ret)
> > return ret;
> > @@ -109,13 +111,29 @@ ssize_t zcomp_available_show(const char *comp, char *buf)
> >
> > struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
> > {
> > - local_lock(&comp->stream->lock);
> > - return this_cpu_ptr(comp->stream);
> > + for (;;) {
> > + struct zcomp_strm *zstrm = raw_cpu_ptr(comp->stream);
> > +
> > + /*
> > + * Inspired by zswap
> > + *
> > + * stream is returned with ->mutex locked which prevents
> > + * cpu_dead() from releasing this stream under us, however
> > + * there is still a race window between raw_cpu_ptr() and
> > + * mutex_lock(), during which we could have been migrated
> > + * to a CPU that has already destroyed its stream. If so
>
> "we could have been migrated from** a CPU that has already destroyed its
> stream"? Right?
"from", "to"... what's the difference :)
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 10/18] zsmalloc: factor out pool locking helpers
2025-02-13 0:57 ` Sergey Senozhatsky
@ 2025-02-13 1:12 ` Yosry Ahmed
2025-02-13 2:54 ` Sergey Senozhatsky
0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2025-02-13 1:12 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
February 12, 2025 at 4:57 PM, "Sergey Senozhatsky" <senozhatsky@chromium.org> wrote:
> On (25/02/12 16:18), Yosry Ahmed wrote:
> > On Wed, Feb 12, 2025 at 03:27:08PM +0900, Sergey Senozhatsky wrote:
> > > We currently have a mix of migrate_{read,write}_lock() helpers
> > > that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> > > access to which is open-coded. Factor out pool migrate locking
> > > into helpers, zspage migration locking API will be renamed to
> > > reduce confusion.
> > >
> > > It's worth mentioning that zsmalloc locks sync not only migration,
> > > but also compaction.
> > >
> > > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> >
> > FWIW I don't see a lot of value in the helpers (renaming the lock is
> > useful tho).
>
> I want to hide the details, keep them in one place and at some
> point *in the future* have the same "locking rules" as for zspage
> lock. Also *possibly* throwing a couple of lockdep assertions.
> So I'd prefer to abstract all of these.
I'd prefer to introduce the abstractions when they are needed tbh. Right
now they just make the code less readable.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
2025-02-12 17:14 ` Yosry Ahmed
@ 2025-02-13 1:20 ` Sergey Senozhatsky
2025-02-13 1:31 ` Yosry Ahmed
0 siblings, 1 reply; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 1:20 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/12 17:14), Yosry Ahmed wrote:
> On Wed, Feb 12, 2025 at 03:27:10PM +0900, Sergey Senozhatsky wrote:
> > Switch over from rwlock_t to an atomic_t variable that takes a negative
> > value when the page is under migration, or positive values when the
> > page is used by zsmalloc users (object map, etc.). Using a rwsem
> > per-zspage is a little too memory heavy, a simple atomic_t should
> > suffice.
>
> We should also explain that rwsem cannot be used due to the locking
> context (we need to hold it in an atomic context). Basically what you
> explained to me before :)
>
> > zspage lock is a leaf lock for zs_map_object(), where it's read-acquired.
> > Since this lock now permits preemption extra care needs to be taken when
> > it is write-acquired - all writers grab it in atomic context, so they
> > cannot spin and wait for (potentially preempted) reader to unlock zspage.
> > There are only two writers at this moment - migration and compaction. In
> > both cases we use write-try-lock and bail out if zspage is read locked.
> > Writers, on the other hand, never get preempted, so readers can spin
> > waiting for the writer to unlock zspage.
>
> The details are important, but I think we want to concisely state the
> problem statement either before or after. Basically we want a lock that
> we *never* sleep while acquiring but *can* sleep while holding in read
> mode. This allows holding the lock from any context, but also being
> preemptible if the context allows it.
Ack.
[..]
> > +/*
> > + * zspage locking rules:
>
> Also here we need to state our key rule:
> Never sleep while acquiring, preemptible while holding (if possible). The
> following rules are basically how we make sure we keep this true.
>
> > + *
> > + * 1) writer-lock is exclusive
> > + *
> > + * 2) writer-lock owner cannot sleep
> > + *
> > + * 3) writer-lock owner cannot spin waiting for the lock
> > + * - caller (e.g. compaction and migration) must check return value and
> > + * handle locking failures
> > + * - there is only TRY variant of writer-lock function
> > + *
> > + * 4) reader-lock owners (multiple) can sleep
> > + *
> > + * 5) reader-lock owners can spin waiting for the lock, in any context
> > + * - existing readers (even preempted ones) don't block new readers
> > + * - writer-lock owners never sleep, always unlock at some point
>
>
> May I suggest something more concise and to the point?
>
> /*
> * The zspage lock can be held from atomic contexts, but it needs to remain
> * preemptible when held for reading because it remains held outside of those
> * atomic contexts, otherwise we unnecessarily lose preemptibility.
> *
> * To achieve this, the following rules are enforced on readers and writers:
> *
> * - Writers are blocked by both writers and readers, while readers are only
> * blocked by writers (i.e. normal rwlock semantics).
> *
> * - Writers are always atomic (to allow readers to spin waiting for them).
> *
> * - Writers always use trylock (as the lock may be held by sleeping readers).
> *
> * - Readers may spin on the lock (as they can only wait for atomic writers).
> *
> * - Readers may sleep while holding the lock (as writes only use trylock).
> */
Looks good, thanks.
> > + */
> > +static void zspage_read_lock(struct zspage *zspage)
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old = atomic_read_acquire(lock);
> > +
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> > +#endif
> > +
> > + do {
> > + if (old == ZS_PAGE_WRLOCKED) {
> > + cpu_relax();
> > + old = atomic_read_acquire(lock);
> > + continue;
> > + }
> > + } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
> > +}
> > +
> > +static void zspage_read_unlock(struct zspage *zspage)
> > +{
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + rwsem_release(&zspage->lockdep_map, _RET_IP_);
> > +#endif
> > + atomic_dec_return_release(&zspage->lock);
> > +}
> > +
> > +static __must_check bool zspage_try_write_lock(struct zspage *zspage)
>
> I believe zspage_write_trylock() would be closer to the normal rwlock
> naming.
It derived its name from rwsem "age". Can rename.
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old = ZS_PAGE_UNLOCKED;
> > +
> > + WARN_ON_ONCE(preemptible());
>
> Hmm I know I may have been the one suggesting this, but do we actually
> need it? We disable preemption explicitly anyway before holding the
> lock.
This is just to make sure that the precondition for
"writer is always atomic" is satisfied. But I can drop it.
> > size_class_lock(class);
> > - /* the migrate_write_lock protects zpage access via zs_map_object */
> > - migrate_write_lock(zspage);
> > + /* the zspage write_lock protects zpage access via zs_map_object */
> > + if (!zspage_try_write_lock(zspage)) {
> > + size_class_unlock(class);
> > + pool_write_unlock(pool);
> > + return -EINVAL;
> > + }
> > +
> > + /* We're committed, tell the world that this is a Zsmalloc page. */
> > + __zpdesc_set_zsmalloc(newzpdesc);
>
> We used to do this earlier on, before any locks are held. Why is it
> moved here?
I want to do that only if zspage write-trylock has succeeded (we didn't
have any error out paths before).
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
2025-02-13 1:20 ` Sergey Senozhatsky
@ 2025-02-13 1:31 ` Yosry Ahmed
2025-02-13 1:53 ` Sergey Senozhatsky
0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2025-02-13 1:31 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
> > > +{
> > > + atomic_t *lock = &zspage->lock;
> > > + int old = ZS_PAGE_UNLOCKED;
> > > +
> > > + WARN_ON_ONCE(preemptible());
> >
> > Hmm I know I may have been the one suggesting this, but do we actually
> > need it? We disable preemption explicitly anyway before holding the
> > lock.
>
> This is just to make sure that the precondition for
> "writer is always atomic" is satisfied. But I can drop it.
Right, but why do we care? Even if the context is not atomic, we disable
preemption and make sure the context stays atomic throughout the lock
critical section.
> > >  	size_class_lock(class);
> > > - /* the migrate_write_lock protects zpage access via zs_map_object */
> > > - migrate_write_lock(zspage);
> > > + /* the zspage write_lock protects zpage access via zs_map_object */
> > > + if (!zspage_try_write_lock(zspage)) {
> > > + size_class_unlock(class);
> > > + pool_write_unlock(pool);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + /* We're committed, tell the world that this is a Zsmalloc page. */
> > > + __zpdesc_set_zsmalloc(newzpdesc);
> >
> > We used to do this earlier on, before any locks are held. Why is it
> > moved here?
>
> I want to do that only if zspage write-trylock has succeeded (we didn't
> have any error out paths before).
Ack.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 01/18] zram: sleepable entry locking
2025-02-13 0:52 ` Sergey Senozhatsky
@ 2025-02-13 1:42 ` Sergey Senozhatsky
2025-02-13 8:49 ` Sergey Senozhatsky
0 siblings, 1 reply; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 1:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
On (25/02/13 09:52), Sergey Senozhatsky wrote:
> > > -static int zram_slot_trylock(struct zram *zram, u32 index)
> > > +static void zram_slot_lock_init(struct zram *zram, u32 index)
> > > {
> > > - return spin_trylock(&zram->table[index].lock);
> > > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > + lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
> > > + &zram->table_lockdep_key, 0);
> > > +#endif
> > > +}
> > > +
> > >
> > > ...
> > >
> > > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > + lockdep_register_key(&zram->table_lockdep_key);
> > > +#endif
> > > +
> >
> > Please check whether all the ifdefs are needed - some of these things
> > have CONFIG_LOCKDEP=n stubs.
The problem is that while functions have LOCKDEP=n stubs, struct members
don't - we still declare table_lockdep_key and lockdep_map only when
DEBUG_LOCK_ALLOC is enabled.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
2025-02-13 1:31 ` Yosry Ahmed
@ 2025-02-13 1:53 ` Sergey Senozhatsky
0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 1:53 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/13 01:31), Yosry Ahmed wrote:
> > > Hmm I know I may have been the one suggesting this, but do we actually
> > > need it? We disable preemption explicitly anyway before holding the
> > > lock.
> >
> > This is just to make sure that the precondition for
> > "writer is always atomic" is satisfied. But I can drop it.
>
> Right, but why do we care?
Oh, not that we care, just wanted extra smoke-detectors. It's gone now.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 10/18] zsmalloc: factor out pool locking helpers
2025-02-13 1:12 ` Yosry Ahmed
@ 2025-02-13 2:54 ` Sergey Senozhatsky
0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 2:54 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Kairui Song, Minchan Kim,
linux-mm, linux-kernel
On (25/02/13 01:12), Yosry Ahmed wrote:
> > I want to hide the details, keep them in one place and at some
> > point *in the future* have the same "locking rules" as for zspage
> > lock. Also *possibly* throwing a couple of lockdep assertions.
> > So I'd prefer to abstract all of these.
>
> I'd prefer to introduce the abstractions when they are needed tbh. Right
> now they just make the code less readable.
OK, gone now. I think I didn't screw things up resolving the conflicts.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 01/18] zram: sleepable entry locking
2025-02-13 1:42 ` Sergey Senozhatsky
@ 2025-02-13 8:49 ` Sergey Senozhatsky
0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 8:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
On (25/02/13 10:42), Sergey Senozhatsky wrote:
> On (25/02/13 09:52), Sergey Senozhatsky wrote:
> > > > -static int zram_slot_trylock(struct zram *zram, u32 index)
> > > > +static void zram_slot_lock_init(struct zram *zram, u32 index)
> > > > {
> > > > - return spin_trylock(&zram->table[index].lock);
> > > > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > > + lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
> > > > + &zram->table_lockdep_key, 0);
> > > > +#endif
> > > > +}
> > > > +
> > > >
> > > > ...
> > > >
> > > > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > > + lockdep_register_key(&zram->table_lockdep_key);
> > > > +#endif
> > > > +
> > >
> > > Please check whether all the ifdefs are needed - some of these things
> > > have CONFIG_LOCKDEP=n stubs.
>
> The problem is that while functions have LOCKDEP=n stubs, struct members
> don't - we still declare table_lockdep_key and lockdep_map only when
> DEBUG_LOCK_ALLOC is enabled.
I rewrote those bits (in zram and in zsmalloc), given that we also
need lock-contended/lock-acquired in various branches, which require
even more ifdef-s. So I factored out debug-enabled variants.
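Roughly along these lines (an illustrative sketch, not the final code):

#ifdef CONFIG_DEBUG_LOCK_ALLOC
static void zram_slot_lock_init(struct zram *zram, u32 index)
{
	lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
			 &zram->table_lockdep_key, 0);
}

static void zram_lock_key_register(struct zram *zram)
{
	lockdep_register_key(&zram->table_lockdep_key);
}

static void zram_lock_key_unregister(struct zram *zram)
{
	lockdep_unregister_key(&zram->table_lockdep_key);
}
#else
static void zram_slot_lock_init(struct zram *zram, u32 index) { }
static void zram_lock_key_register(struct zram *zram) { }
static void zram_lock_key_unregister(struct zram *zram) { }
#endif

That keeps the DEBUG_LOCK_ALLOC-only struct members and all the ifdef-s
in one spot, so the call sites stay clean.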
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
2025-02-12 6:27 ` [PATCH v5 12/18] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
2025-02-12 17:14 ` Yosry Ahmed
@ 2025-02-13 11:32 ` Hillf Danton
2025-02-13 12:29 ` Sergey Senozhatsky
1 sibling, 1 reply; 40+ messages in thread
From: Hillf Danton @ 2025-02-13 11:32 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Kairui Song, Minchan Kim, Andrew Morton, linux-mm,
linux-kernel
On Wed, 12 Feb 2025 15:27:10 +0900 Sergey Senozhatsky
> Switch over from rwlock_t to an atomic_t variable that takes a negative
> value when the page is under migration, or positive values when the
> page is used by zsmalloc users (object map, etc.). Using a rwsem
> per-zspage is a little too memory heavy, a simple atomic_t should
> suffice.
>
> zspage lock is a leaf lock for zs_map_object(), where it's read-acquired.
> Since this lock now permits preemption extra care needs to be taken when
> it is write-acquired - all writers grab it in atomic context, so they
> cannot spin and wait for (potentially preempted) reader to unlock zspage.
> There are only two writers at this moment - migration and compaction. In
> both cases we use write-try-lock and bail out if zspage is read locked.
> Writers, on the other hand, never get preempted, so readers can spin
> waiting for the writer to unlock zspage.
>
> With this we can implement a preemptible object mapping.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> mm/zsmalloc.c | 183 +++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 128 insertions(+), 55 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index c82c24b8e6a4..80261bb78cf8 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -226,6 +226,9 @@ struct zs_pool {
> /* protect page/zspage migration */
> rwlock_t lock;
> atomic_t compaction_in_progress;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + struct lock_class_key lockdep_key;
> +#endif
> };
>
> static void pool_write_unlock(struct zs_pool *pool)
> @@ -292,6 +295,9 @@ static inline void free_zpdesc(struct zpdesc *zpdesc)
> __free_page(page);
> }
>
> +#define ZS_PAGE_UNLOCKED 0
> +#define ZS_PAGE_WRLOCKED -1
> +
> struct zspage {
> struct {
> unsigned int huge:HUGE_BITS;
> @@ -304,7 +310,11 @@ struct zspage {
> struct zpdesc *first_zpdesc;
> struct list_head list; /* fullness list */
> struct zs_pool *pool;
> - rwlock_t lock;
> + atomic_t lock;
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + struct lockdep_map lockdep_map;
> +#endif
> };
>
> struct mapping_area {
> @@ -314,6 +324,88 @@ struct mapping_area {
> enum zs_mapmode vm_mm; /* mapping mode */
> };
>
> +static void zspage_lock_init(struct zspage *zspage)
> +{
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + lockdep_init_map(&zspage->lockdep_map, "zsmalloc-page",
> + &zspage->pool->lockdep_key, 0);
> +#endif
> +
> + atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
> +}
> +
> +/*
> + * zspage locking rules:
> + *
> + * 1) writer-lock is exclusive
> + *
> + * 2) writer-lock owner cannot sleep
> + *
> + * 3) writer-lock owner cannot spin waiting for the lock
> + * - caller (e.g. compaction and migration) must check return value and
> + * handle locking failures
> + * - there is only TRY variant of writer-lock function
> + *
> + * 4) reader-lock owners (multiple) can sleep
> + *
> + * 5) reader-lock owners can spin waiting for the lock, in any context
> + * - existing readers (even preempted ones) don't block new readers
> + * - writer-lock owners never sleep, always unlock at some point
> + */
> +static void zspage_read_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old = atomic_read_acquire(lock);
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> +#endif
> +
> + do {
> + if (old == ZS_PAGE_WRLOCKED) {
> + cpu_relax();
> + old = atomic_read_acquire(lock);
> + continue;
> + }
> + } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
Given mcs_spinlock, inventing spinlock in 2025 sounds no good.
See below for the spinlock version.
> +}
> +
> +static void zspage_read_unlock(struct zspage *zspage)
> +{
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_release(&zspage->lockdep_map, _RET_IP_);
> +#endif
> + atomic_dec_return_release(&zspage->lock);
> +}
> +
> +static __must_check bool zspage_try_write_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old = ZS_PAGE_UNLOCKED;
> +
> + WARN_ON_ONCE(preemptible());
> +
> + preempt_disable();
> + if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_acquire(&zspage->lockdep_map, 0, 1, _RET_IP_);
> +#endif
> + return true;
> + }
> +
> + preempt_enable();
> + return false;
> +}
> +
> +static void zspage_write_unlock(struct zspage *zspage)
> +{
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + rwsem_release(&zspage->lockdep_map, _RET_IP_);
> +#endif
> + atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
> + preempt_enable();
> +}
struct zspage_lock {
spinlock_t lock;
int cnt;
struct lockdep_map lockdep_map;
};
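/*
 * The writer keeps zl->lock held across its whole (atomic) critical
 * section, so zspage_write_trylock() fails only when readers are in
 * (zl->cnt != 0); readers take zl->lock just long enough to adjust
 * cnt and may then sleep with cnt elevated.
 */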
static __must_check bool zspage_write_trylock(struct zspage_lock *zl)
{
spin_lock(&zl->lock);
if (zl->cnt == ZS_PAGE_UNLOCKED) {
// zl->cnt = ZS_PAGE_WRLOCKED;
rwsem_acquire(&zl->lockdep_map, 0, 1, _RET_IP_);
return true;
}
spin_unlock(&zl->lock);
return false;
}
static void zspage_write_unlock(struct zspage_lock *zl)
{
rwsem_release(&zl->lockdep_map, _RET_IP_);
spin_unlock(&zl->lock);
}
static void zspage_read_lock(struct zspage_lock *zl)
{
rwsem_acquire_read(&zl->lockdep_map, 0, 0, _RET_IP_);
spin_lock(&zl->lock);
zl->cnt++;
spin_unlock(&zl->lock);
}
static void zspage_read_unlock(struct zspage_lock *zl)
{
rwsem_release(&zl->lockdep_map, _RET_IP_);
spin_lock(&zl->lock);
zl->cnt--;
spin_unlock(&zl->lock);
}
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/18] zsmalloc: make zspage lock preemptible
2025-02-13 11:32 ` Hillf Danton
@ 2025-02-13 12:29 ` Sergey Senozhatsky
0 siblings, 0 replies; 40+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 12:29 UTC (permalink / raw)
To: Hillf Danton
Cc: Sergey Senozhatsky, Yosry Ahmed, Kairui Song, Minchan Kim,
Andrew Morton, linux-mm, linux-kernel
On (25/02/13 19:32), Hillf Danton wrote:
[..]
> > +static void zspage_read_lock(struct zspage *zspage)
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old = atomic_read_acquire(lock);
> > +
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > + rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> > +#endif
> > +
> > + do {
> > + if (old == ZS_PAGE_WRLOCKED) {
> > + cpu_relax();
> > + old = atomic_read_acquire(lock);
> > + continue;
> > + }
> > + } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
>
> Given mcs_spinlock, inventing spinlock in 2025 sounds no good.
> See below for the spinlock version.
I should have sent this series in 2024, when inventing a spinlock
sounded good :)
> struct zspage_lock {
> spinlock_t lock;
> int cnt;
> struct lockdep_map lockdep_map;
> };
>
> static __must_check bool zspage_write_trylock(struct zspage_lock *zl)
> {
> spin_lock(&zl->lock);
> if (zl->cnt == ZS_PAGE_UNLOCKED) {
> // zl->cnt = ZS_PAGE_WRLOCKED;
> rwsem_acquire(&zl->lockdep_map, 0, 1, _RET_IP_);
> return true;
> }
> spin_unlock(&zl->lock);
> return false;
> }
>
> static void zspage_write_unlock(struct zspage_lock *zl)
> {
> rwsem_release(&zl->lockdep_map, _RET_IP_);
> spin_unlock(&zl->lock);
> }
>
> static void zspage_read_lock(struct zspage_lock *zl)
> {
> rwsem_acquire_read(&zl->lockdep_map, 0, 0, _RET_IP_);
>
> spin_lock(&zl->lock);
> zl->cnt++;
> spin_unlock(&zl->lock);
> }
>
> static void zspage_read_unlock(struct zspage_lock *zl)
> {
> rwsem_release(&zl->lockdep_map, _RET_IP_);
>
> spin_lock(&zl->lock);
> zl->cnt--;
> spin_unlock(&zl->lock);
> }
I see, yeah I can pick it up, thanks. A couple of *minor* things I can
think of. First, in the current implementation I also track LOCK_STAT
(lock-contended/lock-acquired), something like
static inline void __read_lock(struct zspage *zspage)
{
atomic_t *lock = &zspage->lock;
int old = atomic_read_acquire(lock);
rwsem_acquire_read(&zspage->dep_map, 0, 0, _RET_IP_);
do {
if (old == ZS_PAGE_WRLOCKED) {
lock_contended(&zspage->dep_map, _RET_IP_);
cpu_relax();
old = atomic_read_acquire(lock);
continue;
}
} while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
lock_acquired(&zspage->dep_map, _RET_IP_);
}
I'll add lock-stat to zsl, but it's worth mentioning that zsl "splits"
the stats into zsl spin-lock's dep_map and zsl's own dep_map:
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 1 2 6.19 11.61 17.80 8.90
&zspage->zsl.lock: 0 0 0.00 0.00 0.00 0.00 5457 1330106 0.10 118.53 174917.46 0.13
That is, quite likely, fine. One can just add the numbers, I assume.
Second, we'll be carrying around two dep_map-s per-zsl in lockdep builds
now, but, again, that is, likely, not a problem as sizeof(lockdep_map)
isn't too huge (around 48 bytes).
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2025-02-13 12:29 UTC | newest]
Thread overview: 40+ messages
2025-02-12 6:26 [PATCH v5 00/18] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-02-12 6:26 ` [PATCH v5 01/18] zram: sleepable entry locking Sergey Senozhatsky
2025-02-13 0:08 ` Andrew Morton
2025-02-13 0:52 ` Sergey Senozhatsky
2025-02-13 1:42 ` Sergey Senozhatsky
2025-02-13 8:49 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 02/18] zram: permit preemption with active compression stream Sergey Senozhatsky
2025-02-12 16:01 ` Yosry Ahmed
2025-02-13 1:04 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 03/18] zram: remove crypto include Sergey Senozhatsky
2025-02-12 16:13 ` Yosry Ahmed
2025-02-13 0:53 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 04/18] zram: remove max_comp_streams device attr Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 05/18] zram: remove two-staged handle allocation Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 06/18] zram: remove writestall zram_stats member Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 07/18] zram: limit max recompress prio to num_active_comps Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 08/18] zram: filter out recomp targets based on priority Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 09/18] zram: rework recompression loop Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 10/18] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
2025-02-12 16:18 ` Yosry Ahmed
2025-02-12 16:19 ` Yosry Ahmed
2025-02-13 0:57 ` Sergey Senozhatsky
2025-02-13 1:12 ` Yosry Ahmed
2025-02-13 2:54 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 11/18] zsmalloc: factor out size-class " Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 12/18] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
2025-02-12 17:14 ` Yosry Ahmed
2025-02-13 1:20 ` Sergey Senozhatsky
2025-02-13 1:31 ` Yosry Ahmed
2025-02-13 1:53 ` Sergey Senozhatsky
2025-02-13 11:32 ` Hillf Danton
2025-02-13 12:29 ` Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 13/18] zsmalloc: introduce new object mapping API Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 14/18] zram: switch to new zsmalloc " Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 15/18] zram: permit reclaim in zstd custom allocator Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 16/18] zram: do not leak page on recompress_store error path Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 17/18] zram: do not leak page on writeback_store " Sergey Senozhatsky
2025-02-12 6:27 ` [PATCH v5 18/18] zram: add might_sleep to zcomp API Sergey Senozhatsky
2025-02-13 0:09 ` [PATCH v5 00/18] zsmalloc/zram: there be preemption Andrew Morton
2025-02-13 0:51 ` Sergey Senozhatsky