* [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 11:41 ` Hillf Danton
2025-01-31 22:55 ` Andrew Morton
2025-01-31 9:06 ` [PATCHv4 02/17] zram: do not use per-CPU compression streams Sergey Senozhatsky
` (15 subsequent siblings)
16 siblings, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Concurrent modifications of meta table entries are now handled
by a per-entry spin-lock. This has a number of shortcomings.

First, this imposes atomic requirements on compression backends.
zram can call both zcomp_compress() and zcomp_decompress() under
the entry spin-lock, which implies that we can use only compression
algorithms that don't schedule/sleep/wait during compression and
decompression. This, for instance, makes it impossible to use
some of the ASYNC compression algorithm implementations
(H/W compression, etc.).

Second, this can potentially trigger watchdogs. For example,
entry re-compression with secondary algorithms is performed
under the entry spin-lock. Given that we chain secondary
compression algorithms and that some of them can be configured
for best compression ratio (and worst compression speed), zram
can stay under the spin-lock for quite some time.

Do not use per-entry spin-locks and instead convert the lock to an
atomic_t variable which open codes a reader-writer type of lock.
This permits preemption from the slot_lock section and also reduces
the sizeof() of a zram entry when lockdep is enabled.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 126 ++++++++++++++++++++--------------
drivers/block/zram/zram_drv.h | 6 +-
2 files changed, 79 insertions(+), 53 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9f5020b077c5..1c2df2341704 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -58,19 +58,50 @@ static void zram_free_page(struct zram *zram, size_t index);
static int zram_read_from_zspool(struct zram *zram, struct page *page,
u32 index);
-static int zram_slot_trylock(struct zram *zram, u32 index)
+static bool zram_slot_try_write_lock(struct zram *zram, u32 index)
{
- return spin_trylock(&zram->table[index].lock);
+ atomic_t *lock = &zram->table[index].lock;
+ int old = ZRAM_ENTRY_UNLOCKED;
+
+ return atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED);
+}
+
+static void zram_slot_write_lock(struct zram *zram, u32 index)
+{
+ atomic_t *lock = &zram->table[index].lock;
+ int old = atomic_read(lock);
+
+ do {
+ if (old != ZRAM_ENTRY_UNLOCKED) {
+ cond_resched();
+ old = atomic_read(lock);
+ continue;
+ }
+ } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
+}
+
+static void zram_slot_write_unlock(struct zram *zram, u32 index)
+{
+ atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
}
-static void zram_slot_lock(struct zram *zram, u32 index)
+static void zram_slot_read_lock(struct zram *zram, u32 index)
{
- spin_lock(&zram->table[index].lock);
+ atomic_t *lock = &zram->table[index].lock;
+ int old = atomic_read(lock);
+
+ do {
+ if (old == ZRAM_ENTRY_WRLOCKED) {
+ cond_resched();
+ old = atomic_read(lock);
+ continue;
+ }
+ } while (!atomic_try_cmpxchg(lock, &old, old + 1));
}
-static void zram_slot_unlock(struct zram *zram, u32 index)
+static void zram_slot_read_unlock(struct zram *zram, u32 index)
{
- spin_unlock(&zram->table[index].lock);
+ atomic_dec(&zram->table[index].lock);
}
static inline bool init_done(struct zram *zram)
@@ -93,7 +124,6 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
zram->table[index].handle = handle;
}
-/* flag operations require table entry bit_spin_lock() being held */
static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
@@ -229,9 +259,9 @@ static void release_pp_slot(struct zram *zram, struct zram_pp_slot *pps)
{
list_del_init(&pps->entry);
- zram_slot_lock(zram, pps->index);
+ zram_slot_write_lock(zram, pps->index);
zram_clear_flag(zram, pps->index, ZRAM_PP_SLOT);
- zram_slot_unlock(zram, pps->index);
+ zram_slot_write_unlock(zram, pps->index);
kfree(pps);
}
@@ -394,11 +424,11 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
*
* And ZRAM_WB slots simply cannot be ZRAM_IDLE.
*/
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
if (!zram_allocated(zram, index) ||
zram_test_flag(zram, index, ZRAM_WB) ||
zram_test_flag(zram, index, ZRAM_SAME)) {
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
continue;
}
@@ -410,7 +440,7 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
zram_set_flag(zram, index, ZRAM_IDLE);
else
zram_clear_flag(zram, index, ZRAM_IDLE);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
}
@@ -709,7 +739,7 @@ static int scan_slots_for_writeback(struct zram *zram, u32 mode,
INIT_LIST_HEAD(&pps->entry);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
@@ -731,7 +761,7 @@ static int scan_slots_for_writeback(struct zram *zram, u32 mode,
place_pp_slot(zram, ctl, pps);
pps = NULL;
next:
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
kfree(pps);
@@ -822,7 +852,7 @@ static ssize_t writeback_store(struct device *dev,
}
index = pps->index;
- zram_slot_lock(zram, index);
+ zram_slot_read_lock(zram, index);
/*
* scan_slots() sets ZRAM_PP_SLOT and relases slot lock, so
* slots can change in the meantime. If slots are accessed or
@@ -833,7 +863,7 @@ static ssize_t writeback_store(struct device *dev,
goto next;
if (zram_read_from_zspool(zram, page, index))
goto next;
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
bio_init(&bio, zram->bdev, &bio_vec, 1,
REQ_OP_WRITE | REQ_SYNC);
@@ -860,7 +890,7 @@ static ssize_t writeback_store(struct device *dev,
}
atomic64_inc(&zram->stats.bd_writes);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
/*
* Same as above, we release slot lock during writeback so
* slot can change under us: slot_free() or slot_free() and
@@ -882,7 +912,7 @@ static ssize_t writeback_store(struct device *dev,
zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
spin_unlock(&zram->wb_limit_lock);
next:
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
release_pp_slot(zram, pps);
cond_resched();
@@ -1001,7 +1031,7 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
for (index = *ppos; index < nr_pages; index++) {
int copied;
- zram_slot_lock(zram, index);
+ zram_slot_read_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
@@ -1019,13 +1049,13 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
ZRAM_INCOMPRESSIBLE) ? 'n' : '.');
if (count <= copied) {
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
break;
}
written += copied;
count -= copied;
next:
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
*ppos += 1;
}
@@ -1473,15 +1503,11 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
huge_class_size = zs_huge_class_size(zram->mem_pool);
for (index = 0; index < num_pages; index++)
- spin_lock_init(&zram->table[index].lock);
+ atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+
return true;
}
-/*
- * To protect concurrent access to the same index entry,
- * caller should hold this table index entry's bit_spinlock to
- * indicate this index entry is accessing.
- */
static void zram_free_page(struct zram *zram, size_t index)
{
unsigned long handle;
@@ -1602,17 +1628,17 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
{
int ret;
- zram_slot_lock(zram, index);
+ zram_slot_read_lock(zram, index);
if (!zram_test_flag(zram, index, ZRAM_WB)) {
/* Slot should be locked through out the function call */
ret = zram_read_from_zspool(zram, page, index);
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
} else {
/*
* The slot should be unlocked before reading from the backing
* device.
*/
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
ret = read_from_bdev(zram, page, zram_get_handle(zram, index),
parent);
@@ -1655,10 +1681,10 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
static int write_same_filled_page(struct zram *zram, unsigned long fill,
u32 index)
{
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_set_flag(zram, index, ZRAM_SAME);
zram_set_handle(zram, index, fill);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
atomic64_inc(&zram->stats.same_pages);
atomic64_inc(&zram->stats.pages_stored);
@@ -1693,11 +1719,11 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
kunmap_local(src);
zs_unmap_object(zram->mem_pool, handle);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_set_flag(zram, index, ZRAM_HUGE);
zram_set_handle(zram, index, handle);
zram_set_obj_size(zram, index, PAGE_SIZE);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
atomic64_add(PAGE_SIZE, &zram->stats.compr_data_size);
atomic64_inc(&zram->stats.huge_pages);
@@ -1718,9 +1744,9 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
bool same_filled;
/* First, free memory allocated to this slot (if any) */
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_free_page(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
mem = kmap_local_page(page);
same_filled = page_same_filled(mem, &element);
@@ -1790,10 +1816,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
zs_unmap_object(zram->mem_pool, handle);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_set_handle(zram, index, handle);
zram_set_obj_size(zram, index, comp_len);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
/* Update stats */
atomic64_inc(&zram->stats.pages_stored);
@@ -1850,7 +1876,7 @@ static int scan_slots_for_recompress(struct zram *zram, u32 mode,
INIT_LIST_HEAD(&pps->entry);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
@@ -1871,7 +1897,7 @@ static int scan_slots_for_recompress(struct zram *zram, u32 mode,
place_pp_slot(zram, ctl, pps);
pps = NULL;
next:
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
kfree(pps);
@@ -2162,7 +2188,7 @@ static ssize_t recompress_store(struct device *dev,
if (!num_recomp_pages)
break;
- zram_slot_lock(zram, pps->index);
+ zram_slot_write_lock(zram, pps->index);
if (!zram_test_flag(zram, pps->index, ZRAM_PP_SLOT))
goto next;
@@ -2170,7 +2196,7 @@ static ssize_t recompress_store(struct device *dev,
&num_recomp_pages, threshold,
prio, prio_max);
next:
- zram_slot_unlock(zram, pps->index);
+ zram_slot_write_unlock(zram, pps->index);
release_pp_slot(zram, pps);
if (err) {
@@ -2217,9 +2243,9 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio)
}
while (n >= PAGE_SIZE) {
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_free_page(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
atomic64_inc(&zram->stats.notify_free);
index++;
n -= PAGE_SIZE;
@@ -2248,9 +2274,9 @@ static void zram_bio_read(struct zram *zram, struct bio *bio)
}
flush_dcache_page(bv.bv_page);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_accessed(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
bio_advance_iter_single(bio, &iter, bv.bv_len);
} while (iter.bi_size);
@@ -2278,9 +2304,9 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
break;
}
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_accessed(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
bio_advance_iter_single(bio, &iter, bv.bv_len);
} while (iter.bi_size);
@@ -2321,13 +2347,13 @@ static void zram_slot_free_notify(struct block_device *bdev,
zram = bdev->bd_disk->private_data;
atomic64_inc(&zram->stats.notify_free);
- if (!zram_slot_trylock(zram, index)) {
+ if (!zram_slot_try_write_lock(zram, index)) {
atomic64_inc(&zram->stats.miss_free);
return;
}
zram_free_page(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
static void zram_comp_params_reset(struct zram *zram)
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index db78d7c01b9a..e20538cdf565 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -28,7 +28,6 @@
#define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
-
/*
* ZRAM is mainly used for memory efficiency so we want to keep memory
* footprint small and thus squeeze size and zram pageflags into a flags
@@ -58,13 +57,14 @@ enum zram_pageflags {
__NR_ZRAM_PAGEFLAGS,
};
-/*-- Data structures */
+#define ZRAM_ENTRY_UNLOCKED 0
+#define ZRAM_ENTRY_WRLOCKED (-1)
/* Allocated for each disk page */
struct zram_table_entry {
unsigned long handle;
unsigned int flags;
- spinlock_t lock;
+ atomic_t lock;
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
ktime_t ac_time;
#endif
--
2.48.1.362.g079036d154-goog
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-01-31 9:06 ` [PATCHv4 01/17] zram: switch to non-atomic entry locking Sergey Senozhatsky
@ 2025-01-31 11:41 ` Hillf Danton
2025-02-03 3:21 ` Sergey Senozhatsky
2025-01-31 22:55 ` Andrew Morton
1 sibling, 1 reply; 73+ messages in thread
From: Hillf Danton @ 2025-01-31 11:41 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky
> Concurrent modifications of meta table entries is now handled
> by per-entry spin-lock. This has a number of shortcomings.
>
> First, this imposes atomic requirements on compression backends.
> zram can call both zcomp_compress() and zcomp_decompress() under
> entry spin-lock, which implies that we can use only compression
> algorithms that don't schedule/sleep/wait during compression and
> decompression. This, for instance, makes it impossible to use
> some of the ASYNC compression algorithms (H/W compression, etc.)
> implementations.
>
> Second, this can potentially trigger watchdogs. For example,
> entry re-compression with secondary algorithms is performed
> under entry spin-lock. Given that we chain secondary
> compression algorithms and that some of them can be configured
> for best compression ratio (and worst compression speed) zram
> can stay under spin-lock for quite some time.
>
> Do not use per-entry spin-locks and instead convert it to an
> atomic_t variable which open codes reader-writer type of lock.
> This permits preemption from slot_lock section, also reduces
> the sizeof() zram entry when lockdep is enabled.
>
Nope, the price of the cut in size will be paid in extra hours of
debugging, given nothing is free.
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
> drivers/block/zram/zram_drv.c | 126 ++++++++++++++++++++--------------
> drivers/block/zram/zram_drv.h | 6 +-
> 2 files changed, 79 insertions(+), 53 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 9f5020b077c5..1c2df2341704 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -58,19 +58,50 @@ static void zram_free_page(struct zram *zram, size_t index);
> static int zram_read_from_zspool(struct zram *zram, struct page *page,
> u32 index);
>
> -static int zram_slot_trylock(struct zram *zram, u32 index)
> +static bool zram_slot_try_write_lock(struct zram *zram, u32 index)
> {
> - return spin_trylock(&zram->table[index].lock);
> + atomic_t *lock = &zram->table[index].lock;
> + int old = ZRAM_ENTRY_UNLOCKED;
> +
> + return atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED);
> +}
> +
> +static void zram_slot_write_lock(struct zram *zram, u32 index)
> +{
> + atomic_t *lock = &zram->table[index].lock;
> + int old = atomic_read(lock);
> +
> + do {
> + if (old != ZRAM_ENTRY_UNLOCKED) {
> + cond_resched();
> + old = atomic_read(lock);
> + continue;
> + }
> + } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> +}
> +
> +static void zram_slot_write_unlock(struct zram *zram, u32 index)
> +{
> + atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
> }
>
> -static void zram_slot_lock(struct zram *zram, u32 index)
> +static void zram_slot_read_lock(struct zram *zram, u32 index)
> {
> - spin_lock(&zram->table[index].lock);
> + atomic_t *lock = &zram->table[index].lock;
> + int old = atomic_read(lock);
> +
> + do {
> + if (old == ZRAM_ENTRY_WRLOCKED) {
> + cond_resched();
> + old = atomic_read(lock);
> + continue;
> + }
> + } while (!atomic_try_cmpxchg(lock, &old, old + 1));
> }
>
> -static void zram_slot_unlock(struct zram *zram, u32 index)
> +static void zram_slot_read_unlock(struct zram *zram, u32 index)
> {
> - spin_unlock(&zram->table[index].lock);
> + atomic_dec(&zram->table[index].lock);
> }
>
Given that no boundaries of the locking section are marked in addition to
lockdep, this is another usual case of inventing a lock in 2025.

What sense could be made of exercising molar tooth extraction in the
kitchen because of pain after downing a pint of vodka, instead of
directly driving to see your dentist?
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-01-31 11:41 ` Hillf Danton
@ 2025-02-03 3:21 ` Sergey Senozhatsky
2025-02-03 3:52 ` Sergey Senozhatsky
2025-02-03 12:39 ` Sergey Senozhatsky
0 siblings, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 3:21 UTC (permalink / raw)
To: Hillf Danton
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/01/31 19:41), Hillf Danton wrote:
> On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky
> > Concurrent modifications of meta table entries is now handled
> > by per-entry spin-lock. This has a number of shortcomings.
> >
> > First, this imposes atomic requirements on compression backends.
> > zram can call both zcomp_compress() and zcomp_decompress() under
> > entry spin-lock, which implies that we can use only compression
> > algorithms that don't schedule/sleep/wait during compression and
> > decompression. This, for instance, makes it impossible to use
> > some of the ASYNC compression algorithms (H/W compression, etc.)
> > implementations.
> >
> > Second, this can potentially trigger watchdogs. For example,
> > entry re-compression with secondary algorithms is performed
> > under entry spin-lock. Given that we chain secondary
> > compression algorithms and that some of them can be configured
> > for best compression ratio (and worst compression speed) zram
> > can stay under spin-lock for quite some time.
> >
> > Do not use per-entry spin-locks and instead convert it to an
> > atomic_t variable which open codes reader-writer type of lock.
> > This permits preemption from slot_lock section, also reduces
> > the sizeof() zram entry when lockdep is enabled.
> >
> Nope, the price of cut in size will be paid by extra hours in debugging,
> given nothing is free.
This has been a bit-spin-lock basically forever, until late last
year when it was switched to a spinlock, for reasons unrelated
to debugging (as far as I understand it). See 9518e5bfaae19 (zram:
Replace bit spinlocks with a spinlock_t).
> > -static void zram_slot_unlock(struct zram *zram, u32 index)
> > +static void zram_slot_read_unlock(struct zram *zram, u32 index)
> > {
> > - spin_unlock(&zram->table[index].lock);
> > + atomic_dec(&zram->table[index].lock);
> > }
> >
> Given no boundaries of locking section marked in addition to lockdep,
> this is another usual case of inventing lock in 2025.
So zram entry has been memory-saving driver, pretty much always, and not
debug-ability driver, I'm afraid.
Would lockdep additional keep the dentist away? (so to speak)
---
drivers/block/zram/zram_drv.c | 38 ++++++++++++++++++++++++--
drivers/block/zram/zram_drv.h | 3 ++
| Bin 90112 -> 0 bytes
3 files changed, 39 insertions(+), 2 deletions(-)
delete mode 100755 scripts/selinux/genheaders/genheaders
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f85502ae7dce..165b50927d13 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -59,12 +59,30 @@ static void zram_free_page(struct zram *zram, size_t index);
static int zram_read_from_zspool(struct zram *zram, struct page *page,
u32 index);
+static void zram_slot_lock_init(struct zram *zram, u32 index)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ static struct lock_class_key key;
+
+ lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry", &key,
+ 0);
+#endif
+
+ atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+}
+
static bool zram_slot_try_write_lock(struct zram *zram, u32 index)
{
atomic_t *lock = &zram->table[index].lock;
int old = ZRAM_ENTRY_UNLOCKED;
- return atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED);
+ if (atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED)) {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+#endif
+ return true;
+ }
+ return false;
}
static void zram_slot_write_lock(struct zram *zram, u32 index)
@@ -79,11 +97,19 @@ static void zram_slot_write_lock(struct zram *zram, u32 index)
continue;
}
} while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+#endif
}
static void zram_slot_write_unlock(struct zram *zram, u32 index)
{
atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_release(&zram->table[index].lockdep_map, _RET_IP_);
+#endif
}
static void zram_slot_read_lock(struct zram *zram, u32 index)
@@ -98,11 +124,19 @@ static void zram_slot_read_lock(struct zram *zram, u32 index)
continue;
}
} while (!atomic_try_cmpxchg(lock, &old, old + 1));
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire_read(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+#endif
}
static void zram_slot_read_unlock(struct zram *zram, u32 index)
{
atomic_dec(&zram->table[index].lock);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_release(&zram->table[index].lockdep_map, _RET_IP_);
+#endif
}
static inline bool init_done(struct zram *zram)
@@ -1482,7 +1516,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
huge_class_size = zs_huge_class_size(zram->mem_pool);
for (index = 0; index < num_pages; index++)
- atomic_set(&zram->table[index].lock, ZRAM_ENTRY_UNLOCKED);
+ zram_slot_lock_init(zram, index);
return true;
}
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 219d405fc26e..86d1e412f9c7 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -64,6 +64,9 @@ struct zram_table_entry {
unsigned long handle;
unsigned int flags;
atomic_t lock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map lockdep_map;
+#endif
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
ktime_t ac_time;
#endif
diff --git a/scripts/selinux/genheaders/genheaders b/scripts/selinux/genheaders/genheaders
deleted file mode 100755
index 3fc32a664a7930b12a38d02449aec78d49690dfe..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001
--
2.48.1.362.g079036d154-goog
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-03 3:21 ` Sergey Senozhatsky
@ 2025-02-03 3:52 ` Sergey Senozhatsky
2025-02-03 12:39 ` Sergey Senozhatsky
1 sibling, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 3:52 UTC (permalink / raw)
To: Hillf Danton
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
On (25/02/03 12:21), Sergey Senozhatsky wrote:
> So zram entry has been memory-saving driver, pretty much always, and not
> debug-ability driver, I'm afraid.
s/driver/driven/
>
> Would lockdep additional keep the dentist away? (so to speak)
addition
[..]
> diff --git a/scripts/selinux/genheaders/genheaders b/scripts/selinux/genheaders/genheaders
> deleted file mode 100755
> index 3fc32a664a7930b12a38d02449aec78d49690dfe..0000000000000000000000000000000000000000
> GIT binary patch
> literal 0
> HcmV?d00001
Sorry, was not meant to be there, no idea what that is but I get
that file after every `make` now.
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-03 3:21 ` Sergey Senozhatsky
2025-02-03 3:52 ` Sergey Senozhatsky
@ 2025-02-03 12:39 ` Sergey Senozhatsky
1 sibling, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 12:39 UTC (permalink / raw)
To: Hillf Danton
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
On (25/02/03 12:21), Sergey Senozhatsky wrote:
> On (25/01/31 19:41), Hillf Danton wrote:
> > On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky
> > > Concurrent modifications of meta table entries is now handled
> > > by per-entry spin-lock. This has a number of shortcomings.
> > >
> > > First, this imposes atomic requirements on compression backends.
> > > zram can call both zcomp_compress() and zcomp_decompress() under
> > > entry spin-lock, which implies that we can use only compression
> > > algorithms that don't schedule/sleep/wait during compression and
> > > decompression. This, for instance, makes it impossible to use
> > > some of the ASYNC compression algorithms (H/W compression, etc.)
> > > implementations.
> > >
> > > Second, this can potentially trigger watchdogs. For example,
> > > entry re-compression with secondary algorithms is performed
> > > under entry spin-lock. Given that we chain secondary
> > > compression algorithms and that some of them can be configured
> > > for best compression ratio (and worst compression speed) zram
> > > can stay under spin-lock for quite some time.
> > >
> > > Do not use per-entry spin-locks and instead convert it to an
> > > atomic_t variable which open codes reader-writer type of lock.
> > > This permits preemption from slot_lock section, also reduces
> > > the sizeof() zram entry when lockdep is enabled.
> > >
> > Nope, the price of cut in size will be paid by extra hours in debugging,
> > given nothing is free.
>
> This has been a bit-spin-lock basically forever, until late last
> year when it was switched to a spinlock, for reasons unrelated
> to debugging (as far as I understand it). See 9518e5bfaae19 (zram:
> Replace bit spinlocks with a spinlock_t).
Just want to clarify a little:

That "also reduces sizeof()" thing was added last minute (I think before
sending v4 out) and it was not the intention of this patch. I just recalled
that the sizeof() of a zram entry under lockdep was brought up by the
linux-rt folks when they discussed the patch that converted the zram entry
bit-spinlock into a spinlock, and then I just put that line in.
> > > -static void zram_slot_unlock(struct zram *zram, u32 index)
> > > +static void zram_slot_read_unlock(struct zram *zram, u32 index)
> > > {
> > > - spin_unlock(&zram->table[index].lock);
> > > + atomic_dec(&zram->table[index].lock);
> > > }
> > >
> > Given no boundaries of locking section marked in addition to lockdep,
> > this is another usual case of inventing lock in 2025.
>
> So zram entry has been memory-saving driver, pretty much always, and not
> debug-ability driver, I'm afraid.
Before the zram per-entry bit-spinlock there was a zram table rwlock that
protected all zram meta table entries. And before that there was a
per-zram device rwsem that synchronized all operations and protected
the entire zram meta table, so zram was fully preemptible back then.
Kind of interesting - that's what I want it to be now again.
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-01-31 9:06 ` [PATCHv4 01/17] zram: switch to non-atomic entry locking Sergey Senozhatsky
2025-01-31 11:41 ` Hillf Danton
@ 2025-01-31 22:55 ` Andrew Morton
2025-02-03 3:26 ` Sergey Senozhatsky
2025-02-06 7:01 ` Sergey Senozhatsky
1 sibling, 2 replies; 73+ messages in thread
From: Andrew Morton @ 2025-01-31 22:55 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Minchan Kim, linux-mm, linux-kernel
On Fri, 31 Jan 2025 18:06:00 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> +static void zram_slot_write_lock(struct zram *zram, u32 index)
> +{
> + atomic_t *lock = &zram->table[index].lock;
> + int old = atomic_read(lock);
> +
> + do {
> + if (old != ZRAM_ENTRY_UNLOCKED) {
> + cond_resched();
> + old = atomic_read(lock);
> + continue;
> + }
> + } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> +}
I expect that if the calling userspace process has realtime policy (eg
SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
and this becomes a busy loop. And if the machine is single-CPU, the
loop is infinite.
I do agree that for inventing new locking schemes, the bar is set
really high.
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-01-31 22:55 ` Andrew Morton
@ 2025-02-03 3:26 ` Sergey Senozhatsky
2025-02-03 7:11 ` Sergey Senozhatsky
2025-02-04 0:19 ` Andrew Morton
2025-02-06 7:01 ` Sergey Senozhatsky
1 sibling, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 3:26 UTC (permalink / raw)
To: Andrew Morton; +Cc: Sergey Senozhatsky, Minchan Kim, linux-mm, linux-kernel
On (25/01/31 14:55), Andrew Morton wrote:
> > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > +{
> > + atomic_t *lock = &zram->table[index].lock;
> > + int old = atomic_read(lock);
> > +
> > + do {
> > + if (old != ZRAM_ENTRY_UNLOCKED) {
> > + cond_resched();
> > + old = atomic_read(lock);
> > + continue;
> > + }
> > + } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > +}
>
> I expect that if the calling userspace process has realtime policy (eg
> SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> and this becomes a busy loop. And if the machine is single-CPU, the
> loop is infinite.
So for that scenario to happen zram needs to see two writes() to the same
index (page) simultaneously? Or read() and write() on the same index (page)
concurrently?
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-03 3:26 ` Sergey Senozhatsky
@ 2025-02-03 7:11 ` Sergey Senozhatsky
2025-02-03 7:33 ` Sergey Senozhatsky
2025-02-04 0:19 ` Andrew Morton
1 sibling, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 7:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Sergey Senozhatsky, Minchan Kim, linux-mm, linux-kernel
On (25/02/03 12:26), Sergey Senozhatsky wrote:
> On (25/01/31 14:55), Andrew Morton wrote:
> > > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > > +{
> > > + atomic_t *lock = &zram->table[index].lock;
> > > + int old = atomic_read(lock);
> > > +
> > > + do {
> > > + if (old != ZRAM_ENTRY_UNLOCKED) {
> > > + cond_resched();
> > > + old = atomic_read(lock);
> > > + continue;
> > > + }
> > > + } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > > +}
> >
> > I expect that if the calling userspace process has realtime policy (eg
> > SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> > and this becomes a busy loop. And if the machine is single-CPU, the
> > loop is infinite.
>
> So for that scenario to happen zram needs to see two writes() to the same
> index (page) simultaneously? Or read() and write() on the same index (page)
> concurrently?
Just to add more details:
1) zram always works with only one particular zram entry index, which
is provided by an upper layer (e.g. bi_sector)
2) for read

     read()
       zram_read(page, index)
         rlock entry[index]
         decompress entry zshandle page
         runlock entry[index]

   for write

     write()
       zram_write(page, index)
         len = compress page obj
         handle = zsmalloc len
         wlock entry[index]
         entry.handle = handle
         entry.len = len
         wunlock entry[index]
3) at no point does zram lock more than one entry index
a) there is no entry cross-locking (entries are not hierarchical)
b) there is no entry lock nesting (including recursion)
I guess where we actually need the zram entry lock is writeback and
recompression. Writeback moves an object from the zsmalloc pool to actual
physical storage, freeing the zsmalloc memory after that and setting the
zram entry[index] handle to the backing device's block idx, which
needs synchronization. Recompression does a similar thing: it frees
the old zsmalloc handle and stores the recompressed object under a new
zsmalloc handle, thus updating the zram entry[index] handle to point
to the new location, which needs to be synchronized.
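
To make that synchronization point concrete, here is a rough sketch of the
writeback critical section (not the actual writeback_store() code - the
ZRAM_PP_SLOT checks and error handling are omitted, and blk_idx is just an
illustrative name for the backing device block index):

        zram_slot_write_lock(zram, index);
        /*
         * The old zsmalloc handle is freed and the entry is re-pointed to
         * the backing device block under the same write lock, so nothing
         * can observe a half-updated entry.
         */
        zram_free_page(zram, index);
        zram_set_flag(zram, index, ZRAM_WB);
        zram_set_handle(zram, index, blk_idx);
        zram_slot_write_unlock(zram, index);

Recompression needs the same kind of critical section around the handle
switch, just with a new zsmalloc handle instead of a block index.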
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-03 7:11 ` Sergey Senozhatsky
@ 2025-02-03 7:33 ` Sergey Senozhatsky
0 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 7:33 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/02/03 16:11), Sergey Senozhatsky wrote:
> I guess where we actually need zram entry lock is writeback and
> recompression. Writeback moves object from zsmalloc pool to actual
> physical storage, freeing zsmalloc memory after that and setting
> zram entry[index] handle to the backikng device's block idx, which
> needs synchronization. Recompression does a similar thing, it frees
> the old zsmalloc handle and stores recompressed objects under new
> zsmalloc handle, it thus updates zram entry[index] handle to point
> to the new location, which needs to be synchronized.
... Luckily there is a trivial solution
---
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 402b7b175863..dd7c5ae91cc0 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
config ZRAM
tristate "Compressed RAM block device support"
- depends on BLOCK && SYSFS && MMU
+ depends on BLOCK && SYSFS && MMU && !PREEMPT
select ZSMALLOC
help
Creates virtual block devices called /dev/zramX (X = 0, 1, ...).
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-03 3:26 ` Sergey Senozhatsky
2025-02-03 7:11 ` Sergey Senozhatsky
@ 2025-02-04 0:19 ` Andrew Morton
2025-02-04 4:22 ` Sergey Senozhatsky
1 sibling, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2025-02-04 0:19 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Minchan Kim, linux-mm, linux-kernel
On Mon, 3 Feb 2025 12:26:12 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> On (25/01/31 14:55), Andrew Morton wrote:
> > > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > > +{
> > > + atomic_t *lock = &zram->table[index].lock;
> > > + int old = atomic_read(lock);
> > > +
> > > + do {
> > > + if (old != ZRAM_ENTRY_UNLOCKED) {
> > > + cond_resched();
> > > + old = atomic_read(lock);
> > > + continue;
> > > + }
> > > + } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > > +}
> >
> > I expect that if the calling userspace process has realtime policy (eg
> > SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> > and this becomes a busy loop. And if the machine is single-CPU, the
> > loop is infinite.
>
> So for that scenario to happen zram needs to see two writes() to the same
> index (page) simultaneously? Or read() and write() on the same index (page)
> concurrently?
Well, my point is that in the contended case, this "lock" operation can
get stuck forever. If there are no contended cases, we don't need a
lock!
And I don't see how disabling the feature if PREEMPT=y will avoid this
situation. cond_resched() won't schedule away from a realtime task to
a non-realtime one - a policy which isn't related to preemption.
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-04 0:19 ` Andrew Morton
@ 2025-02-04 4:22 ` Sergey Senozhatsky
0 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-04 4:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Sergey Senozhatsky, Minchan Kim, linux-mm, linux-kernel
On (25/02/03 16:19), Andrew Morton wrote:
> > On (25/01/31 14:55), Andrew Morton wrote:
> > > > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > > > +{
> > > > + atomic_t *lock = &zram->table[index].lock;
> > > > + int old = atomic_read(lock);
> > > > +
> > > > + do {
> > > > + if (old != ZRAM_ENTRY_UNLOCKED) {
> > > > + cond_resched();
> > > > + old = atomic_read(lock);
> > > > + continue;
> > > > + }
> > > > + } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > > > +}
> > >
> > > I expect that if the calling userspace process has realtime policy (eg
> > > SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> > > and this becomes a busy loop. And if the machine is single-CPU, the
> > > loop is infinite.
> >
> > So for that scenario to happen zram needs to see two writes() to the same
> > index (page) simultaneously? Or read() and write() on the same index (page)
> > concurrently?
>
> Well, my point is that in the contended case, this "lock" operation can
> get stuck forever. If there are no contended cases, we don't need a
> lock!
Let me see if I can come up with something; I don't have an awful
lot of ideas right now.
> And I don't see how disabling the feature if PREEMPT=y will avoid this
Oh, that was a silly joke: the series that enables preemption in zram
and zsmalloc ends up disabling PREEMPT.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-01-31 22:55 ` Andrew Morton
2025-02-03 3:26 ` Sergey Senozhatsky
@ 2025-02-06 7:01 ` Sergey Senozhatsky
2025-02-06 7:38 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 7:01 UTC (permalink / raw)
To: Andrew Morton
Cc: Sergey Senozhatsky, Minchan Kim, linux-mm, linux-kernel,
Hillf Danton, Sebastian Andrzej Siewior, Mike Galbraith
Cc-ing Sebastian and Mike
On (25/01/31 14:55), Andrew Morton wrote:
> > +static void zram_slot_write_lock(struct zram *zram, u32 index)
> > +{
> > + atomic_t *lock = &zram->table[index].lock;
> > + int old = atomic_read(lock);
> > +
> > + do {
> > + if (old != ZRAM_ENTRY_UNLOCKED) {
> > + cond_resched();
> > + old = atomic_read(lock);
> > + continue;
> > + }
> > + } while (!atomic_try_cmpxchg(lock, &old, ZRAM_ENTRY_WRLOCKED));
> > +}
>
> I expect that if the calling userspace process has realtime policy (eg
> SCHED_FIFO) then the cond_resched() won't schedule SCHED_NORMAL tasks
> and this becomes a busy loop. And if the machine is single-CPU, the
> loop is infinite.
>
> I do agree that for inventing new locking schemes, the bar is set
> really high.
So I completely reworked this bit and we don't have that problem
anymore, nor the problem of "inventing locking schemes in 2025".
In short - we are returning to bit-locks, which is what zram had been using
until commit 9518e5bfaae19 ("zram: Replace bit spinlocks with a spinlock_t"),
but not bit-spinlocks this time around (those won't work with linux-rt) -
wait_on_bit() and friends instead. The entry lock is exclusive, just like
before, but the lock owner can sleep now; any task wishing to lock that same
entry will wait to be woken up by the current lock owner once it unlocks the
entry. For cases when the lock is taken from atomic context (e.g. slot-free
notification from softirq) we continue using the TRY lock, which was
introduced by commit 3c9959e025472 ("zram: fix lockdep warning of free block
handling"), so there's nothing new here. In addition I added some lockdep
annotations, just to be safe.
There shouldn't be too many tasks competing for the same entry. I can
only think of cases when read/write (or slot-free notification if zram
is used as a swap device) vs. writeback or recompression (we cannot have
writeback and recompression simultaneously).
It currently looks like this:
---
struct zram_table_entry {
unsigned long handle;
unsigned long flags;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map lockdep_map;
#endif
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
ktime_t ac_time;
#endif
};
/*
* entry locking rules:
*
* 1) Lock is exclusive
*
* 2) lock() function can sleep waiting for the lock
*
* 3) Lock owner can sleep
*
* 4) Use TRY lock variant when in atomic context
* - must check return value and handle locking failures
*/
static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
{
unsigned long *lock = &zram->table[index].flags;
if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
#endif
return true;
}
return false;
}
static void zram_slot_lock(struct zram *zram, u32 index)
{
unsigned long *lock = &zram->table[index].flags;
WARN_ON_ONCE(!preemptible());
wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
#endif
}
static void zram_slot_unlock(struct zram *zram, u32 index)
{
unsigned long *lock = &zram->table[index].flags;
clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
#endif
}
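And, just for illustration (this is not part of the actual patch), an
atomic-context caller such as the slot-free notification handler is
expected to use the TRY variant and handle the failure, roughly along
the lines of the existing slot-free path:

/*
 * Illustrative sketch only: callers that cannot sleep (e.g. slot-free
 * notification from softirq) must use the TRY variant and cope with
 * a locking failure instead of waiting.
 */
static void example_slot_free_notify(struct zram *zram, u32 index)
{
	atomic64_inc(&zram->stats.notify_free);

	/* softirq context: sleeping is not allowed here */
	if (!zram_slot_try_lock(zram, index)) {
		/* somebody else holds the entry, just account the miss */
		atomic64_inc(&zram->stats.miss_free);
		return;
	}

	zram_free_page(zram, index);
	zram_slot_unlock(zram, index);
}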
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-06 7:01 ` Sergey Senozhatsky
@ 2025-02-06 7:38 ` Sebastian Andrzej Siewior
2025-02-06 7:47 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-06 7:38 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hillf Danton,
Mike Galbraith
On 2025-02-06 16:01:12 [+0900], Sergey Senozhatsky wrote:
> So I completely reworked this bit and we don't have that problem
> anymore, nor the problem of "inventing locking schemes in 2025".
>
> In short - we are returning back to bit-locks, what zram has been using
> until commit 9518e5bfaae19 ("zram: Replace bit spinlocks with a spinlock_t),
> not bit-spinlock these time around, that won't work with linux-rt, but
> wait_on_bit and friends. Entry lock is exclusive, just like before,
> but lock owner can sleep now, any task wishing to lock that same entry
> will wait to be woken up by the current lock owner once it unlocks the
> entry. For cases when lock is taken from atomic context (e.g. slot-free
> notification from softirq) we continue using TRY lock, which has been
> introduced by commit 3c9959e025472 ("zram: fix lockdep warning of free block
> handling"), so there's nothing new here. In addition I added some lockdep
> annotations, just to be safe.
>
> There shouldn't be too many tasks competing for the same entry. I can
> only think of cases when read/write (or slot-free notification if zram
> is used as a swap device) vs. writeback or recompression (we cannot have
> writeback and recompression simultaneously).
So if I understand, you want to get back to bit spinlocks but sleeping
instead of polling. But why? Do you intend to have more locks per entry
so that you use the additional bits with the "lock"?
> It currently looks like this:
>
…
> static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
> {
> unsigned long *lock = &zram->table[index].flags;
>
> if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> #endif
> return true;
> }
> return false;
> }
I hope the caller does not poll.
> static void zram_slot_lock(struct zram *zram, u32 index)
> {
> unsigned long *lock = &zram->table[index].flags;
>
> WARN_ON_ONCE(!preemptible());
you want might_sleep() here instead. preemptible() works only on
preemptible kernels. And might_sleep() is already provided by
wait_on_bit_lock(). So this can go.
> wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> #endif
I would argue that you want this before the wait_on_bit_lock() simply
because you want to see a possible deadlock before it happens.
> }
>
> static void zram_slot_unlock(struct zram *zram, u32 index)
> {
> unsigned long *lock = &zram->table[index].flags;
>
> clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
> #endif
Also before. So it complains about releasing a not-locked lock before
it happens.
> }
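Roughly like this, i.e. annotate first, then block/clear the bit
(untested sketch, keeping your CONFIG_DEBUG_LOCK_ALLOC ifdefs):

static void zram_slot_lock(struct zram *zram, u32 index)
{
	unsigned long *lock = &zram->table[index].flags;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
	/* let lockdep see the acquisition attempt before we block */
	mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
#endif
	wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
}

static void zram_slot_unlock(struct zram *zram, u32 index)
{
	unsigned long *lock = &zram->table[index].flags;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
	/* report an unbalanced release before the bit is actually cleared */
	mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
#endif
	clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
}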
Sebastian
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-06 7:38 ` Sebastian Andrzej Siewior
@ 2025-02-06 7:47 ` Sergey Senozhatsky
2025-02-06 8:13 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 7:47 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Hillf Danton, Mike Galbraith
On (25/02/06 08:38), Sebastian Andrzej Siewior wrote:
> On 2025-02-06 16:01:12 [+0900], Sergey Senozhatsky wrote:
> > So I completely reworked this bit and we don't have that problem
> > anymore, nor the problem of "inventing locking schemes in 2025".
> >
> > In short - we are returning back to bit-locks, what zram has been using
> > until commit 9518e5bfaae19 ("zram: Replace bit spinlocks with a spinlock_t),
> > not bit-spinlock these time around, that won't work with linux-rt, but
> > wait_on_bit and friends. Entry lock is exclusive, just like before,
> > but lock owner can sleep now, any task wishing to lock that same entry
> > will wait to be woken up by the current lock owner once it unlocks the
> > entry. For cases when lock is taken from atomic context (e.g. slot-free
> > notification from softirq) we continue using TRY lock, which has been
> > introduced by commit 3c9959e025472 ("zram: fix lockdep warning of free block
> > handling"), so there's nothing new here. In addition I added some lockdep
> > annotations, just to be safe.
> >
> > There shouldn't be too many tasks competing for the same entry. I can
> > only think of cases when read/write (or slot-free notification if zram
> > is used as a swap device) vs. writeback or recompression (we cannot have
> > writeback and recompression simultaneously).
>
> So if I understand, you want to get back to bit spinlocks but sleeping
> instead of polling. But why? Do you intend to have more locks per entry
> so that you use the additional bits with the "lock"?
zram is atomic right now, e.g.
zram_read()
lock entry by index # disables preemption
map zsmalloc entry # possibly memcpy
decompress
unmap zsmalloc
unlock entry # enables preemption
That's a pretty long time to keep preemption disabled (e.g. when using a slow
algorithm like zstd or deflate configured for high compression levels).
Apart from that, it's difficult to use async algorithms, which can
e.g. wait for the H/W to become available, or algorithms that might want
to allocate memory internally during compression/decompression (e.g.
zstd).
The entry lock is not the only lock in zram currently that makes it
atomic, just one of them.
> > It currently looks like this:
> >
> …
> > static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
> > {
> > unsigned long *lock = &zram->table[index].flags;
> >
> > if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> > #endif
> > return true;
> > }
> > return false;
> > }
>
> I hope the caller does not poll.
Yeah, we don't do that in the code.
> > static void zram_slot_lock(struct zram *zram, u32 index)
> > {
> > unsigned long *lock = &zram->table[index].flags;
> >
> > WARN_ON_ONCE(!preemptible());
>
> you want might_sleep() here instead. preemptible() works only on
> preemptible kernels. And might_sleep() is already provided by
> wait_on_bit_lock(). So this can go.
wait_on_bit_lock() has might_sleep().
> > wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE);
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> > #endif
>
> I would argue that you want this before the wait_on_bit_lock() simply
> because you want to see a possible deadlock before it happens.
Ack.
> > static void zram_slot_unlock(struct zram *zram, u32 index)
> > {
> > unsigned long *lock = &zram->table[index].flags;
> >
> > clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock);
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > mutex_release(&zram->table[index].lockdep_map, _RET_IP_);
> > #endif
> Also before. So it complains about release a not locked lock before it
> happens.
OK.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-06 7:47 ` Sergey Senozhatsky
@ 2025-02-06 8:13 ` Sebastian Andrzej Siewior
2025-02-06 8:17 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-06 8:13 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hillf Danton,
Mike Galbraith
On 2025-02-06 16:47:02 [+0900], Sergey Senozhatsky wrote:
> zram is atomic right now, e.g.
>
> zram_read()
> lock entry by index # disables preemption
> map zsmalloc entry # possibly memcpy
> decompress
> unmap zsmalloc
> unlock entry # enables preemption
>
> That's a pretty long time to keep preemption disabled (e.g. using slow
> algorithm like zstd or deflate configured with high compression levels).
> Apart from that that, difficult to use async algorithms, which can
> e.g. wait for a H/W to become available, or algorithms that might want
> to allocate memory internally during compression/decompression, e.g.
> zstd).
>
> Entry lock is not the only lock in zram currently that makes it
> atomic, just one of.
Okay. So there are requirements for the sleeping lock. A mutex isn't
fitting the requirement because it is too large I guess.
> > > static void zram_slot_lock(struct zram *zram, u32 index)
> > > {
> > > unsigned long *lock = &zram->table[index].flags;
> > >
> > > WARN_ON_ONCE(!preemptible());
> >
> > you want might_sleep() here instead. preemptible() works only on
> > preemptible kernels. And might_sleep() is already provided by
> > wait_on_bit_lock(). So this can go.
>
> wait_on_bit_lock() has might_sleep().
My point exactly. This makes the WARN_ON_ONCE() obsolete.
Sebastian
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-06 8:13 ` Sebastian Andrzej Siewior
@ 2025-02-06 8:17 ` Sergey Senozhatsky
2025-02-06 8:26 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 8:17 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Hillf Danton, Mike Galbraith
On (25/02/06 09:13), Sebastian Andrzej Siewior wrote:
> On 2025-02-06 16:47:02 [+0900], Sergey Senozhatsky wrote:
> > zram is atomic right now, e.g.
> >
> > zram_read()
> > lock entry by index # disables preemption
> > map zsmalloc entry # possibly memcpy
> > decompress
> > unmap zsmalloc
> > unlock entry # enables preemption
> >
> > That's a pretty long time to keep preemption disabled (e.g. using slow
> > algorithm like zstd or deflate configured with high compression levels).
> > Apart from that that, difficult to use async algorithms, which can
> > e.g. wait for a H/W to become available, or algorithms that might want
> > to allocate memory internally during compression/decompression, e.g.
> > zstd).
> >
> > Entry lock is not the only lock in zram currently that makes it
> > atomic, just one of.
>
> Okay. So there are requirements for the sleeping lock. A mutex isn't
> fitting the requirement because it is too large I guess.
Correct.
> > > > static void zram_slot_lock(struct zram *zram, u32 index)
> > > > {
> > > > unsigned long *lock = &zram->table[index].flags;
> > > >
> > > > WARN_ON_ONCE(!preemptible());
> > >
> > > you want might_sleep() here instead. preemptible() works only on
> > > preemptible kernels. And might_sleep() is already provided by
> > > wait_on_bit_lock(). So this can go.
> >
> > wait_on_bit_lock() has might_sleep().
>
> My point exactly. This makes the WARN_ON_ONCE() obsolete.
Right, might_sleep() can be disabled, as far as I understand,
via CONFIG_DEBUG_ATOMIC_SLEEP, unlike WARN_ON_ONCE(). But I
can drop it and then just rely on might_sleep(), should be
enough.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-06 8:17 ` Sergey Senozhatsky
@ 2025-02-06 8:26 ` Sebastian Andrzej Siewior
2025-02-06 8:29 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-06 8:26 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hillf Danton,
Mike Galbraith
On 2025-02-06 17:17:41 [+0900], Sergey Senozhatsky wrote:
> > Okay. So there are requirements for the sleeping lock. A mutex isn't
> > fitting the requirement because it is too large I guess.
>
> Correct.
It would be nice to state why a generic locking implementation cannot
be used. From what I have seen it should play along with RT nicely.
> > > > > static void zram_slot_lock(struct zram *zram, u32 index)
> > > > > {
> > > > > unsigned long *lock = &zram->table[index].flags;
> > > > >
> > > > > WARN_ON_ONCE(!preemptible());
> > > >
> > > > you want might_sleep() here instead. preemptible() works only on
> > > > preemptible kernels. And might_sleep() is already provided by
> > > > wait_on_bit_lock(). So this can go.
> > >
> > > wait_on_bit_lock() has might_sleep().
> >
> > My point exactly. This makes the WARN_ON_ONCE() obsolete.
>
> Right, might_sleep() can be disabled, as far as I understand,
> via CONFIG_DEBUG_ATOMIC_SLEEP, unlike WARN_ON_ONCE(). But I
> can drop it and then just rely on might_sleep(), should be
> enough.
It should be enough. mutex_lock(), down() and so on rely solely on it.
As I said, preemptible() only works where the preemption counter is
available: on preemptible kernels, and on !preemptible kernels with
debugging enabled.
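For reference, this is roughly what preemptible() boils down to
(simplified from include/linux/preempt.h):

#ifdef CONFIG_PREEMPT_COUNT
#define preemptible()	(preempt_count() == 0 && !irqs_disabled())
#else
/* no preemption counter is maintained, so the check degrades to 0 */
#define preemptible()	0
#endif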
Sebastian
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 01/17] zram: switch to non-atomic entry locking
2025-02-06 8:26 ` Sebastian Andrzej Siewior
@ 2025-02-06 8:29 ` Sergey Senozhatsky
0 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 8:29 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Hillf Danton, Mike Galbraith
On (25/02/06 09:26), Sebastian Andrzej Siewior wrote:
> On 2025-02-06 17:17:41 [+0900], Sergey Senozhatsky wrote:
> > > Okay. So there are requirements for the sleeping lock. A mutex isn't
> > > fitting the requirement because it is too large I guess.
> >
> > Correct.
>
> I would nice to state this why a generic locking implementation can not
> be used. From what I have seen it should play along with RT nicely.
Will do.
> > > > wait_on_bit_lock() has might_sleep().
> > >
> > > My point exactly. This makes the WARN_ON_ONCE() obsolete.
> >
> > Right, might_sleep() can be disabled, as far as I understand,
> > via CONFIG_DEBUG_ATOMIC_SLEEP, unlike WARN_ON_ONCE(). But I
> > can drop it and then just rely on might_sleep(), should be
> > enough.
>
> It should be enough. mutex_lock(), down() and so on relies solely on it.
> As I said, preemptible() only works on preemptible kernels if it comes
> to the preemption counter on and !preemptible kernels with enabled
> debugging.
Ack.
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 01/17] zram: switch to non-atomic entry locking Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-02-01 9:21 ` Kairui Song
2025-01-31 9:06 ` [PATCHv4 03/17] zram: remove crypto include Sergey Senozhatsky
` (14 subsequent siblings)
16 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Similarly to the per-entry spin-lock, per-CPU compression streams
also have a number of shortcomings.
First, per-CPU stream access has to be done from a non-preemptible
(atomic) section, which imposes the same atomicity requirements on
compression backends as the entry spin-lock does and makes it
impossible to use algorithms that can schedule/wait/sleep during
compression and decompression.
Second, per-CPU streams noticeably increase memory usage (actually
more like wastage) of secondary compression streams. The problem
is that secondary compression streams are allocated per-CPU, just
like the primary streams are. Yet we never use more than one
secondary stream at a time, because recompression is a single
threaded action. Which means that the remaining num_online_cpu() - 1
streams are allocated for nothing, and this happens for each priority
list (we can have several secondary compression algorithms). Depending
on the algorithm this may lead to significant memory wastage; in
addition, each stream also carries a workmem buffer (2 physical
pages).
Instead of per-CPU streams, maintain a list of idle compression
streams and allocate new streams on-demand (something that we
used to do many years ago), so that zram read() and write() become
non-atomic and the requirements on the compression algorithm
implementation are eased. This also means that we now need only
one secondary stream per priority list.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 164 +++++++++++++++++++---------------
drivers/block/zram/zcomp.h | 17 ++--
drivers/block/zram/zram_drv.c | 29 +++---
include/linux/cpuhotplug.h | 1 -
4 files changed, 109 insertions(+), 102 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index bb514403e305..982c769d5831 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -6,7 +6,7 @@
#include <linux/slab.h>
#include <linux/wait.h>
#include <linux/sched.h>
-#include <linux/cpu.h>
+#include <linux/cpumask.h>
#include <linux/crypto.h>
#include <linux/vmalloc.h>
@@ -43,31 +43,40 @@ static const struct zcomp_ops *backends[] = {
NULL
};
-static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm)
+static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *strm)
{
- comp->ops->destroy_ctx(&zstrm->ctx);
- vfree(zstrm->buffer);
- zstrm->buffer = NULL;
+ comp->ops->destroy_ctx(&strm->ctx);
+ vfree(strm->buffer);
+ kfree(strm);
}
-static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
+static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp)
{
+ struct zcomp_strm *strm;
int ret;
- ret = comp->ops->create_ctx(comp->params, &zstrm->ctx);
- if (ret)
- return ret;
+ strm = kzalloc(sizeof(*strm), GFP_KERNEL);
+ if (!strm)
+ return NULL;
+
+ INIT_LIST_HEAD(&strm->entry);
+
+ ret = comp->ops->create_ctx(comp->params, &strm->ctx);
+ if (ret) {
+ kfree(strm);
+ return NULL;
+ }
/*
- * allocate 2 pages. 1 for compressed data, plus 1 extra for the
- * case when compressed size is larger than the original one
+ * allocate 2 pages. 1 for compressed data, plus 1 extra in case if
+ * compressed data is larger than the original one.
*/
- zstrm->buffer = vzalloc(2 * PAGE_SIZE);
- if (!zstrm->buffer) {
- zcomp_strm_free(comp, zstrm);
- return -ENOMEM;
+ strm->buffer = vzalloc(2 * PAGE_SIZE);
+ if (!strm->buffer) {
+ zcomp_strm_free(comp, strm);
+ return NULL;
}
- return 0;
+ return strm;
}
static const struct zcomp_ops *lookup_backend_ops(const char *comp)
@@ -109,13 +118,59 @@ ssize_t zcomp_available_show(const char *comp, char *buf)
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
{
- local_lock(&comp->stream->lock);
- return this_cpu_ptr(comp->stream);
+ struct zcomp_strm *strm;
+
+ might_sleep();
+
+ while (1) {
+ spin_lock(&comp->strm_lock);
+ if (!list_empty(&comp->idle_strm)) {
+ strm = list_first_entry(&comp->idle_strm,
+ struct zcomp_strm,
+ entry);
+ list_del(&strm->entry);
+ spin_unlock(&comp->strm_lock);
+ return strm;
+ }
+
+ /* cannot allocate new stream, wait for an idle one */
+ if (comp->avail_strm >= num_online_cpus()) {
+ spin_unlock(&comp->strm_lock);
+ wait_event(comp->strm_wait,
+ !list_empty(&comp->idle_strm));
+ continue;
+ }
+
+ /* allocate new stream */
+ comp->avail_strm++;
+ spin_unlock(&comp->strm_lock);
+
+ strm = zcomp_strm_alloc(comp);
+ if (strm)
+ break;
+
+ spin_lock(&comp->strm_lock);
+ comp->avail_strm--;
+ spin_unlock(&comp->strm_lock);
+ wait_event(comp->strm_wait, !list_empty(&comp->idle_strm));
+ }
+
+ return strm;
}
-void zcomp_stream_put(struct zcomp *comp)
+void zcomp_stream_put(struct zcomp *comp, struct zcomp_strm *strm)
{
- local_unlock(&comp->stream->lock);
+ spin_lock(&comp->strm_lock);
+ if (comp->avail_strm <= num_online_cpus()) {
+ list_add(&strm->entry, &comp->idle_strm);
+ spin_unlock(&comp->strm_lock);
+ wake_up(&comp->strm_wait);
+ return;
+ }
+
+ comp->avail_strm--;
+ spin_unlock(&comp->strm_lock);
+ zcomp_strm_free(comp, strm);
}
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
@@ -148,61 +203,19 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
return comp->ops->decompress(comp->params, &zstrm->ctx, &req);
}
-int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
-{
- struct zcomp *comp = hlist_entry(node, struct zcomp, node);
- struct zcomp_strm *zstrm;
- int ret;
-
- zstrm = per_cpu_ptr(comp->stream, cpu);
- local_lock_init(&zstrm->lock);
-
- ret = zcomp_strm_init(comp, zstrm);
- if (ret)
- pr_err("Can't allocate a compression stream\n");
- return ret;
-}
-
-int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
-{
- struct zcomp *comp = hlist_entry(node, struct zcomp, node);
- struct zcomp_strm *zstrm;
-
- zstrm = per_cpu_ptr(comp->stream, cpu);
- zcomp_strm_free(comp, zstrm);
- return 0;
-}
-
-static int zcomp_init(struct zcomp *comp, struct zcomp_params *params)
-{
- int ret;
-
- comp->stream = alloc_percpu(struct zcomp_strm);
- if (!comp->stream)
- return -ENOMEM;
-
- comp->params = params;
- ret = comp->ops->setup_params(comp->params);
- if (ret)
- goto cleanup;
-
- ret = cpuhp_state_add_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
- if (ret < 0)
- goto cleanup;
-
- return 0;
-
-cleanup:
- comp->ops->release_params(comp->params);
- free_percpu(comp->stream);
- return ret;
-}
-
void zcomp_destroy(struct zcomp *comp)
{
- cpuhp_state_remove_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
+ struct zcomp_strm *strm;
+
+ while (!list_empty(&comp->idle_strm)) {
+ strm = list_first_entry(&comp->idle_strm,
+ struct zcomp_strm,
+ entry);
+ list_del(&strm->entry);
+ zcomp_strm_free(comp, strm);
+ }
+
comp->ops->release_params(comp->params);
- free_percpu(comp->stream);
kfree(comp);
}
@@ -229,7 +242,12 @@ struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params)
return ERR_PTR(-EINVAL);
}
- error = zcomp_init(comp, params);
+ INIT_LIST_HEAD(&comp->idle_strm);
+ init_waitqueue_head(&comp->strm_wait);
+ spin_lock_init(&comp->strm_lock);
+
+ comp->params = params;
+ error = comp->ops->setup_params(comp->params);
if (error) {
kfree(comp);
return ERR_PTR(error);
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index ad5762813842..62330829db3f 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -3,10 +3,10 @@
#ifndef _ZCOMP_H_
#define _ZCOMP_H_
-#include <linux/local_lock.h>
-
#define ZCOMP_PARAM_NO_LEVEL INT_MIN
+#include <linux/wait.h>
+
/*
* Immutable driver (backend) parameters. The driver may attach private
* data to it (e.g. driver representation of the dictionary, etc.).
@@ -31,7 +31,7 @@ struct zcomp_ctx {
};
struct zcomp_strm {
- local_lock_t lock;
+ struct list_head entry;
/* compression buffer */
void *buffer;
struct zcomp_ctx ctx;
@@ -60,16 +60,15 @@ struct zcomp_ops {
const char *name;
};
-/* dynamic per-device compression frontend */
struct zcomp {
- struct zcomp_strm __percpu *stream;
+ struct list_head idle_strm;
+ spinlock_t strm_lock;
+ u32 avail_strm;
+ wait_queue_head_t strm_wait;
const struct zcomp_ops *ops;
struct zcomp_params *params;
- struct hlist_node node;
};
-int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node);
-int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node);
ssize_t zcomp_available_show(const char *comp, char *buf);
bool zcomp_available_algorithm(const char *comp);
@@ -77,7 +76,7 @@ struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params);
void zcomp_destroy(struct zcomp *comp);
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
-void zcomp_stream_put(struct zcomp *comp);
+void zcomp_stream_put(struct zcomp *comp, struct zcomp_strm *strm);
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
const void *src, unsigned int *dst_len);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 1c2df2341704..8d5974ea8ff8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -31,7 +31,6 @@
#include <linux/idr.h>
#include <linux/sysfs.h>
#include <linux/debugfs.h>
-#include <linux/cpuhotplug.h>
#include <linux/part_stat.h>
#include <linux/kernel_read_file.h>
@@ -1601,7 +1600,7 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
kunmap_local(dst);
zs_unmap_object(zram->mem_pool, handle);
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
return ret;
}
@@ -1762,14 +1761,14 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
kunmap_local(mem);
if (unlikely(ret)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
pr_err("Compression failed! err=%d\n", ret);
zs_free(zram->mem_pool, handle);
return ret;
}
if (comp_len >= huge_class_size) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
return write_incompressible_page(zram, page, index);
}
@@ -1793,7 +1792,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
@@ -1805,7 +1804,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
}
if (!zram_can_store_page(zram)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
zs_free(zram->mem_pool, handle);
return -ENOMEM;
}
@@ -1813,7 +1812,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len);
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
zs_unmap_object(zram->mem_pool, handle);
zram_slot_write_lock(zram, index);
@@ -1972,7 +1971,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
kunmap_local(src);
if (ret) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
return ret;
}
@@ -1982,7 +1981,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
/* Continue until we make progress */
if (class_index_new >= class_index_old ||
(threshold && comp_len_new >= threshold)) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
continue;
}
@@ -2040,13 +2039,13 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle_new)) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
return PTR_ERR((void *)handle_new);
}
dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len_new);
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
zs_unmap_object(zram->mem_pool, handle_new);
@@ -2794,7 +2793,6 @@ static void destroy_devices(void)
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
- cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
}
static int __init zram_init(void)
@@ -2804,15 +2802,9 @@ static int __init zram_init(void)
BUILD_BUG_ON(__NR_ZRAM_PAGEFLAGS > sizeof(zram_te.flags) * 8);
- ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
- zcomp_cpu_up_prepare, zcomp_cpu_dead);
- if (ret < 0)
- return ret;
-
ret = class_register(&zram_control_class);
if (ret) {
pr_err("Unable to register zram-control class\n");
- cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
return ret;
}
@@ -2821,7 +2813,6 @@ static int __init zram_init(void)
if (zram_major <= 0) {
pr_err("Unable to get major number\n");
class_unregister(&zram_control_class);
- cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
return -EBUSY;
}
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6cc5e484547c..092ace7db8ee 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -119,7 +119,6 @@ enum cpuhp_state {
CPUHP_MM_ZS_PREPARE,
CPUHP_MM_ZSWP_POOL_PREPARE,
CPUHP_KVM_PPC_BOOK3S_PREPARE,
- CPUHP_ZCOMP_PREPARE,
CPUHP_TIMERS_PREPARE,
CPUHP_TMIGR_PREPARE,
CPUHP_MIPS_SOC_PREPARE,
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-01-31 9:06 ` [PATCHv4 02/17] zram: do not use per-CPU compression streams Sergey Senozhatsky
@ 2025-02-01 9:21 ` Kairui Song
2025-02-03 3:49 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Kairui Song @ 2025-02-01 9:21 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
Hi Sergey,
On Fri, Jan 31, 2025 at 5:07 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
[..]
This seems like it will cause a huge performance regression on multi-core
systems; it is especially significant as the number of concurrent
tasks increases:
Test build linux kernel using ZRAM as SWAP (1G memcg):
Before:
+ /usr/bin/time make -s -j48
2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
863304maxresident)k
After:
+ /usr/bin/time make -s -j48
2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
863276maxresident)k
`perf lock contention -ab sleep 3` also indicates the big spin lock in
zcomp_stream_get/put is having significant contention:
  contended   total wait     max wait     avg wait         type   caller
     793357      28.71 s      2.66 ms     36.19 us     spinlock   zcomp_stream_get+0x37
     793170      28.60 s      2.65 ms     36.06 us     spinlock   zcomp_stream_put+0x1f
     444007      15.26 s      2.58 ms     34.37 us     spinlock   zcomp_stream_put+0x1f
     443960      15.21 s      2.68 ms     34.25 us     spinlock   zcomp_stream_get+0x37
       5516    152.50 ms      3.30 ms     27.65 us     spinlock   evict_folios+0x7e
       4523    137.47 ms      3.66 ms     30.39 us     spinlock   folio_lruvec_lock_irqsave+0xc3
       4253    108.93 ms      2.92 ms     25.61 us     spinlock   folio_lruvec_lock_irqsave+0xc3
      49294     71.73 ms     15.87 us      1.46 us     spinlock   list_lru_del+0x7c
       2327     51.35 ms      3.48 ms     22.07 us     spinlock   evict_folios+0x5c0
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-01 9:21 ` Kairui Song
@ 2025-02-03 3:49 ` Sergey Senozhatsky
2025-02-03 21:00 ` Yosry Ahmed
2025-02-06 6:55 ` Kairui Song
0 siblings, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 3:49 UTC (permalink / raw)
To: Kairui Song
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/02/01 17:21), Kairui Song wrote:
> This seems will cause a huge regression of performance on multi core
> systems, this is especially significant as the number of concurrent
> tasks increases:
>
> Test build linux kernel using ZRAM as SWAP (1G memcg):
>
> Before:
> + /usr/bin/time make -s -j48
> 2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
> 863304maxresident)k
>
> After:
> + /usr/bin/time make -s -j48
> 2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
> 863276maxresident)k
How many CPUs do you have? I assume preemption gets in the way, which is
sort of expected, to be honest... Using per-CPU compression streams
disables preemption and uses the CPU exclusively, at the price of other
tasks not being able to run. I do tend to think that I made a mistake
by switching zram to per-CPU compression streams.
What preemption model do you use and to what extent do you overload
your system?
My tests don't show anything unusual (but I don't overload the system)
CONFIG_PREEMPT
before
1371.96user 156.21system 1:30.91elapsed 1680%CPU (0avgtext+0avgdata 825636maxresident)k
32688inputs+1768416outputs (259major+51539861minor)pagefaults 0swaps
after
1372.05user 155.79system 1:30.82elapsed 1682%CPU (0avgtext+0avgdata 825684maxresident)k
32680inputs+1768416outputs (273major+51541815minor)pagefaults 0swaps
(I use zram as a block device with ext4 on it.)
> `perf lock contention -ab sleep 3` also indicates the big spin lock in
> zcomp_stream_get/put is having significant contention:
Hmm it's just
spin_lock()
list first entry
spin_unlock()
Shouldn't be "a big spin lock", that's very odd. I'm not familiar with
perf lock contention, let me take a look.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-03 3:49 ` Sergey Senozhatsky
@ 2025-02-03 21:00 ` Yosry Ahmed
2025-02-06 12:26 ` Sergey Senozhatsky
2025-02-06 6:55 ` Kairui Song
1 sibling, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-03 21:00 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Kairui Song, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Mon, Feb 03, 2025 at 12:49:42PM +0900, Sergey Senozhatsky wrote:
> On (25/02/01 17:21), Kairui Song wrote:
> > This seems will cause a huge regression of performance on multi core
> > systems, this is especially significant as the number of concurrent
> > tasks increases:
> >
> > Test build linux kernel using ZRAM as SWAP (1G memcg):
> >
> > Before:
> > + /usr/bin/time make -s -j48
> > 2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
> > 863304maxresident)k
> >
> > After:
> > + /usr/bin/time make -s -j48
> > 2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
> > 863276maxresident)k
>
> How many CPUs do you have? I assume, preemption gets into way which is
> sort of expected, to be honest... Using per-CPU compression streams
> disables preemption and uses CPU exclusively at a price of other tasks
> not being able to run. I do tend to think that I made a mistake by
> switching zram to per-CPU compression streams.
FWIW, I am not familiar at all with the zram code but zswap uses per-CPU
acomp contexts with a mutex instead of a spinlock. So the task uses the
context of the CPU that it started on, but it can be preempted or
migrated and end up running on a different CPU. This means that
contention is still possible, but probably much lower than having a
shared pool of contexts that all CPUs compete on.
Again, this could be irrelevant as I am not very familiar with the zram
code, just thought this may be useful.
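Very roughly, the pattern looks like this (simplified sketch with
illustrative names, not the actual zswap code):

struct example_acomp_ctx {
	struct mutex mutex;
	void *buffer;		/* scratch buffer, transform, request, ... */
};

static int example_compress(struct example_acomp_ctx __percpu *ctxs)
{
	/* start with the context of the CPU the task happens to be on... */
	struct example_acomp_ctx *ctx = raw_cpu_ptr(ctxs);
	int ret = 0;

	/*
	 * ...but serialize on a sleeping lock, so the task may be
	 * preempted or even migrate while it owns the context.
	 */
	mutex_lock(&ctx->mutex);
	/* compression/decompression work would go here */
	mutex_unlock(&ctx->mutex);

	return ret;
}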
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-03 21:00 ` Yosry Ahmed
@ 2025-02-06 12:26 ` Sergey Senozhatsky
0 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 12:26 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Kairui Song, Andrew Morton, Minchan Kim,
linux-mm, linux-kernel
On (25/02/03 21:00), Yosry Ahmed wrote:
> On Mon, Feb 03, 2025 at 12:49:42PM +0900, Sergey Senozhatsky wrote:
> > On (25/02/01 17:21), Kairui Song wrote:
> FWIW, I am not familiar at all with the zram code but zswap uses per-CPU
> acomp contexts with a mutex instead of a spinlock. So the task uses the
> context of the CPU that it started on, but it can be preempted or
> migrated and end up running on a different CPU.
Thank you for the idea. We couldn't do that before (in zram): in a
number of cases the per-CPU stream was taken from atomic context (under
the zram table entry spinlock/bit-spinlock), but it's possible now
because the entry lock is preemptible.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-03 3:49 ` Sergey Senozhatsky
2025-02-03 21:00 ` Yosry Ahmed
@ 2025-02-06 6:55 ` Kairui Song
2025-02-06 7:22 ` Sergey Senozhatsky
1 sibling, 1 reply; 73+ messages in thread
From: Kairui Song @ 2025-02-06 6:55 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Yosry Ahmed
On Mon, Feb 3, 2025 at 11:49 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (25/02/01 17:21), Kairui Song wrote:
> > This seems will cause a huge regression of performance on multi core
> > systems, this is especially significant as the number of concurrent
> > tasks increases:
> >
> > Test build linux kernel using ZRAM as SWAP (1G memcg):
> >
> > Before:
> > + /usr/bin/time make -s -j48
> > 2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
> > 863304maxresident)k
> >
> > After:
> > + /usr/bin/time make -s -j48
> > 2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
> > 863276maxresident)k
>
> How many CPUs do you have? I assume, preemption gets into way which is
> sort of expected, to be honest... Using per-CPU compression streams
> disables preemption and uses CPU exclusively at a price of other tasks
> not being able to run. I do tend to think that I made a mistake by
> switching zram to per-CPU compression streams.
>
> What preemption model do you use and to what extent do you overload
> your system?
>
> My tests don't show anything unusual (but I don't overload the system)
>
> CONFIG_PREEMPT
I'm using CONFIG_PREEMPT_VOLUNTARY=y, and there are 96 logical CPUs
(48c96t), make -j48 shouldn't be considered overload I think. make
-j32 also showed an obvious slow down.
>
> before
> 1371.96user 156.21system 1:30.91elapsed 1680%CPU (0avgtext+0avgdata 825636maxresident)k
> 32688inputs+1768416outputs (259major+51539861minor)pagefaults 0swaps
>
> after
> 1372.05user 155.79system 1:30.82elapsed 1682%CPU (0avgtext+0avgdata 825684maxresident)k
> 32680inputs+1768416outputs (273major+51541815minor)pagefaults 0swaps
>
> (I use zram as a block device with ext4 on it.)
I'm testing with ZRAM as SWAP, and tmpfs as storage for the kernel
source code, with memory pressure inside a 2G or smaller mem cgroup
(depending on make -j48 or -j32).
>
> > `perf lock contention -ab sleep 3` also indicates the big spin lock in
> > zcomp_stream_get/put is having significant contention:
>
> Hmm it's just
>
> spin_lock()
> list first entry
> spin_unlock()
>
> Shouldn't be "a big spin lock", that's very odd. I'm not familiar with
> perf lock contention, let me take a look.
I can debug this a bit more to figure out why the contention is huge
later, but my first thought is that, as Yosry also mentioned in
another reply, making it preemptable doesn't necessarily mean the per
CPU stream has to be gone.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-06 6:55 ` Kairui Song
@ 2025-02-06 7:22 ` Sergey Senozhatsky
2025-02-06 8:22 ` Sergey Senozhatsky
2025-02-06 16:16 ` Yosry Ahmed
0 siblings, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 7:22 UTC (permalink / raw)
To: Kairui Song
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Yosry Ahmed
On (25/02/06 14:55), Kairui Song wrote:
> > On (25/02/01 17:21), Kairui Song wrote:
> > > This seems will cause a huge regression of performance on multi core
> > > systems, this is especially significant as the number of concurrent
> > > tasks increases:
> > >
> > > Test build linux kernel using ZRAM as SWAP (1G memcg):
> > >
> > > Before:
> > > + /usr/bin/time make -s -j48
> > > 2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
> > > 863304maxresident)k
> > >
> > > After:
> > > + /usr/bin/time make -s -j48
> > > 2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
> > > 863276maxresident)k
> >
> > How many CPUs do you have? I assume, preemption gets into way which is
> > sort of expected, to be honest... Using per-CPU compression streams
> > disables preemption and uses CPU exclusively at a price of other tasks
> > not being able to run. I do tend to think that I made a mistake by
> > switching zram to per-CPU compression streams.
> >
> > What preemption model do you use and to what extent do you overload
> > your system?
> >
> > My tests don't show anything unusual (but I don't overload the system)
> >
> > CONFIG_PREEMPT
>
> I'm using CONFIG_PREEMPT_VOLUNTARY=y, and there are 96 logical CPUs
> (48c96t), make -j48 shouldn't be considered overload I think. make
> -j32 also showed an obvious slow down.
Hmm, there should be more than enough compression streams then, the
limit is num_online_cpus. That's strange. I wonder if that's zsmalloc
handle allocation ("remove two-staged handle allocation" in the series.)
[..]
> > Hmm it's just
> >
> > spin_lock()
> > list first entry
> > spin_unlock()
> >
> > Shouldn't be "a big spin lock", that's very odd. I'm not familiar with
> > perf lock contention, let me take a look.
>
> I can debug this a bit more to figure out why the contention is huge
> later
That will be appreciated, thank you.
> but my first thought is that, as Yosry also mentioned in
> another reply, making it preemptable doesn't necessarily mean the per
> CPU stream has to be gone.
Was going to reply to Yosry's email today/tomorrow, didn't have time to
look into it, but will reply here.
So for spin-lock contention - yes, but that lock really should not
be so visible. Other than that we limit the number of compression
streams to the number of the CPUs and permit preemption, so it should
be the same as the "preemptible per-CPU" streams, roughly. The
difference, perhaps, is that we don't pre-allocate streams, but
allocate only as needed. This has two sides: one side is that later
allocations can fail, but the other side is that we don't allocate
streams that we don't use. Especially secondary streams (priority 1
and 2, which are used for recompression). I didn't know it was possible
to use per-CPU data and still have preemption enabled at the same time.
So I'm not opposed to the idea of still having per-CPU streams and do
what zswap folks did.
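For reference, the stream pool shape being discussed here - roughly spin_lock(),
take the first idle entry, spin_unlock(), with writers sleeping when the pool is
empty - as an illustrative sketch only (field names such as strm_lock, idle_strm
and strm_wait are made up, this is not the exact code from the series):

struct zcomp_strm *stream_pool_get(struct zcomp *comp)
{
	struct zcomp_strm *zstrm;

	for (;;) {
		spin_lock(&comp->strm_lock);
		if (!list_empty(&comp->idle_strm)) {
			zstrm = list_first_entry(&comp->idle_strm,
						 struct zcomp_strm, entry);
			list_del(&zstrm->entry);
			spin_unlock(&comp->strm_lock);
			return zstrm;
		}
		spin_unlock(&comp->strm_lock);
		/* preemptible sleep until somebody puts a stream back */
		wait_event(comp->strm_wait, !list_empty(&comp->idle_strm));
	}
}

void stream_pool_put(struct zcomp *comp, struct zcomp_strm *zstrm)
{
	spin_lock(&comp->strm_lock);
	list_add(&zstrm->entry, &comp->idle_strm);
	spin_unlock(&comp->strm_lock);
	wake_up(&comp->strm_wait);
}

The critical section itself is tiny; if anything shows up under load it would be
all CPUs bouncing the same lock cache line and, when the pool runs dry, waiting
for a stream to be returned.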
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-06 7:22 ` Sergey Senozhatsky
@ 2025-02-06 8:22 ` Sergey Senozhatsky
2025-02-06 16:16 ` Yosry Ahmed
1 sibling, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 8:22 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Kairui Song, Andrew Morton, Minchan Kim, linux-mm, linux-kernel,
Yosry Ahmed
On (25/02/06 16:22), Sergey Senozhatsky wrote:
> I didn't know it was possible to use per-CPU data and still have
> preemption enabled at the same time. So I'm not opposed to the
> idea of still having per-CPU streams and do what zswap folks did.
Maybe that's actually a preferable option. On-demand allocation of
streams has the problem that stream constructors need to use proper
GFP flags (they still use GFP_KERNEL, wrongly), and so on. Keeping
things the way they are (per-CPU) but adding preemption is likely
a safer and nicer option.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-06 7:22 ` Sergey Senozhatsky
2025-02-06 8:22 ` Sergey Senozhatsky
@ 2025-02-06 16:16 ` Yosry Ahmed
2025-02-07 2:56 ` Sergey Senozhatsky
1 sibling, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-06 16:16 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Kairui Song, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Thu, Feb 06, 2025 at 04:22:27PM +0900, Sergey Senozhatsky wrote:
> On (25/02/06 14:55), Kairui Song wrote:
> > > On (25/02/01 17:21), Kairui Song wrote:
> > > > This seems will cause a huge regression of performance on multi core
> > > > systems, this is especially significant as the number of concurrent
> > > > tasks increases:
> > > >
> > > > Test build linux kernel using ZRAM as SWAP (1G memcg):
> > > >
> > > > Before:
> > > > + /usr/bin/time make -s -j48
> > > > 2495.77user 2604.77system 2:12.95elapsed 3836%CPU (0avgtext+0avgdata
> > > > 863304maxresident)k
> > > >
> > > > After:
> > > > + /usr/bin/time make -s -j48
> > > > 2403.60user 6676.09system 3:38.22elapsed 4160%CPU (0avgtext+0avgdata
> > > > 863276maxresident)k
> > >
> > > How many CPUs do you have? I assume, preemption gets into way which is
> > > sort of expected, to be honest... Using per-CPU compression streams
> > > disables preemption and uses CPU exclusively at a price of other tasks
> > > not being able to run. I do tend to think that I made a mistake by
> > > switching zram to per-CPU compression streams.
> > >
> > > What preemption model do you use and to what extent do you overload
> > > your system?
> > >
> > > My tests don't show anything unusual (but I don't overload the system)
> > >
> > > CONFIG_PREEMPT
> >
> > I'm using CONFIG_PREEMPT_VOLUNTARY=y, and there are 96 logical CPUs
> > (48c96t), make -j48 shouldn't be considered overload I think. make
> > -j32 also showed an obvious slow down.
>
> Hmm, there should be more than enough compression streams then, the
> limit is num_online_cpus. That's strange. I wonder if that's zsmalloc
> handle allocation ("remove two-staged handle allocation" in the series.)
>
> [..]
> > > Hmm it's just
> > >
> > > spin_lock()
> > > list first entry
> > > spin_unlock()
> > >
> > > Shouldn't be "a big spin lock", that's very odd. I'm not familiar with
> > > perf lock contention, let me take a look.
> >
> > I can debug this a bit more to figure out why the contention is huge
> > later
>
> That will be appreciated, thank you.
>
> > but my first thought is that, as Yosry also mentioned in
> > another reply, making it preemptable doesn't necessarily mean the per
> > CPU stream has to be gone.
>
> Was going to reply to Yosry's email today/tomorrow, didn't have time to
> look into it, but will reply here.
>
>
> So for spin-lock contention - yes, but that lock really should not
> be so visible. Other than that we limit the number of compression
> streams to the number of the CPUs and permit preemption, so it should
> be the same as the "preemptible per-CPU" streams, roughly.
I think one other problem is that with a pool of streams guarded by a
single lock all CPUs have to be serialized on that lock, even if there's
enough streams for all CPUs in theory.
> The difference, perhaps, is that we don't pre-allocate streams, but
> allocate only as needed. This has two sides: one side is that later
> allocations can fail, but the other side is that we don't allocate
> streams that we don't use. Especially secondary streams (priority 1
> and 2, which are used for recompression). I didn't know it was possible
> to use per-CPU data and still have preemption enabled at the same time.
> So I'm not opposed to the idea of still having per-CPU streams and do
> what zswap folks did.
Note that it's not a free lunch. If preemption is allowed there is
nothing keeping the CPU whose data you're using from going away, and it can
be offlined. I see that zcomp_cpu_dead() would free the compression
stream from under its user in this case.
We had a similar problem recently in zswap and it took me a couple of
iterations to properly fix it. In short, you need to synchronize the CPU
hotplug callbacks with the users of the compression stream to make sure
the stream is not freed under the user.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-06 16:16 ` Yosry Ahmed
@ 2025-02-07 2:56 ` Sergey Senozhatsky
2025-02-07 6:12 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-07 2:56 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Kairui Song, Andrew Morton, Minchan Kim,
linux-mm, linux-kernel
On (25/02/06 16:16), Yosry Ahmed wrote:
> > So for spin-lock contention - yes, but that lock really should not
> > be so visible. Other than that we limit the number of compression
> > streams to the number of the CPUs and permit preemption, so it should
> > be the same as the "preemptible per-CPU" streams, roughly.
>
> I think one other problem is that with a pool of streams guarded by a
> single lock all CPUs have to be serialized on that lock, even if there's
> enough streams for all CPUs in theory.
Yes, at the same time it guards list-first-entry, which is not
exceptionally expensive. Yet, somehow, it still showed up on
Kairui's radar.
I think there was also a problem with how on-demand streams were
constructed - GFP_KERNEL allocations from a reclaim path, which
is a tiny bit problematic and deadlock-ish.
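For illustration of the "proper GFP flags" point - a hypothetical lazy stream
allocation that can run from a path already inside reclaim, so it avoids
GFP_KERNEL (the function name and buffer size are made up; only the flag
choice mirrors what the series does elsewhere):

/*
 * Hypothetical sketch: allocate a compression stream on demand from
 * a path that may itself be running under reclaim.  GFP_NOIO avoids
 * recursing into the block layer the way GFP_KERNEL could.
 */
static struct zcomp_strm *zcomp_strm_alloc_lazy(void)
{
	struct zcomp_strm *zstrm;

	zstrm = kzalloc(sizeof(*zstrm), GFP_NOIO);
	if (!zstrm)
		return NULL;

	/* scratch buffer for the worst-case compressed output */
	zstrm->buffer = kvzalloc(2 * PAGE_SIZE, GFP_NOIO | __GFP_NOWARN);
	if (!zstrm->buffer) {
		kfree(zstrm);
		return NULL;
	}
	return zstrm;
}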
> > The difference, perhaps, is that we don't pre-allocate streams, but
> > allocate only as needed. This has two sides: one side is that later
> > allocations can fail, but the other side is that we don't allocate
> > streams that we don't use. Especially secondary streams (priority 1
> > and 2, which are used for recompression). I didn't know it was possible
> > to use per-CPU data and still have preemption enabled at the same time.
> > So I'm not opposed to the idea of still having per-CPU streams and do
> > what zswap folks did.
>
> Note that it's not a free lunch. If preemption is allowed there is
> nothing keeping the CPU whose data you're using from going away, and it can
> be offlined. I see that zcomp_cpu_dead() would free the compression
> stream from under its user in this case.
Yes, I took the same approach as you did in zswap - we hold the mutex
that cpu-dead is blocked on for as long as the stream is being used.
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
{
for (;;) {
struct zcomp_strm *zstrm = raw_cpu_ptr(comp->stream);
/*
* Inspired by zswap
*
* stream is returned with ->mutex locked which prevents
* cpu_dead() from releasing this stream under us, however
* there is still a race window between raw_cpu_ptr() and
* mutex_lock(), during which we could have been migrated
* to a CPU that has already destroyed its stream. If so
* then unlock and re-try on the current CPU.
*/
mutex_lock(&zstrm->lock);
if (likely(zstrm->buffer))
return zstrm;
mutex_unlock(&zstrm->lock);
}
}
void zcomp_stream_put(struct zcomp_strm *zstrm)
{
mutex_unlock(&zstrm->lock);
}
int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
{
struct zcomp *comp = hlist_entry(node, struct zcomp, node);
struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
mutex_lock(&zstrm->lock);
zcomp_strm_free(comp, zstrm);
mutex_unlock(&zstrm->lock);
return 0;
}
> We had a similar problem recently in zswap and it took me a couple of
> iterations to properly fix it. In short, you need to synchronize the CPU
> hotplug callbacks with the users of the compression stream to make sure
> the stream is not freed under the user.
Agreed.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-07 2:56 ` Sergey Senozhatsky
@ 2025-02-07 6:12 ` Sergey Senozhatsky
2025-02-07 21:07 ` Yosry Ahmed
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-07 6:12 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Kairui Song, Andrew Morton, Minchan Kim, linux-mm, linux-kernel,
Sergey Senozhatsky
On (25/02/07 11:56), Sergey Senozhatsky wrote:
> struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
> {
> for (;;) {
> struct zcomp_strm *zstrm = raw_cpu_ptr(comp->stream);
>
> /*
> * Inspired by zswap
> *
> * stream is returned with ->mutex locked which prevents
> * cpu_dead() from releasing this stream under us, however
> * there is still a race window between raw_cpu_ptr() and
> * mutex_lock(), during which we could have been migrated
> * to a CPU that has already destroyed its stream. If so
> * then unlock and re-try on the current CPU.
> */
> mutex_lock(&zstrm->lock);
> if (likely(zstrm->buffer))
> return zstrm;
> mutex_unlock(&zstrm->lock);
> }
> }
>
> void zcomp_stream_put(struct zcomp_strm *zstrm)
> {
> mutex_unlock(&zstrm->lock);
> }
>
> int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
> {
> struct zcomp *comp = hlist_entry(node, struct zcomp, node);
> struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
>
> mutex_lock(&zstrm->lock);
> zcomp_strm_free(comp, zstrm);
> mutex_unlock(&zstrm->lock);
> return 0;
> }
One downside of this is that it adds a mutex to the locking graph and
limits what zram can do. In particular, we cannot do GFP_NOIO zsmalloc
handle allocations, because NOIO still does reclaim (it just doesn't reach
the block layer), which grabs some locks internally, and this looks a bit
problematic:
zram strm mutex -> zsmalloc GFP_NOIO -> reclaim
vs
reclaim -> zram strm mutex -> zsmalloc
A GFP_NOWAIT allocation has lower success chances.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-07 6:12 ` Sergey Senozhatsky
@ 2025-02-07 21:07 ` Yosry Ahmed
2025-02-08 16:20 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-07 21:07 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Kairui Song, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Fri, Feb 07, 2025 at 03:12:59PM +0900, Sergey Senozhatsky wrote:
> On (25/02/07 11:56), Sergey Senozhatsky wrote:
> > struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
> > {
> > for (;;) {
> > struct zcomp_strm *zstrm = raw_cpu_ptr(comp->stream);
> >
> > /*
> > * Inspired by zswap
> > *
> > * stream is returned with ->mutex locked which prevents
> > * cpu_dead() from releasing this stream under us, however
> > * there is still a race window between raw_cpu_ptr() and
> > * mutex_lock(), during which we could have been migrated
> > * to a CPU that has already destroyed its stream. If so
> > * then unlock and re-try on the current CPU.
> > */
> > mutex_lock(&zstrm->lock);
> > if (likely(zstrm->buffer))
> > return zstrm;
> > mutex_unlock(&zstrm->lock);
> > }
> > }
> >
> > void zcomp_stream_put(struct zcomp_strm *zstrm)
> > {
> > mutex_unlock(&zstrm->lock);
> > }
> >
> > int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
> > {
> > struct zcomp *comp = hlist_entry(node, struct zcomp, node);
> > struct zcomp_strm *zstrm = per_cpu_ptr(comp->stream, cpu);
> >
> > mutex_lock(&zstrm->lock);
> > zcomp_strm_free(comp, zstrm);
> > mutex_unlock(&zstrm->lock);
> > return 0;
> > }
>
> One downside of this is that it adds a mutex to the locking graph and
> limits what zram can do. In particular, we cannot do GFP_NOIO zsmalloc
> handle allocations, because NOIO still does reclaim (it just doesn't reach
> the block layer), which grabs some locks internally, and this looks a bit
> problematic:
> zram strm mutex -> zsmalloc GFP_NOIO -> reclaim
> vs
> reclaim -> zram strm mutex -> zsmalloc
>
> A GFP_NOWAIT allocation has lower success chances.
I assume this problem is unique to zram and not zswap because zram can
be used with normal IO (and then recurse through reclaim), while zswap
is only reachable through reclaim (which cannot recurse)?
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-07 21:07 ` Yosry Ahmed
@ 2025-02-08 16:20 ` Sergey Senozhatsky
2025-02-08 16:41 ` Sergey Senozhatsky
2025-02-09 6:22 ` Sergey Senozhatsky
0 siblings, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-08 16:20 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Kairui Song, Andrew Morton, Minchan Kim,
linux-mm, linux-kernel
On (25/02/07 21:07), Yosry Ahmed wrote:
> I assume this problem is unique to zram and not zswap because zram can
> be used with normal IO (and then recurse through reclaim), while zswap
> is only reachable through reclaim (which cannot recurse)?
I think I figured it out. It appears that the problem was in the lockdep
class key. Both in zram and in zsmalloc I made the keys static:
static void zram_slot_lock_init(struct zram *zram, u32 index)
{
#ifdef CONFIG_DEBUG_LOCK_ALLOC
static struct lock_class_key key;
lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
&key, 0);
#endif
}
Which would put the locks into the same class from lockdep's point of
view. And that means that chains of locks from zram0 (mounted ext4)
and chains of locks from zram1 (swap device) would interleave, leading
to reports that made no sense. For example, ext4 writeback, blkdev_read and
handle_mm_fault->do_swap_page() would be parts of the same lock chain *.
So I moved lockdep class keys to per-zram device and per-zsmalloc pool
to separate the lockdep chains. Looks like that did the trick.
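A minimal sketch of the per-device key direction (the lock_class member on
struct zram is hypothetical here; if the key lives in dynamically allocated
memory it also needs lockdep_register_key()/lockdep_unregister_key() around
device creation/removal):

static void zram_slot_lock_init(struct zram *zram, u32 index)
{
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	/*
	 * One lock class per device (lock_class is assumed to be a
	 * struct lock_class_key member added to struct zram), so the
	 * entry-lock chains of different zram devices no longer
	 * interleave in lockdep's eyes.
	 */
	lockdep_init_map(&zram->table[index].lockdep_map, "zram-entry->lock",
			 &zram->lock_class, 0);
#endif
}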
*
[ 1714.787676] [ T172] ======================================================
[ 1714.788905] [ T172] WARNING: possible circular locking dependency detected
[ 1714.790114] [ T172] 6.14.0-rc1-next-20250207+ #936 Not tainted
[ 1714.791150] [ T172] ------------------------------------------------------
[ 1714.792356] [ T172] kworker/u96:4/172 is trying to acquire lock:
[ 1714.793421] [ T172] ffff888114cf0598 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: page_vma_mapped_walk+0x5c0/0x960
[ 1714.795174] [ T172]
but task is already holding lock:
[ 1714.796453] [ T172] ffffe8ffff981cf8 (&zstrm->lock){+.+.}-{4:4}, at: zcomp_stream_get+0x20/0x40 [zram]
[ 1714.798098] [ T172]
which lock already depends on the new lock.
[ 1714.799901] [ T172] the existing dependency chain (in reverse order) is:
[ 1714.801469] [ T172]
-> #3 (&zstrm->lock){+.+.}-{4:4}:
[ 1714.802750] [ T172] lock_acquire.part.0+0x63/0x1a0
[ 1714.803712] [ T172] __mutex_lock+0xaa/0xd40
[ 1714.804574] [ T172] zcomp_stream_get+0x20/0x40 [zram]
[ 1714.805578] [ T172] zram_read_from_zspool+0x84/0x140 [zram]
[ 1714.806673] [ T172] zram_bio_read+0x56/0x2c0 [zram]
[ 1714.807641] [ T172] __submit_bio+0x12d/0x1c0
[ 1714.808511] [ T172] __submit_bio_noacct+0x7f/0x200
[ 1714.809468] [ T172] mpage_readahead+0xdd/0x110
[ 1714.810360] [ T172] read_pages+0x7a/0x1b0
[ 1714.811182] [ T172] page_cache_ra_unbounded+0x19a/0x210
[ 1714.812215] [ T172] force_page_cache_ra+0x92/0xb0
[ 1714.813161] [ T172] filemap_get_pages+0x11f/0x440
[ 1714.814098] [ T172] filemap_read+0xf6/0x400
[ 1714.814945] [ T172] blkdev_read_iter+0x66/0x130
[ 1714.815860] [ T172] vfs_read+0x266/0x370
[ 1714.816674] [ T172] ksys_read+0x66/0xe0
[ 1714.817477] [ T172] do_syscall_64+0x64/0x130
[ 1714.818344] [ T172] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1714.819444] [ T172]
-> #2 (zram-entry->lock){+.+.}-{0:0}:
[ 1714.820769] [ T172] lock_acquire.part.0+0x63/0x1a0
[ 1714.821734] [ T172] zram_slot_free_notify+0x5c/0x80 [zram]
[ 1714.822811] [ T172] swap_entry_range_free+0x115/0x1a0
[ 1714.823812] [ T172] cluster_swap_free_nr+0xb9/0x150
[ 1714.824787] [ T172] do_swap_page+0x80d/0xea0
[ 1714.825661] [ T172] __handle_mm_fault+0x538/0x7a0
[ 1714.826592] [ T172] handle_mm_fault+0xdf/0x240
[ 1714.827485] [ T172] do_user_addr_fault+0x152/0x700
[ 1714.828432] [ T172] exc_page_fault+0x66/0x1f0
[ 1714.829317] [ T172] asm_exc_page_fault+0x22/0x30
[ 1714.830235] [ T172] do_sys_poll+0x213/0x260
[ 1714.831090] [ T172] __x64_sys_poll+0x44/0x190
[ 1714.831972] [ T172] do_syscall_64+0x64/0x130
[ 1714.832846] [ T172] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1714.833949] [ T172]
-> #1 (&cluster_info[i].lock){+.+.}-{3:3}:
[ 1714.835354] [ T172] lock_acquire.part.0+0x63/0x1a0
[ 1714.836307] [ T172] _raw_spin_lock+0x2c/0x40
[ 1714.837194] [ T172] __swap_duplicate+0x5e/0x150
[ 1714.838123] [ T172] swap_duplicate+0x1c/0x40
[ 1714.838980] [ T172] try_to_unmap_one+0x6c4/0xd60
[ 1714.839901] [ T172] rmap_walk_anon+0xe7/0x210
[ 1714.840774] [ T172] try_to_unmap+0x76/0x80
[ 1714.841613] [ T172] shrink_folio_list+0x487/0xad0
[ 1714.842546] [ T172] evict_folios+0x247/0x800
[ 1714.843404] [ T172] try_to_shrink_lruvec+0x1cd/0x2b0
[ 1714.844382] [ T172] lru_gen_shrink_node+0xc3/0x190
[ 1714.845335] [ T172] do_try_to_free_pages+0xee/0x4b0
[ 1714.846292] [ T172] try_to_free_pages+0xea/0x280
[ 1714.847208] [ T172] __alloc_pages_slowpath.constprop.0+0x296/0x970
[ 1714.848391] [ T172] __alloc_frozen_pages_noprof+0x2b3/0x300
[ 1714.849475] [ T172] __folio_alloc_noprof+0x10/0x30
[ 1714.850422] [ T172] do_anonymous_page+0x69/0x4b0
[ 1714.851337] [ T172] __handle_mm_fault+0x557/0x7a0
[ 1714.852265] [ T172] handle_mm_fault+0xdf/0x240
[ 1714.853153] [ T172] do_user_addr_fault+0x152/0x700
[ 1714.854099] [ T172] exc_page_fault+0x66/0x1f0
[ 1714.854976] [ T172] asm_exc_page_fault+0x22/0x30
[ 1714.855897] [ T172] rep_movs_alternative+0x3a/0x60
[ 1714.856851] [ T172] _copy_to_iter+0xe2/0x7a0
[ 1714.857719] [ T172] get_random_bytes_user+0x95/0x150
[ 1714.858712] [ T172] vfs_read+0x266/0x370
[ 1714.859512] [ T172] ksys_read+0x66/0xe0
[ 1714.860301] [ T172] do_syscall_64+0x64/0x130
[ 1714.861167] [ T172] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1714.862270] [ T172]
-> #0 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}:
[ 1714.863656] [ T172] check_prev_add+0xeb/0xca0
[ 1714.864532] [ T172] __lock_acquire+0xf56/0x12c0
[ 1714.865446] [ T172] lock_acquire.part.0+0x63/0x1a0
[ 1714.866399] [ T172] _raw_spin_lock+0x2c/0x40
[ 1714.867258] [ T172] page_vma_mapped_walk+0x5c0/0x960
[ 1714.868235] [ T172] folio_referenced_one+0xd0/0x4a0
[ 1714.869205] [ T172] __rmap_walk_file+0xbe/0x1b0
[ 1714.870119] [ T172] folio_referenced+0x10b/0x140
[ 1714.871039] [ T172] shrink_folio_list+0x72c/0xad0
[ 1714.871975] [ T172] evict_folios+0x247/0x800
[ 1714.872851] [ T172] try_to_shrink_lruvec+0x1cd/0x2b0
[ 1714.873842] [ T172] lru_gen_shrink_node+0xc3/0x190
[ 1714.874806] [ T172] do_try_to_free_pages+0xee/0x4b0
[ 1714.875779] [ T172] try_to_free_pages+0xea/0x280
[ 1714.876699] [ T172] __alloc_pages_slowpath.constprop.0+0x296/0x970
[ 1714.877897] [ T172] __alloc_frozen_pages_noprof+0x2b3/0x300
[ 1714.878977] [ T172] __alloc_pages_noprof+0xa/0x20
[ 1714.879907] [ T172] alloc_zspage+0xe6/0x2c0 [zsmalloc]
[ 1714.880924] [ T172] zs_malloc+0xd2/0x2b0 [zsmalloc]
[ 1714.881881] [ T172] zram_write_page+0xfc/0x300 [zram]
[ 1714.882873] [ T172] zram_bio_write+0xd1/0x1c0 [zram]
[ 1714.883845] [ T172] __submit_bio+0x12d/0x1c0
[ 1714.884712] [ T172] __submit_bio_noacct+0x7f/0x200
[ 1714.885667] [ T172] ext4_io_submit+0x20/0x40
[ 1714.886532] [ T172] ext4_do_writepages+0x3e3/0x8b0
[ 1714.887482] [ T172] ext4_writepages+0xe8/0x280
[ 1714.888377] [ T172] do_writepages+0xcf/0x260
[ 1714.889247] [ T172] __writeback_single_inode+0x56/0x350
[ 1714.890273] [ T172] writeback_sb_inodes+0x227/0x550
[ 1714.891239] [ T172] __writeback_inodes_wb+0x4c/0xe0
[ 1714.892202] [ T172] wb_writeback+0x2f2/0x3f0
[ 1714.893071] [ T172] wb_do_writeback+0x227/0x2a0
[ 1714.893976] [ T172] wb_workfn+0x56/0x1b0
[ 1714.894777] [ T172] process_one_work+0x1eb/0x570
[ 1714.895698] [ T172] worker_thread+0x1d1/0x3b0
[ 1714.896571] [ T172] kthread+0xf9/0x200
[ 1714.897356] [ T172] ret_from_fork+0x2d/0x50
[ 1714.898214] [ T172] ret_from_fork_asm+0x11/0x20
[ 1714.899142] [ T172]
other info that might help us debug this:
[ 1714.900906] [ T172] Chain exists of:
ptlock_ptr(ptdesc)#2 --> zram-entry->lock --> &zstrm->lock
[ 1714.903183] [ T172] Possible unsafe locking scenario:
[ 1714.904463] [ T172]        CPU0                    CPU1
[ 1714.905380] [ T172]        ----                    ----
[ 1714.906293] [ T172]   lock(&zstrm->lock);
[ 1714.907006] [ T172]                                lock(zram-entry->lock);
[ 1714.908204] [ T172]                                lock(&zstrm->lock);
[ 1714.909347] [ T172]   lock(ptlock_ptr(ptdesc)#2);
[ 1714.910179] [ T172]
*** DEADLOCK ***
[ 1714.911570] [ T172] 7 locks held by kworker/u96:4/172:
[ 1714.912472] [ T172] #0: ffff88810165d548 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x433/0x570
[ 1714.914273] [ T172] #1: ffffc90000683e40 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x1ad/0x570
[ 1714.916339] [ T172] #2: ffff88810b93d0e0 (&type->s_umount_key#28){++++}-{4:4}, at: super_trylock_shared+0x16/0x50
[ 1714.918141] [ T172] #3: ffff88810b93ab50 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xcf/0x260
[ 1714.919877] [ T172] #4: ffffe8ffff981cf8 (&zstrm->lock){+.+.}-{4:4}, at: zcomp_stream_get+0x20/0x40 [zram]
[ 1714.921573] [ T172] #5: ffff888106809900 (&mapping->i_mmap_rwsem){++++}-{4:4}, at: __rmap_walk_file+0x161/0x1b0
[ 1714.923347] [ T172] #6: ffffffff82347d40 (rcu_read_lock){....}-{1:3}, at: ___pte_offset_map+0x26/0x1b0
[ 1714.924981] [ T172]
stack backtrace:
[ 1714.925998] [ T172] CPU: 6 UID: 0 PID: 172 Comm: kworker/u96:4 Not tainted 6.14.0-rc1-next-20250207+ #936
[ 1714.926005] [ T172] Workqueue: writeback wb_workfn (flush-251:0)
[ 1714.926009] [ T172] Call Trace:
[ 1714.926013] [ T172] <TASK>
[ 1714.926015] [ T172] dump_stack_lvl+0x57/0x80
[ 1714.926018] [ T172] print_circular_bug.cold+0x38/0x45
[ 1714.926021] [ T172] check_noncircular+0x12e/0x150
[ 1714.926025] [ T172] check_prev_add+0xeb/0xca0
[ 1714.926027] [ T172] ? add_chain_cache+0x10c/0x480
[ 1714.926029] [ T172] __lock_acquire+0xf56/0x12c0
[ 1714.926032] [ T172] lock_acquire.part.0+0x63/0x1a0
[ 1714.926035] [ T172] ? page_vma_mapped_walk+0x5c0/0x960
[ 1714.926036] [ T172] ? page_vma_mapped_walk+0x5c0/0x960
[ 1714.926037] [ T172] _raw_spin_lock+0x2c/0x40
[ 1714.926040] [ T172] ? page_vma_mapped_walk+0x5c0/0x960
[ 1714.926041] [ T172] page_vma_mapped_walk+0x5c0/0x960
[ 1714.926043] [ T172] folio_referenced_one+0xd0/0x4a0
[ 1714.926046] [ T172] __rmap_walk_file+0xbe/0x1b0
[ 1714.926047] [ T172] folio_referenced+0x10b/0x140
[ 1714.926050] [ T172] ? page_mkclean_one+0xc0/0xc0
[ 1714.926051] [ T172] ? folio_get_anon_vma+0x220/0x220
[ 1714.926052] [ T172] ? __traceiter_remove_migration_pte+0x50/0x50
[ 1714.926054] [ T172] shrink_folio_list+0x72c/0xad0
[ 1714.926060] [ T172] evict_folios+0x247/0x800
[ 1714.926064] [ T172] try_to_shrink_lruvec+0x1cd/0x2b0
[ 1714.926066] [ T172] lru_gen_shrink_node+0xc3/0x190
[ 1714.926068] [ T172] ? mark_usage+0x61/0x110
[ 1714.926071] [ T172] do_try_to_free_pages+0xee/0x4b0
[ 1714.926073] [ T172] try_to_free_pages+0xea/0x280
[ 1714.926077] [ T172] __alloc_pages_slowpath.constprop.0+0x296/0x970
[ 1714.926079] [ T172] ? __lock_acquire+0x3d1/0x12c0
[ 1714.926081] [ T172] ? get_page_from_freelist+0xd9/0x680
[ 1714.926083] [ T172] ? match_held_lock+0x30/0xa0
[ 1714.926085] [ T172] __alloc_frozen_pages_noprof+0x2b3/0x300
[ 1714.926088] [ T172] __alloc_pages_noprof+0xa/0x20
[ 1714.926090] [ T172] alloc_zspage+0xe6/0x2c0 [zsmalloc]
[ 1714.926092] [ T172] ? zs_malloc+0xc5/0x2b0 [zsmalloc]
[ 1714.926094] [ T172] ? __lock_release.isra.0+0x5e/0x180
[ 1714.926096] [ T172] zs_malloc+0xd2/0x2b0 [zsmalloc]
[ 1714.926099] [ T172] zram_write_page+0xfc/0x300 [zram]
[ 1714.926102] [ T172] zram_bio_write+0xd1/0x1c0 [zram]
[ 1714.926105] [ T172] __submit_bio+0x12d/0x1c0
[ 1714.926107] [ T172] ? jbd2_journal_stop+0x145/0x320
[ 1714.926109] [ T172] ? kmem_cache_free+0xb5/0x3e0
[ 1714.926112] [ T172] ? lock_release+0x6b/0x130
[ 1714.926115] [ T172] ? __submit_bio_noacct+0x7f/0x200
[ 1714.926116] [ T172] __submit_bio_noacct+0x7f/0x200
[ 1714.926118] [ T172] ext4_io_submit+0x20/0x40
[ 1714.926120] [ T172] ext4_do_writepages+0x3e3/0x8b0
[ 1714.926122] [ T172] ? lock_acquire.part.0+0x63/0x1a0
[ 1714.926124] [ T172] ? do_writepages+0xcf/0x260
[ 1714.926127] [ T172] ? ext4_writepages+0xe8/0x280
[ 1714.926128] [ T172] ext4_writepages+0xe8/0x280
[ 1714.926130] [ T172] do_writepages+0xcf/0x260
[ 1714.926133] [ T172] ? find_held_lock+0x2b/0x80
[ 1714.926134] [ T172] ? writeback_sb_inodes+0x1b8/0x550
[ 1714.926136] [ T172] __writeback_single_inode+0x56/0x350
[ 1714.926138] [ T172] writeback_sb_inodes+0x227/0x550
[ 1714.926143] [ T172] __writeback_inodes_wb+0x4c/0xe0
[ 1714.926145] [ T172] wb_writeback+0x2f2/0x3f0
[ 1714.926147] [ T172] wb_do_writeback+0x227/0x2a0
[ 1714.926150] [ T172] wb_workfn+0x56/0x1b0
[ 1714.926151] [ T172] process_one_work+0x1eb/0x570
[ 1714.926154] [ T172] worker_thread+0x1d1/0x3b0
[ 1714.926157] [ T172] ? bh_worker+0x250/0x250
[ 1714.926159] [ T172] kthread+0xf9/0x200
[ 1714.926161] [ T172] ? kthread_fetch_affinity.isra.0+0x40/0x40
[ 1714.926163] [ T172] ret_from_fork+0x2d/0x50
[ 1714.926165] [ T172] ? kthread_fetch_affinity.isra.0+0x40/0x40
[ 1714.926166] [ T172] ret_from_fork_asm+0x11/0x20
[ 1714.926170] [ T172] </TASK>
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-08 16:20 ` Sergey Senozhatsky
@ 2025-02-08 16:41 ` Sergey Senozhatsky
2025-02-09 6:22 ` Sergey Senozhatsky
1 sibling, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-08 16:41 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Kairui Song, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel
On (25/02/09 01:20), Sergey Senozhatsky wrote:
> So I moved lockdep class keys to per-zram device and per-zsmalloc pool
> to separate the lockdep chains. Looks like that did the trick.
>
>
[..]
>
> [ 1714.900906] [ T172] Chain exists of:
> ptlock_ptr(ptdesc)#2 --> zram-entry->lock --> &zstrm->lock
>
> [ 1714.903183] [ T172] Possible unsafe locking scenario:
>
> [ 1714.904463] [ T172]        CPU0                    CPU1
> [ 1714.905380] [ T172]        ----                    ----
> [ 1714.906293] [ T172]   lock(&zstrm->lock);
> [ 1714.907006] [ T172]                                lock(zram-entry->lock);
> [ 1714.908204] [ T172]                                lock(&zstrm->lock);
> [ 1714.909347] [ T172]   lock(ptlock_ptr(ptdesc)#2);
> [ 1714.910179] [ T172]
> *** DEADLOCK ***
Actually, let me look at this more. Maybe I haven't figured it out yet.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-08 16:20 ` Sergey Senozhatsky
2025-02-08 16:41 ` Sergey Senozhatsky
@ 2025-02-09 6:22 ` Sergey Senozhatsky
2025-02-09 7:42 ` Sergey Senozhatsky
1 sibling, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-09 6:22 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Kairui Song, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel
On (25/02/09 01:20), Sergey Senozhatsky wrote:
> So I moved lockdep class keys to per-zram device and per-zsmalloc pool
> to separate the lockdep chains. Looks like that did the trick.
Also need to indicate "try lock":
drivers/block/zram/zram_drv.c
@@ -86,7 +86,7 @@ static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
- mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
+ mutex_acquire(&zram->table[index].lockdep_map, 0, 1, _RET_IP_);
#endif
return true;
}
and
mm/zsmalloc.c
@@ -388,7 +388,7 @@ static __must_check bool zspage_try_write_lock(struct zspage *zspage)
preempt_disable();
if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
- rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
+ rwsem_acquire(&zspage->lockdep_map, 0, 1, _RET_IP_);
#endif
return true;
}
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCHv4 02/17] zram: do not use per-CPU compression streams
2025-02-09 6:22 ` Sergey Senozhatsky
@ 2025-02-09 7:42 ` Sergey Senozhatsky
0 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-09 7:42 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Yosry Ahmed, Kairui Song, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel
On (25/02/09 15:22), Sergey Senozhatsky wrote:
> On (25/02/09 01:20), Sergey Senozhatsky wrote:
> > So I moved lockdep class keys to per-zram device and per-zsmalloc pool
> > to separate the lockdep chains. Looks like that did the trick.
>
> Also need to indicate "try lock":
>
> drivers/block/zram/zram_drv.c
> @@ -86,7 +86,7 @@ static __must_check bool zram_slot_try_lock(struct zram *zram, u32 index)
>
> if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> - mutex_acquire(&zram->table[index].lockdep_map, 0, 0, _RET_IP_);
> + mutex_acquire(&zram->table[index].lockdep_map, 0, 1, _RET_IP_);
> #endif
> return true;
> }
>
> and
>
> mm/zsmalloc.c
> @@ -388,7 +388,7 @@ static __must_check bool zspage_try_write_lock(struct zspage *zspage)
> preempt_disable();
> if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> - rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
> + rwsem_acquire(&zspage->lockdep_map, 0, 1, _RET_IP_);
> #endif
> return true;
> }
I guess this was the point lockdep was making.
lockdep knows about strm->lock -> shrink_folio_list, which goes to ptlock
via folio_referenced and to the cluster_info lock via try_to_unmap. Then lockdep
knows about zram entry->lock -> strm->lock and, most importantly, lockdep
knows about cluster_info lock -> zram_slot_free_notify() -> zram-entry->lock.
What lockdep doesn't know is that zram_slot_free_notify() is a trylock:
we don't re-enter zram unconditionally.
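For context, the third argument of these annotations is lockdep's trylock flag
(paraphrased from include/linux/lockdep.h - check the exact definitions in your
tree):

/*
 * The 't' argument is the trylock flag; passing 1 tells lockdep that
 * the acquisition cannot block, so it is not treated as a potential
 * deadlock edge.
 */
#define lock_acquire_exclusive(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
#define mutex_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
#define rwsem_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)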
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 03/17] zram: remove crypto include
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 01/17] zram: switch to non-atomic entry locking Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 02/17] zram: do not use per-CPU compression streams Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 04/17] zram: remove max_comp_streams device attr Sergey Senozhatsky
` (13 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Remove a leftover crypto header include.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 1 -
drivers/block/zram/zram_drv.c | 4 +++-
drivers/block/zram/zram_drv.h | 1 -
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 982c769d5831..efd5919808d9 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -7,7 +7,6 @@
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
-#include <linux/crypto.h>
#include <linux/vmalloc.h>
#include "zcomp.h"
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8d5974ea8ff8..6239fcc340b6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -43,6 +43,8 @@ static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+#define ZRAM_MAX_ALGO_NAME_SZ 64
+
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
@@ -1141,7 +1143,7 @@ static int __comp_algorithm_store(struct zram *zram, u32 prio, const char *buf)
size_t sz;
sz = strlen(buf);
- if (sz >= CRYPTO_MAX_ALG_NAME)
+ if (sz >= ZRAM_MAX_ALGO_NAME_SZ)
return -E2BIG;
compressor = kstrdup(buf, GFP_KERNEL);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index e20538cdf565..3ae2988090b3 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -17,7 +17,6 @@
#include <linux/rwsem.h>
#include <linux/zsmalloc.h>
-#include <linux/crypto.h>
#include "zcomp.h"
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 04/17] zram: remove max_comp_streams device attr
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (2 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 03/17] zram: remove crypto include Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 05/17] zram: remove two-staged handle allocation Sergey Senozhatsky
` (12 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
The max_comp_streams device attribute has been defunct since
May 2016, when zram switched to per-CPU compression streams;
remove it.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
Documentation/ABI/testing/sysfs-block-zram | 8 -----
Documentation/admin-guide/blockdev/zram.rst | 36 ++++++---------------
drivers/block/zram/zram_drv.c | 23 -------------
3 files changed, 10 insertions(+), 57 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 1ef69e0271f9..36c57de0a10a 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -22,14 +22,6 @@ Description:
device. The reset operation frees all the memory associated
with this device.
-What: /sys/block/zram<id>/max_comp_streams
-Date: February 2014
-Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
-Description:
- The max_comp_streams file is read-write and specifies the
- number of backend's zcomp_strm compression streams (number of
- concurrent compress operations).
-
What: /sys/block/zram<id>/comp_algorithm
Date: February 2014
Contact: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 1576fb93f06c..9bdb30901a93 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -54,7 +54,7 @@ The list of possible return codes:
If you use 'echo', the returned value is set by the 'echo' utility,
and, in general case, something like::
- echo 3 > /sys/block/zram0/max_comp_streams
+ echo foo > /sys/block/zram0/comp_algorithm
if [ $? -ne 0 ]; then
handle_error
fi
@@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3}
num_devices parameter is optional and tells zram how many devices should be
pre-created. Default: 1.
-2) Set max number of compression streams
-========================================
-
-Regardless of the value passed to this attribute, ZRAM will always
-allocate multiple compression streams - one per online CPU - thus
-allowing several concurrent compression operations. The number of
-allocated compression streams goes down when some of the CPUs
-become offline. There is no single-compression-stream mode anymore,
-unless you are running a UP system or have only 1 CPU online.
-
-To find out how many streams are currently available::
-
- cat /sys/block/zram0/max_comp_streams
-
-3) Select compression algorithm
+2) Select compression algorithm
===============================
Using comp_algorithm device attribute one can see available and
@@ -107,7 +93,7 @@ Examples::
For the time being, the `comp_algorithm` content shows only compression
algorithms that are supported by zram.
-4) Set compression algorithm parameters: Optional
+3) Set compression algorithm parameters: Optional
=================================================
Compression algorithms may support specific parameters which can be
@@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some
algorithms), for other algorithms `level` is acceleration level (the higher
the value the lower the compression ratio).
-5) Set Disksize
+4) Set Disksize
===============
Set disk size by writing the value to sysfs node 'disksize'.
@@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
-6) Set memory limit: Optional
+5) Set memory limit: Optional
=============================
Set memory limit by writing the value to sysfs node 'mem_limit'.
@@ -177,7 +163,7 @@ Examples::
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
-7) Activate
+6) Activate
===========
::
@@ -188,7 +174,7 @@ Examples::
mkfs.ext4 /dev/zram1
mount /dev/zram1 /tmp
-8) Add/remove zram devices
+7) Add/remove zram devices
==========================
zram provides a control interface, which enables dynamic (on-demand) device
@@ -208,7 +194,7 @@ execute::
echo X > /sys/class/zram-control/hot_remove
-9) Stats
+8) Stats
========
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
@@ -228,8 +214,6 @@ mem_limit WO specifies the maximum amount of memory ZRAM can
writeback_limit WO specifies the maximum amount of write IO zram
can write out to backing device as 4KB unit
writeback_limit_enable RW show and set writeback_limit feature
-max_comp_streams RW the number of possible concurrent compress
- operations
comp_algorithm RW show and change the compression algorithm
algorithm_params WO setup compression algorithm parameters
compact WO trigger memory compaction
@@ -310,7 +294,7 @@ a single line of text and contains the following stats separated by whitespace:
Unit: 4K bytes
============== =============================================================
-10) Deactivate
+9) Deactivate
==============
::
@@ -318,7 +302,7 @@ a single line of text and contains the following stats separated by whitespace:
swapoff /dev/zram0
umount /dev/zram1
-11) Reset
+10) Reset
=========
Write any positive value to 'reset' sysfs node::
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6239fcc340b6..dd987e3942c7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1096,27 +1096,6 @@ static void zram_debugfs_register(struct zram *zram) {};
static void zram_debugfs_unregister(struct zram *zram) {};
#endif
-/*
- * We switched to per-cpu streams and this attr is not needed anymore.
- * However, we will keep it around for some time, because:
- * a) we may revert per-cpu streams in the future
- * b) it's visible to user space and we need to follow our 2 years
- * retirement rule; but we already have a number of 'soon to be
- * altered' attrs, so max_comp_streams need to wait for the next
- * layoff cycle.
- */
-static ssize_t max_comp_streams_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- return scnprintf(buf, PAGE_SIZE, "%d\n", num_online_cpus());
-}
-
-static ssize_t max_comp_streams_store(struct device *dev,
- struct device_attribute *attr, const char *buf, size_t len)
-{
- return len;
-}
-
static void comp_algorithm_set(struct zram *zram, u32 prio, const char *alg)
{
/* Do not free statically defined compression algorithms */
@@ -2533,7 +2512,6 @@ static DEVICE_ATTR_WO(reset);
static DEVICE_ATTR_WO(mem_limit);
static DEVICE_ATTR_WO(mem_used_max);
static DEVICE_ATTR_WO(idle);
-static DEVICE_ATTR_RW(max_comp_streams);
static DEVICE_ATTR_RW(comp_algorithm);
#ifdef CONFIG_ZRAM_WRITEBACK
static DEVICE_ATTR_RW(backing_dev);
@@ -2555,7 +2533,6 @@ static struct attribute *zram_disk_attrs[] = {
&dev_attr_mem_limit.attr,
&dev_attr_mem_used_max.attr,
&dev_attr_idle.attr,
- &dev_attr_max_comp_streams.attr,
&dev_attr_comp_algorithm.attr,
#ifdef CONFIG_ZRAM_WRITEBACK
&dev_attr_backing_dev.attr,
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 05/17] zram: remove two-staged handle allocation
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (3 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 04/17] zram: remove max_comp_streams device attr Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 06/17] zram: permit reclaim in zstd custom allocator Sergey Senozhatsky
` (11 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Previously zram write() was atomic, which required us to pass
__GFP_KSWAPD_RECLAIM to the zsmalloc handle allocation on the fast
path and attempt a slow-path allocation (with recompression)
when the fast path failed.
Since it's not atomic anymore, we can permit direct reclaim
during allocation, remove the fast allocation path and also
drop the recompression path (which should reduce CPU/battery
usage).
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 41 +++++------------------------------
1 file changed, 6 insertions(+), 35 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dd987e3942c7..0404f5e35cb4 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1716,11 +1716,11 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
static int zram_write_page(struct zram *zram, struct page *page, u32 index)
{
int ret = 0;
- unsigned long handle = -ENOMEM;
- unsigned int comp_len = 0;
+ unsigned long handle;
+ unsigned int comp_len;
void *dst, *mem;
struct zcomp_strm *zstrm;
- unsigned long element = 0;
+ unsigned long element;
bool same_filled;
/* First, free memory allocated to this slot (if any) */
@@ -1734,7 +1734,6 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
if (same_filled)
return write_same_filled_page(zram, element, index);
-compress_again:
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
mem = kmap_local_page(page);
ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
@@ -1743,8 +1742,6 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
if (unlikely(ret)) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
- pr_err("Compression failed! err=%d\n", ret);
- zs_free(zram->mem_pool, handle);
return ret;
}
@@ -1753,36 +1750,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
return write_incompressible_page(zram, page, index);
}
- /*
- * handle allocation has 2 paths:
- * a) fast path is executed with preemption disabled (for
- * per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
- * since we can't sleep;
- * b) slow path enables preemption and attempts to allocate
- * the page with __GFP_DIRECT_RECLAIM bit set. we have to
- * put per-cpu compression stream and, thus, to re-do
- * the compression once handle is allocated.
- *
- * if we have a 'non-null' handle here then we are coming
- * from the slow path and handle has already been allocated.
- */
+ handle = zs_malloc(zram->mem_pool, comp_len,
+ GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
if (IS_ERR_VALUE(handle))
- handle = zs_malloc(zram->mem_pool, comp_len,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
- if (IS_ERR_VALUE(handle)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
- atomic64_inc(&zram->stats.writestall);
- handle = zs_malloc(zram->mem_pool, comp_len,
- GFP_NOIO | __GFP_HIGHMEM |
- __GFP_MOVABLE);
- if (IS_ERR_VALUE(handle))
- return PTR_ERR((void *)handle);
-
- goto compress_again;
- }
+ return PTR_ERR((void *)handle);
if (!zram_can_store_page(zram)) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 06/17] zram: permit reclaim in zstd custom allocator
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (4 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 05/17] zram: remove two-staged handle allocation Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 07/17] zram: permit reclaim in recompression handle allocation Sergey Senozhatsky
` (10 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
When configured with pre-trained compression/decompression
dictionary support, zstd requires a custom memory allocator,
which it calls internally from its compression()/decompression()
routines. This was a tad problematic, because that would
mean allocation from atomic context (either under the entry
spin-lock, or a per-CPU local-lock, or both). Now, with
non-atomic zram write(), those limitations are relaxed and
we can allow direct and indirect reclaim during allocations.
The tricky part is the zram read() path, which is still atomic in
one particular case (read_compressed_page()), due to zsmalloc's
handling of object mapping. However, in zram, in order to read()
something one has to write() it first, and write() is when zstd
allocates the required internal state memory, and the write() path is
non-atomic. Because of this write()-time allocation, in theory, zstd
should not call its allocator from the atomic read() path. Keep
the non-preemptible branch, just in case zstd allocates memory
from read(), but WARN_ON_ONCE() if it happens.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/backend_zstd.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c
index 1184c0036f44..53431251ea62 100644
--- a/drivers/block/zram/backend_zstd.c
+++ b/drivers/block/zram/backend_zstd.c
@@ -24,19 +24,14 @@ struct zstd_params {
/*
* For C/D dictionaries we need to provide zstd with zstd_custom_mem,
* which zstd uses internally to allocate/free memory when needed.
- *
- * This means that allocator.customAlloc() can be called from zcomp_compress()
- * under local-lock (per-CPU compression stream), in which case we must use
- * GFP_ATOMIC.
- *
- * Another complication here is that we can be configured as a swap device.
*/
static void *zstd_custom_alloc(void *opaque, size_t size)
{
- if (!preemptible())
+ /* Technically this should not happen */
+ if (WARN_ON_ONCE(!preemptible()))
return kvzalloc(size, GFP_ATOMIC);
- return kvzalloc(size, __GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
+ return kvzalloc(size, GFP_NOIO | __GFP_NOWARN);
}
static void zstd_custom_free(void *opaque, void *address)
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 07/17] zram: permit reclaim in recompression handle allocation
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (5 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 06/17] zram: permit reclaim in zstd custom allocator Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 08/17] zram: remove writestall zram_stats member Sergey Senozhatsky
` (9 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
The recompression path can now permit direct reclaim during
new zs_handle allocation, because it's not atomic anymore.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0404f5e35cb4..33a7bfa53861 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1979,17 +1979,11 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
return 0;
/*
- * No direct reclaim (slow path) for handle allocation and no
- * re-compression attempt (unlike in zram_write_bvec()) since
- * we already have stored that object in zsmalloc. If we cannot
- * alloc memory for recompressed object then we bail out and
- * simply keep the old (existing) object in zsmalloc.
+ * If we cannot alloc memory for recompressed object then we bail out
+ * and simply keep the old (existing) object in zsmalloc.
*/
handle_new = zs_malloc(zram->mem_pool, comp_len_new,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
+ GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
if (IS_ERR_VALUE(handle_new)) {
zcomp_stream_put(zram->comps[prio], zstrm);
return PTR_ERR((void *)handle_new);
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 08/17] zram: remove writestall zram_stats member
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (6 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 07/17] zram: permit reclaim in recompression handle allocation Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 09/17] zram: limit max recompress prio to num_active_comps Sergey Senozhatsky
` (8 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
There is no zsmalloc handle allocation slow path now and
writestall is not possible any longer. Remove it from
zram_stats.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 3 +--
drivers/block/zram/zram_drv.h | 1 -
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 33a7bfa53861..35fca4c468a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1430,9 +1430,8 @@ static ssize_t debug_stat_show(struct device *dev,
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
- "version: %d\n%8llu %8llu\n",
+ "version: %d\n0 %8llu\n",
version,
- (u64)atomic64_read(&zram->stats.writestall),
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 3ae2988090b3..219d405fc26e 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -79,7 +79,6 @@ struct zram_stats {
atomic64_t huge_pages_since; /* no. of huge pages since zram set up */
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
- atomic64_t writestall; /* no. of write slow paths */
atomic64_t miss_free; /* no. of missed free */
#ifdef CONFIG_ZRAM_WRITEBACK
atomic64_t bd_count; /* no. of pages in backing device */
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 09/17] zram: limit max recompress prio to num_active_comps
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (7 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 08/17] zram: remove writestall zram_stats member Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 10/17] zram: filter out recomp targets based on priority Sergey Senozhatsky
` (7 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Use the actual number of algorithms zram was configured with
instead of the theoretical limit of ZRAM_MAX_COMPS.
Also make sure that the min prio is not above the max prio.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 35fca4c468a7..c500ace0d02f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2009,16 +2009,19 @@ static ssize_t recompress_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
- u32 prio = ZRAM_SECONDARY_COMP, prio_max = ZRAM_MAX_COMPS;
struct zram *zram = dev_to_zram(dev);
char *args, *param, *val, *algo = NULL;
u64 num_recomp_pages = ULLONG_MAX;
struct zram_pp_ctl *ctl = NULL;
struct zram_pp_slot *pps;
u32 mode = 0, threshold = 0;
+ u32 prio, prio_max;
struct page *page;
ssize_t ret;
+ prio = ZRAM_SECONDARY_COMP;
+ prio_max = zram->num_active_comps;
+
args = skip_spaces(buf);
while (*args) {
args = next_arg(args, ¶m, &val);
@@ -2071,7 +2074,7 @@ static ssize_t recompress_store(struct device *dev,
if (prio == ZRAM_PRIMARY_COMP)
prio = ZRAM_SECONDARY_COMP;
- prio_max = min(prio + 1, ZRAM_MAX_COMPS);
+ prio_max = prio + 1;
continue;
}
}
@@ -2099,7 +2102,7 @@ static ssize_t recompress_store(struct device *dev,
continue;
if (!strcmp(zram->comp_algs[prio], algo)) {
- prio_max = min(prio + 1, ZRAM_MAX_COMPS);
+ prio_max = prio + 1;
found = true;
break;
}
@@ -2111,6 +2114,12 @@ static ssize_t recompress_store(struct device *dev,
}
}
+ prio_max = min(prio_max, (u32)zram->num_active_comps);
+ if (prio >= prio_max) {
+ ret = -EINVAL;
+ goto release_init_lock;
+ }
+
page = alloc_page(GFP_KERNEL);
if (!page) {
ret = -ENOMEM;
--
2.48.1.362.g079036d154-goog
* [PATCHv4 10/17] zram: filter out recomp targets based on priority
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (8 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 09/17] zram: limit max recompress prio to num_active_comps Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 11/17] zram: unlock slot during recompression Sergey Senozhatsky
` (6 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Do not select for post-processing slots that are already
compressed with the same or a higher priority compression
algorithm.
This should save some memory, as previously we would still
put those entries into corresponding post-processing buckets
and filter them out later in recompress_slot().
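For illustration, the scan-time filter boils down to something like the
following (a hypothetical stand-alone helper, not the exact zram code):

	/*
	 * A slot stored by the algorithm with priority slot_prio can only
	 * benefit from recompression if at least one candidate algorithm
	 * has a strictly higher priority, i.e. slot_prio + 1 < prio_max.
	 * E.g. with slot_prio == 2 and prio_max == 3: 2 + 1 >= 3, so the
	 * slot is skipped.
	 */
	static bool slot_wants_recompression(u32 slot_prio, u32 prio_max)
	{
		return slot_prio + 1 < prio_max;
	}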
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index c500ace0d02f..256439361367 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1811,7 +1811,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
#define RECOMPRESS_IDLE (1 << 0)
#define RECOMPRESS_HUGE (1 << 1)
-static int scan_slots_for_recompress(struct zram *zram, u32 mode,
+static int scan_slots_for_recompress(struct zram *zram, u32 mode, u32 prio_max,
struct zram_pp_ctl *ctl)
{
unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
@@ -1843,6 +1843,10 @@ static int scan_slots_for_recompress(struct zram *zram, u32 mode,
zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
goto next;
+ /* Already compressed with same or higher priority */
+ if (zram_get_priority(zram, index) + 1 >= prio_max)
+ goto next;
+
pps->index = index;
place_pp_slot(zram, ctl, pps);
pps = NULL;
@@ -2132,7 +2136,7 @@ static ssize_t recompress_store(struct device *dev,
goto release_init_lock;
}
- scan_slots_for_recompress(zram, mode, ctl);
+ scan_slots_for_recompress(zram, mode, prio_max, ctl);
ret = len;
while ((pps = select_pp_slot(ctl))) {
--
2.48.1.362.g079036d154-goog
* [PATCHv4 11/17] zram: unlock slot during recompression
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (9 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 10/17] zram: filter out recomp targets based on priority Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 12/17] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
` (5 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Recompression, like writeback, makes a local copy of slot data
(we need to decompress it anyway) before post-processing, so we
can unlock the slot entry once we have that local copy.
Unlock the entry write-lock before the recompression loop (secondary
algorithms can be tried out one by one, in order of priority) and
re-acquire it right after the loop.
There is one more potentially costly operation that recompress_slot()
does - new zs_handle allocation, which can schedule(). Release
the slot-entry write-lock before the zsmalloc allocation and grab it
again after the allocation.
In both cases, once the slot lock is re-acquired we examine the slot's
ZRAM_PP_SLOT flag to make sure that the slot has not been modified
by a concurrent operation.
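In pseudo-code, the resulting pattern looks roughly like this (a simplified
sketch, not the exact function body):

	/* slot data has already been copied to a local page */
	zram_slot_write_unlock(zram, index);

	/* sleepable work: secondary compression, zs_malloc(), ... */

	zram_slot_write_lock(zram, index);
	if (!zram_test_flag(zram, index, ZRAM_PP_SLOT)) {
		/* slot was freed/overwritten concurrently, discard the result */
		return 0;
	}
	/* slot is still intact, safe to install the recompressed object */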
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 80 +++++++++++++++++++----------------
1 file changed, 44 insertions(+), 36 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 256439361367..cfbb3072ee9e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1869,14 +1869,13 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
u64 *num_recomp_pages, u32 threshold, u32 prio,
u32 prio_max)
{
- struct zcomp_strm *zstrm = NULL;
+ struct zcomp_strm *zstrm;
unsigned long handle_old;
unsigned long handle_new;
unsigned int comp_len_old;
unsigned int comp_len_new;
unsigned int class_index_old;
unsigned int class_index_new;
- u32 num_recomps = 0;
void *src, *dst;
int ret;
@@ -1903,6 +1902,13 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
zram_clear_flag(zram, index, ZRAM_IDLE);
class_index_old = zs_lookup_class_index(zram->mem_pool, comp_len_old);
+ prio = max(prio, zram_get_priority(zram, index) + 1);
+ /* Slot data copied out - unlock its bucket */
+ zram_slot_write_unlock(zram, index);
+ /* Recompression slots scan takes care of this, but just in case */
+ if (prio >= prio_max)
+ return 0;
+
/*
* Iterate the secondary comp algorithms list (in order of priority)
* and try to recompress the page.
@@ -1911,24 +1917,14 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (!zram->comps[prio])
continue;
- /*
- * Skip if the object is already re-compressed with a higher
- * priority algorithm (or same algorithm).
- */
- if (prio <= zram_get_priority(zram, index))
- continue;
-
- num_recomps++;
zstrm = zcomp_stream_get(zram->comps[prio]);
src = kmap_local_page(page);
ret = zcomp_compress(zram->comps[prio], zstrm,
src, &comp_len_new);
kunmap_local(src);
- if (ret) {
- zcomp_stream_put(zram->comps[prio], zstrm);
- return ret;
- }
+ if (ret)
+ break;
class_index_new = zs_lookup_class_index(zram->mem_pool,
comp_len_new);
@@ -1937,6 +1933,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (class_index_new >= class_index_old ||
(threshold && comp_len_new >= threshold)) {
zcomp_stream_put(zram->comps[prio], zstrm);
+ zstrm = NULL;
continue;
}
@@ -1944,14 +1941,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
break;
}
- /*
- * We did not try to recompress, e.g. when we have only one
- * secondary algorithm and the page is already recompressed
- * using that algorithm
- */
- if (!zstrm)
- return 0;
-
+ zram_slot_write_lock(zram, index);
/*
* Decrement the limit (if set) on pages we can recompress, even
* when current recompression was unsuccessful or did not compress
@@ -1961,37 +1951,55 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (*num_recomp_pages)
*num_recomp_pages -= 1;
- if (class_index_new >= class_index_old) {
+ /* Compression error */
+ if (ret) {
+ zcomp_stream_put(zram->comps[prio], zstrm);
+ return ret;
+ }
+
+ if (!zstrm) {
/*
* Secondary algorithms failed to re-compress the page
- * in a way that would save memory, mark the object as
- * incompressible so that we will not try to compress
- * it again.
+ * in a way that would save memory.
*
- * We need to make sure that all secondary algorithms have
- * failed, so we test if the number of recompressions matches
- * the number of active secondary algorithms.
+ * Mark the object incompressible if the max-priority
+ * algorithm couldn't re-compress it.
*/
- if (num_recomps == zram->num_active_comps - 1)
+ if (prio < zram->num_active_comps)
+ return 0;
+ if (zram_test_flag(zram, index, ZRAM_PP_SLOT))
zram_set_flag(zram, index, ZRAM_INCOMPRESSIBLE);
return 0;
}
- /* Successful recompression but above threshold */
- if (threshold && comp_len_new >= threshold)
+ /* Slot has been modified concurrently */
+ if (!zram_test_flag(zram, index, ZRAM_PP_SLOT)) {
+ zcomp_stream_put(zram->comps[prio], zstrm);
return 0;
+ }
- /*
- * If we cannot alloc memory for recompressed object then we bail out
- * and simply keep the old (existing) object in zsmalloc.
- */
+ /* zsmalloc handle allocation can schedule, unlock slot's bucket */
+ zram_slot_write_unlock(zram, index);
handle_new = zs_malloc(zram->mem_pool, comp_len_new,
GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
+ zram_slot_write_lock(zram, index);
+
+ /*
+ * If we couldn't allocate memory for recompressed object then bail
+ * out and simply keep the old (existing) object in mempool.
+ */
if (IS_ERR_VALUE(handle_new)) {
zcomp_stream_put(zram->comps[prio], zstrm);
return PTR_ERR((void *)handle_new);
}
+ /* Slot has been modified concurrently */
+ if (!zram_test_flag(zram, index, ZRAM_PP_SLOT)) {
+ zcomp_stream_put(zram->comps[prio], zstrm);
+ zs_free(zram->mem_pool, handle_new);
+ return 0;
+ }
+
dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len_new);
zcomp_stream_put(zram->comps[prio], zstrm);
--
2.48.1.362.g079036d154-goog
* [PATCHv4 12/17] zsmalloc: factor out pool locking helpers
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (10 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 11/17] zram: unlock slot during recompression Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 15:46 ` Yosry Ahmed
2025-01-31 9:06 ` [PATCHv4 13/17] zsmalloc: factor out size-class " Sergey Senozhatsky
` (4 subsequent siblings)
16 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky, Yosry Ahmed
We currently have a mix of migrate_{read,write}_lock() helpers
that lock zspages, but it's zs_pool that actually has a ->migrate_lock,
access to which is open-coded. Factor out pool migrate locking
into helpers; the zspage migration locking API will be renamed to
reduce confusion.
It's worth mentioning that zsmalloc locks synchronize not only migration,
but also compaction.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
---
mm/zsmalloc.c | 69 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 47 insertions(+), 22 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 817626a351f8..c129596ab960 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -18,7 +18,7 @@
/*
* lock ordering:
* page_lock
- * pool->migrate_lock
+ * pool->lock
* class->lock
* zspage->lock
*/
@@ -224,10 +224,35 @@ struct zs_pool {
struct work_struct free_work;
#endif
/* protect page/zspage migration */
- rwlock_t migrate_lock;
+ rwlock_t lock;
atomic_t compaction_in_progress;
};
+static void pool_write_unlock(struct zs_pool *pool)
+{
+ write_unlock(&pool->lock);
+}
+
+static void pool_write_lock(struct zs_pool *pool)
+{
+ write_lock(&pool->lock);
+}
+
+static void pool_read_unlock(struct zs_pool *pool)
+{
+ read_unlock(&pool->lock);
+}
+
+static void pool_read_lock(struct zs_pool *pool)
+{
+ read_lock(&pool->lock);
+}
+
+static bool pool_lock_is_contended(struct zs_pool *pool)
+{
+ return rwlock_is_contended(&pool->lock);
+}
+
static inline void zpdesc_set_first(struct zpdesc *zpdesc)
{
SetPagePrivate(zpdesc_page(zpdesc));
@@ -290,7 +315,7 @@ static bool ZsHugePage(struct zspage *zspage)
return zspage->huge;
}
-static void migrate_lock_init(struct zspage *zspage);
+static void lock_init(struct zspage *zspage);
static void migrate_read_lock(struct zspage *zspage);
static void migrate_read_unlock(struct zspage *zspage);
static void migrate_write_lock(struct zspage *zspage);
@@ -992,7 +1017,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
return NULL;
zspage->magic = ZSPAGE_MAGIC;
- migrate_lock_init(zspage);
+ lock_init(zspage);
for (i = 0; i < class->pages_per_zspage; i++) {
struct zpdesc *zpdesc;
@@ -1206,7 +1231,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
BUG_ON(in_interrupt());
/* It guarantees it can get zspage from handle safely */
- read_lock(&pool->migrate_lock);
+ pool_read_lock(pool);
obj = handle_to_obj(handle);
obj_to_location(obj, &zpdesc, &obj_idx);
zspage = get_zspage(zpdesc);
@@ -1218,7 +1243,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
* which is smaller granularity.
*/
migrate_read_lock(zspage);
- read_unlock(&pool->migrate_lock);
+ pool_read_unlock(pool);
class = zspage_class(pool, zspage);
off = offset_in_page(class->size * obj_idx);
@@ -1450,16 +1475,16 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
return;
/*
- * The pool->migrate_lock protects the race with zpage's migration
+ * The pool->lock protects the race with zpage's migration
* so it's safe to get the page from handle.
*/
- read_lock(&pool->migrate_lock);
+ pool_read_lock(pool);
obj = handle_to_obj(handle);
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
spin_lock(&class->lock);
- read_unlock(&pool->migrate_lock);
+ pool_read_unlock(pool);
class_stat_sub(class, ZS_OBJS_INUSE, 1);
obj_free(class->size, obj);
@@ -1703,7 +1728,7 @@ static void lock_zspage(struct zspage *zspage)
}
#endif /* CONFIG_COMPACTION */
-static void migrate_lock_init(struct zspage *zspage)
+static void lock_init(struct zspage *zspage)
{
rwlock_init(&zspage->lock);
}
@@ -1793,10 +1818,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
pool = zspage->pool;
/*
- * The pool migrate_lock protects the race between zpage migration
+ * The pool lock protects the race between zpage migration
* and zs_free.
*/
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
class = zspage_class(pool, zspage);
/*
@@ -1833,7 +1858,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* Since we complete the data copy and set up new zspage structure,
* it's okay to release migration_lock.
*/
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
spin_unlock(&class->lock);
migrate_write_unlock(zspage);
@@ -1956,7 +1981,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* protect the race between zpage migration and zs_free
* as well as zpage allocation/free
*/
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
spin_lock(&class->lock);
while (zs_can_compact(class)) {
int fg;
@@ -1983,14 +2008,14 @@ static unsigned long __zs_compact(struct zs_pool *pool,
src_zspage = NULL;
if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
- || rwlock_is_contended(&pool->migrate_lock)) {
+ || pool_lock_is_contended(pool)) {
putback_zspage(class, dst_zspage);
dst_zspage = NULL;
spin_unlock(&class->lock);
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
cond_resched();
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
spin_lock(&class->lock);
}
}
@@ -2002,7 +2027,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
spin_unlock(&class->lock);
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
return pages_freed;
}
@@ -2014,10 +2039,10 @@ unsigned long zs_compact(struct zs_pool *pool)
unsigned long pages_freed = 0;
/*
- * Pool compaction is performed under pool->migrate_lock so it is basically
+ * Pool compaction is performed under pool->lock so it is basically
* single-threaded. Having more than one thread in __zs_compact()
- * will increase pool->migrate_lock contention, which will impact other
- * zsmalloc operations that need pool->migrate_lock.
+ * will increase pool->lock contention, which will impact other
+ * zsmalloc operations that need pool->lock.
*/
if (atomic_xchg(&pool->compaction_in_progress, 1))
return 0;
@@ -2139,7 +2164,7 @@ struct zs_pool *zs_create_pool(const char *name)
return NULL;
init_deferred_free(pool);
- rwlock_init(&pool->migrate_lock);
+ rwlock_init(&pool->lock);
atomic_set(&pool->compaction_in_progress, 0);
pool->name = kstrdup(name, GFP_KERNEL);
--
2.48.1.362.g079036d154-goog
* Re: [PATCHv4 12/17] zsmalloc: factor out pool locking helpers
2025-01-31 9:06 ` [PATCHv4 12/17] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
@ 2025-01-31 15:46 ` Yosry Ahmed
2025-02-03 4:57 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-01-31 15:46 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Fri, Jan 31, 2025 at 06:06:11PM +0900, Sergey Senozhatsky wrote:
> We currently have a mix of migrate_{read,write}_lock() helpers
> that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> access to which is opene-coded. Factor out pool migrate locking
> into helpers, zspage migration locking API will be renamed to
> reduce confusion.
>
> It's worth mentioning that zsmalloc locks sync not only migration,
> but also compaction.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> mm/zsmalloc.c | 69 +++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 47 insertions(+), 22 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 817626a351f8..c129596ab960 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -18,7 +18,7 @@
> /*
> * lock ordering:
> * page_lock
> - * pool->migrate_lock
> + * pool->lock
> * class->lock
> * zspage->lock
> */
> @@ -224,10 +224,35 @@ struct zs_pool {
> struct work_struct free_work;
> #endif
> /* protect page/zspage migration */
> - rwlock_t migrate_lock;
> + rwlock_t lock;
> atomic_t compaction_in_progress;
> };
>
> +static void pool_write_unlock(struct zs_pool *pool)
> +{
> + write_unlock(&pool->lock);
> +}
> +
> +static void pool_write_lock(struct zs_pool *pool)
> +{
> + write_lock(&pool->lock);
> +}
> +
> +static void pool_read_unlock(struct zs_pool *pool)
> +{
> + read_unlock(&pool->lock);
> +}
> +
> +static void pool_read_lock(struct zs_pool *pool)
> +{
> + read_lock(&pool->lock);
> +}
> +
> +static bool pool_lock_is_contended(struct zs_pool *pool)
> +{
> + return rwlock_is_contended(&pool->lock);
> +}
> +
> static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> {
> SetPagePrivate(zpdesc_page(zpdesc));
> @@ -290,7 +315,7 @@ static bool ZsHugePage(struct zspage *zspage)
> return zspage->huge;
> }
>
> -static void migrate_lock_init(struct zspage *zspage);
> +static void lock_init(struct zspage *zspage);
Seems like this change slipped in here, with a s/migrate_lock/lock
replacement if I have to make a guess :P
> static void migrate_read_lock(struct zspage *zspage);
> static void migrate_read_unlock(struct zspage *zspage);
> static void migrate_write_lock(struct zspage *zspage);
> @@ -992,7 +1017,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> return NULL;
>
> zspage->magic = ZSPAGE_MAGIC;
> - migrate_lock_init(zspage);
> + lock_init(zspage);
>
> for (i = 0; i < class->pages_per_zspage; i++) {
> struct zpdesc *zpdesc;
> @@ -1206,7 +1231,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> BUG_ON(in_interrupt());
>
> /* It guarantees it can get zspage from handle safely */
> - read_lock(&pool->migrate_lock);
> + pool_read_lock(pool);
> obj = handle_to_obj(handle);
> obj_to_location(obj, &zpdesc, &obj_idx);
> zspage = get_zspage(zpdesc);
> @@ -1218,7 +1243,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> * which is smaller granularity.
> */
> migrate_read_lock(zspage);
> - read_unlock(&pool->migrate_lock);
> + pool_read_unlock(pool);
>
> class = zspage_class(pool, zspage);
> off = offset_in_page(class->size * obj_idx);
> @@ -1450,16 +1475,16 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> return;
>
> /*
> - * The pool->migrate_lock protects the race with zpage's migration
> + * The pool->lock protects the race with zpage's migration
> * so it's safe to get the page from handle.
> */
> - read_lock(&pool->migrate_lock);
> + pool_read_lock(pool);
> obj = handle_to_obj(handle);
> obj_to_zpdesc(obj, &f_zpdesc);
> zspage = get_zspage(f_zpdesc);
> class = zspage_class(pool, zspage);
> spin_lock(&class->lock);
> - read_unlock(&pool->migrate_lock);
> + pool_read_unlock(pool);
>
> class_stat_sub(class, ZS_OBJS_INUSE, 1);
> obj_free(class->size, obj);
> @@ -1703,7 +1728,7 @@ static void lock_zspage(struct zspage *zspage)
> }
> #endif /* CONFIG_COMPACTION */
>
> -static void migrate_lock_init(struct zspage *zspage)
> +static void lock_init(struct zspage *zspage)
> {
> rwlock_init(&zspage->lock);
> }
> @@ -1793,10 +1818,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> pool = zspage->pool;
>
> /*
> - * The pool migrate_lock protects the race between zpage migration
> + * The pool lock protects the race between zpage migration
> * and zs_free.
> */
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> class = zspage_class(pool, zspage);
>
> /*
> @@ -1833,7 +1858,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> * Since we complete the data copy and set up new zspage structure,
> * it's okay to release migration_lock.
> */
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
> spin_unlock(&class->lock);
> migrate_write_unlock(zspage);
>
> @@ -1956,7 +1981,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * protect the race between zpage migration and zs_free
> * as well as zpage allocation/free
> */
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> spin_lock(&class->lock);
> while (zs_can_compact(class)) {
> int fg;
> @@ -1983,14 +2008,14 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> src_zspage = NULL;
>
> if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> - || rwlock_is_contended(&pool->migrate_lock)) {
> + || pool_lock_is_contended(pool)) {
> putback_zspage(class, dst_zspage);
> dst_zspage = NULL;
>
> spin_unlock(&class->lock);
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
> cond_resched();
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> spin_lock(&class->lock);
> }
> }
> @@ -2002,7 +2027,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> putback_zspage(class, dst_zspage);
>
> spin_unlock(&class->lock);
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
>
> return pages_freed;
> }
> @@ -2014,10 +2039,10 @@ unsigned long zs_compact(struct zs_pool *pool)
> unsigned long pages_freed = 0;
>
> /*
> - * Pool compaction is performed under pool->migrate_lock so it is basically
> + * Pool compaction is performed under pool->lock so it is basically
> * single-threaded. Having more than one thread in __zs_compact()
> - * will increase pool->migrate_lock contention, which will impact other
> - * zsmalloc operations that need pool->migrate_lock.
> + * will increase pool->lock contention, which will impact other
> + * zsmalloc operations that need pool->lock.
> */
> if (atomic_xchg(&pool->compaction_in_progress, 1))
> return 0;
> @@ -2139,7 +2164,7 @@ struct zs_pool *zs_create_pool(const char *name)
> return NULL;
>
> init_deferred_free(pool);
> - rwlock_init(&pool->migrate_lock);
> + rwlock_init(&pool->lock);
> atomic_set(&pool->compaction_in_progress, 0);
>
> pool->name = kstrdup(name, GFP_KERNEL);
> --
> 2.48.1.362.g079036d154-goog
>
* Re: [PATCHv4 12/17] zsmalloc: factor out pool locking helpers
2025-01-31 15:46 ` Yosry Ahmed
@ 2025-02-03 4:57 ` Sergey Senozhatsky
0 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 4:57 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/01/31 15:46), Yosry Ahmed wrote:
> > +static void pool_write_unlock(struct zs_pool *pool)
> > +{
> > + write_unlock(&pool->lock);
> > +}
> > +
> > +static void pool_write_lock(struct zs_pool *pool)
> > +{
> > + write_lock(&pool->lock);
> > +}
> > +
> > +static void pool_read_unlock(struct zs_pool *pool)
> > +{
> > + read_unlock(&pool->lock);
> > +}
> > +
> > +static void pool_read_lock(struct zs_pool *pool)
> > +{
> > + read_lock(&pool->lock);
> > +}
> > +
> > +static bool pool_lock_is_contended(struct zs_pool *pool)
> > +{
> > + return rwlock_is_contended(&pool->lock);
> > +}
> > +
> > static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> > {
> > SetPagePrivate(zpdesc_page(zpdesc));
> > @@ -290,7 +315,7 @@ static bool ZsHugePage(struct zspage *zspage)
> > return zspage->huge;
> > }
> >
> > -static void migrate_lock_init(struct zspage *zspage);
> > +static void lock_init(struct zspage *zspage);
>
> Seems like this change slipped in here, with a s/migrate_lock/lock
> replacement if I have to make a guess :P
Look, it compiles! :P
* [PATCHv4 13/17] zsmalloc: factor out size-class locking helpers
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (11 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 12/17] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 14/17] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
` (3 subsequent siblings)
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky, Yosry Ahmed
Move open-coded size-class locking to dedicated helpers.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
mm/zsmalloc.c | 47 ++++++++++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c129596ab960..4b4c77bc08f9 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -253,6 +253,16 @@ static bool pool_lock_is_contended(struct zs_pool *pool)
return rwlock_is_contended(&pool->lock);
}
+static void size_class_lock(struct size_class *class)
+{
+ spin_lock(&class->lock);
+}
+
+static void size_class_unlock(struct size_class *class)
+{
+ spin_unlock(&class->lock);
+}
+
static inline void zpdesc_set_first(struct zpdesc *zpdesc)
{
SetPagePrivate(zpdesc_page(zpdesc));
@@ -613,8 +623,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
if (class->index != i)
continue;
- spin_lock(&class->lock);
-
+ size_class_lock(class);
seq_printf(s, " %5u %5u ", i, class->size);
for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
inuse_totals[fg] += class_stat_read(class, fg);
@@ -624,7 +633,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
obj_allocated = class_stat_read(class, ZS_OBJS_ALLOCATED);
obj_used = class_stat_read(class, ZS_OBJS_INUSE);
freeable = zs_can_compact(class);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
objs_per_zspage = class->objs_per_zspage;
pages_used = obj_allocated / objs_per_zspage *
@@ -1399,7 +1408,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
class = pool->size_class[get_size_class_index(size)];
/* class->lock effectively protects the zpage migration */
- spin_lock(&class->lock);
+ size_class_lock(class);
zspage = find_get_zspage(class);
if (likely(zspage)) {
obj_malloc(pool, zspage, handle);
@@ -1410,7 +1419,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
goto out;
}
- spin_unlock(&class->lock);
+ size_class_unlock(class);
zspage = alloc_zspage(pool, class, gfp);
if (!zspage) {
@@ -1418,7 +1427,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
return (unsigned long)ERR_PTR(-ENOMEM);
}
- spin_lock(&class->lock);
+ size_class_lock(class);
obj_malloc(pool, zspage, handle);
newfg = get_fullness_group(class, zspage);
insert_zspage(class, zspage, newfg);
@@ -1429,7 +1438,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
/* We completely set up zspage so mark them as movable */
SetZsPageMovable(pool, zspage);
out:
- spin_unlock(&class->lock);
+ size_class_unlock(class);
return handle;
}
@@ -1483,7 +1492,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
- spin_lock(&class->lock);
+ size_class_lock(class);
pool_read_unlock(pool);
class_stat_sub(class, ZS_OBJS_INUSE, 1);
@@ -1493,7 +1502,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
if (fullness == ZS_INUSE_RATIO_0)
free_zspage(pool, class, zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
cache_free_handle(pool, handle);
}
EXPORT_SYMBOL_GPL(zs_free);
@@ -1827,7 +1836,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
/*
* the class lock protects zpage alloc/free in the zspage.
*/
- spin_lock(&class->lock);
+ size_class_lock(class);
/* the migrate_write_lock protects zpage access via zs_map_object */
migrate_write_lock(zspage);
@@ -1859,7 +1868,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* it's okay to release migration_lock.
*/
pool_write_unlock(pool);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
migrate_write_unlock(zspage);
zpdesc_get(newzpdesc);
@@ -1903,10 +1912,10 @@ static void async_free_zspage(struct work_struct *work)
if (class->index != i)
continue;
- spin_lock(&class->lock);
+ size_class_lock(class);
list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0],
&free_pages);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
}
list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
@@ -1914,10 +1923,10 @@ static void async_free_zspage(struct work_struct *work)
lock_zspage(zspage);
class = zspage_class(pool, zspage);
- spin_lock(&class->lock);
+ size_class_lock(class);
class_stat_sub(class, ZS_INUSE_RATIO_0, 1);
__free_zspage(pool, class, zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
}
};
@@ -1982,7 +1991,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* as well as zpage allocation/free
*/
pool_write_lock(pool);
- spin_lock(&class->lock);
+ size_class_lock(class);
while (zs_can_compact(class)) {
int fg;
@@ -2012,11 +2021,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
dst_zspage = NULL;
- spin_unlock(&class->lock);
+ size_class_unlock(class);
pool_write_unlock(pool);
cond_resched();
pool_write_lock(pool);
- spin_lock(&class->lock);
+ size_class_lock(class);
}
}
@@ -2026,7 +2035,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
if (dst_zspage)
putback_zspage(class, dst_zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
pool_write_unlock(pool);
return pages_freed;
--
2.48.1.362.g079036d154-goog
* [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (12 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 13/17] zsmalloc: factor out size-class " Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 15:51 ` Yosry Ahmed
2025-01-31 9:06 ` [PATCHv4 15/17] zsmalloc: introduce new object mapping API Sergey Senozhatsky
` (2 subsequent siblings)
16 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky, Yosry Ahmed
Switch over from rwlock_t to an atomic_t variable that takes a negative
value when the page is under migration, or positive values when the
page is used by zsmalloc users (object map, etc.) Using an rwsem
per zspage is a little too memory heavy; a simple atomic_t should
suffice.
The zspage lock is a leaf lock for zs_map_object(), where it's read-acquired.
Since this lock now permits preemption, extra care needs to be taken when
it is write-acquired - all writers grab it in atomic context, so they
cannot spin and wait for a (potentially preempted) reader to unlock the
zspage. There are only two writers at this moment - migration and
compaction. In both cases we use write-try-lock and bail out if the
zspage is read-locked. Writers, on the other hand, never get preempted,
so readers can spin waiting for the writer to unlock the zspage.
With this we can implement a preemptible object mapping.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
---
mm/zsmalloc.c | 135 +++++++++++++++++++++++++++++++-------------------
1 file changed, 83 insertions(+), 52 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 4b4c77bc08f9..f5b5fe732e50 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -292,6 +292,9 @@ static inline void free_zpdesc(struct zpdesc *zpdesc)
__free_page(page);
}
+#define ZS_PAGE_UNLOCKED 0
+#define ZS_PAGE_WRLOCKED -1
+
struct zspage {
struct {
unsigned int huge:HUGE_BITS;
@@ -304,7 +307,7 @@ struct zspage {
struct zpdesc *first_zpdesc;
struct list_head list; /* fullness list */
struct zs_pool *pool;
- rwlock_t lock;
+ atomic_t lock;
};
struct mapping_area {
@@ -314,6 +317,59 @@ struct mapping_area {
enum zs_mapmode vm_mm; /* mapping mode */
};
+static void zspage_lock_init(struct zspage *zspage)
+{
+ atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
+}
+
+/*
+ * zspage lock permits preemption on the reader-side (there can be multiple
+ * readers). Writers (exclusive zspage ownership), on the other hand, are
+ * always run in atomic context and cannot spin waiting for a (potentially
+ * preempted) reader to unlock zspage. This, basically, means that writers
+ * can only call write-try-lock and must bail out if it didn't succeed.
+ *
+ * At the same time, writers cannot reschedule under zspage write-lock,
+ * so readers can spin waiting for the writer to unlock zspage.
+ */
+static void zspage_read_lock(struct zspage *zspage)
+{
+ atomic_t *lock = &zspage->lock;
+ int old = atomic_read(lock);
+
+ do {
+ if (old == ZS_PAGE_WRLOCKED) {
+ cpu_relax();
+ old = atomic_read(lock);
+ continue;
+ }
+ } while (!atomic_try_cmpxchg(lock, &old, old + 1));
+}
+
+static void zspage_read_unlock(struct zspage *zspage)
+{
+ atomic_dec(&zspage->lock);
+}
+
+static bool zspage_try_write_lock(struct zspage *zspage)
+{
+ atomic_t *lock = &zspage->lock;
+ int old = ZS_PAGE_UNLOCKED;
+
+ preempt_disable();
+ if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
+ return true;
+
+ preempt_enable();
+ return false;
+}
+
+static void zspage_write_unlock(struct zspage *zspage)
+{
+ atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
+ preempt_enable();
+}
+
/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
static void SetZsHugePage(struct zspage *zspage)
{
@@ -325,12 +381,6 @@ static bool ZsHugePage(struct zspage *zspage)
return zspage->huge;
}
-static void lock_init(struct zspage *zspage);
-static void migrate_read_lock(struct zspage *zspage);
-static void migrate_read_unlock(struct zspage *zspage);
-static void migrate_write_lock(struct zspage *zspage);
-static void migrate_write_unlock(struct zspage *zspage);
-
#ifdef CONFIG_COMPACTION
static void kick_deferred_free(struct zs_pool *pool);
static void init_deferred_free(struct zs_pool *pool);
@@ -1026,7 +1076,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
return NULL;
zspage->magic = ZSPAGE_MAGIC;
- lock_init(zspage);
+ zspage_lock_init(zspage);
for (i = 0; i < class->pages_per_zspage; i++) {
struct zpdesc *zpdesc;
@@ -1251,7 +1301,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
* zs_unmap_object API so delegate the locking from class to zspage
* which is smaller granularity.
*/
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
pool_read_unlock(pool);
class = zspage_class(pool, zspage);
@@ -1311,7 +1361,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
}
local_unlock(&zs_map_area.lock);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
@@ -1705,18 +1755,18 @@ static void lock_zspage(struct zspage *zspage)
/*
* Pages we haven't locked yet can be migrated off the list while we're
* trying to lock them, so we need to be careful and only attempt to
- * lock each page under migrate_read_lock(). Otherwise, the page we lock
+ * lock each page under zspage_read_lock(). Otherwise, the page we lock
* may no longer belong to the zspage. This means that we may wait for
* the wrong page to unlock, so we must take a reference to the page
- * prior to waiting for it to unlock outside migrate_read_lock().
+ * prior to waiting for it to unlock outside zspage_read_lock().
*/
while (1) {
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
zpdesc = get_first_zpdesc(zspage);
if (zpdesc_trylock(zpdesc))
break;
zpdesc_get(zpdesc);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
zpdesc_wait_locked(zpdesc);
zpdesc_put(zpdesc);
}
@@ -1727,41 +1777,16 @@ static void lock_zspage(struct zspage *zspage)
curr_zpdesc = zpdesc;
} else {
zpdesc_get(zpdesc);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
zpdesc_wait_locked(zpdesc);
zpdesc_put(zpdesc);
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
}
}
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
}
#endif /* CONFIG_COMPACTION */
-static void lock_init(struct zspage *zspage)
-{
- rwlock_init(&zspage->lock);
-}
-
-static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock)
-{
- read_lock(&zspage->lock);
-}
-
-static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
-{
- read_unlock(&zspage->lock);
-}
-
-static void migrate_write_lock(struct zspage *zspage)
-{
- write_lock(&zspage->lock);
-}
-
-static void migrate_write_unlock(struct zspage *zspage)
-{
- write_unlock(&zspage->lock);
-}
-
#ifdef CONFIG_COMPACTION
static const struct movable_operations zsmalloc_mops;
@@ -1803,7 +1828,7 @@ static bool zs_page_isolate(struct page *page, isolate_mode_t mode)
}
static int zs_page_migrate(struct page *newpage, struct page *page,
- enum migrate_mode mode)
+ enum migrate_mode mode)
{
struct zs_pool *pool;
struct size_class *class;
@@ -1819,15 +1844,12 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
VM_BUG_ON_PAGE(!zpdesc_is_isolated(zpdesc), zpdesc_page(zpdesc));
- /* We're committed, tell the world that this is a Zsmalloc page. */
- __zpdesc_set_zsmalloc(newzpdesc);
-
/* The page is locked, so this pointer must remain valid */
zspage = get_zspage(zpdesc);
pool = zspage->pool;
/*
- * The pool lock protects the race between zpage migration
+ * The pool->lock protects the race between zpage migration
* and zs_free.
*/
pool_write_lock(pool);
@@ -1837,8 +1859,15 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* the class lock protects zpage alloc/free in the zspage.
*/
size_class_lock(class);
- /* the migrate_write_lock protects zpage access via zs_map_object */
- migrate_write_lock(zspage);
+ /* the zspage write_lock protects zpage access via zs_map_object */
+ if (!zspage_try_write_lock(zspage)) {
+ size_class_unlock(class);
+ pool_write_unlock(pool);
+ return -EINVAL;
+ }
+
+ /* We're committed, tell the world that this is a Zsmalloc page. */
+ __zpdesc_set_zsmalloc(newzpdesc);
offset = get_first_obj_offset(zpdesc);
s_addr = kmap_local_zpdesc(zpdesc);
@@ -1869,7 +1898,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
*/
pool_write_unlock(pool);
size_class_unlock(class);
- migrate_write_unlock(zspage);
+ zspage_write_unlock(zspage);
zpdesc_get(newzpdesc);
if (zpdesc_zone(newzpdesc) != zpdesc_zone(zpdesc)) {
@@ -2005,9 +2034,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
if (!src_zspage)
break;
- migrate_write_lock(src_zspage);
+ if (!zspage_try_write_lock(src_zspage))
+ break;
+
migrate_zspage(pool, src_zspage, dst_zspage);
- migrate_write_unlock(src_zspage);
+ zspage_write_unlock(src_zspage);
fg = putback_zspage(class, src_zspage);
if (fg == ZS_INUSE_RATIO_0) {
--
2.48.1.362.g079036d154-goog
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-01-31 9:06 ` [PATCHv4 14/17] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
@ 2025-01-31 15:51 ` Yosry Ahmed
2025-02-03 3:13 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-01-31 15:51 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Fri, Jan 31, 2025 at 06:06:13PM +0900, Sergey Senozhatsky wrote:
> Switch over from rwlock_t to a atomic_t variable that takes negative
> value when the page is under migration, or positive values when the
> page is used by zsmalloc users (object map, etc.) Using a rwsem
> per-zspage is a little too memory heavy, a simple atomic_t should
> suffice.
>
> zspage lock is a leaf lock for zs_map_object(), where it's read-acquired.
> Since this lock now permits preemption extra care needs to be taken when
> it is write-acquired - all writers grab it in atomic context, so they
> cannot spin and wait for (potentially preempted) reader to unlock zspage.
> There are only two writers at this moment - migration and compaction. In
> both cases we use write-try-lock and bail out if zspage is read locked.
> Writers, on the other hand, never get preempted, so readers can spin
> waiting for the writer to unlock zspage.
>
> With this we can implement a preemptible object mapping.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> mm/zsmalloc.c | 135 +++++++++++++++++++++++++++++++-------------------
> 1 file changed, 83 insertions(+), 52 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 4b4c77bc08f9..f5b5fe732e50 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -292,6 +292,9 @@ static inline void free_zpdesc(struct zpdesc *zpdesc)
> __free_page(page);
> }
>
> +#define ZS_PAGE_UNLOCKED 0
> +#define ZS_PAGE_WRLOCKED -1
> +
> struct zspage {
> struct {
> unsigned int huge:HUGE_BITS;
> @@ -304,7 +307,7 @@ struct zspage {
> struct zpdesc *first_zpdesc;
> struct list_head list; /* fullness list */
> struct zs_pool *pool;
> - rwlock_t lock;
> + atomic_t lock;
> };
>
> struct mapping_area {
> @@ -314,6 +317,59 @@ struct mapping_area {
> enum zs_mapmode vm_mm; /* mapping mode */
> };
>
> +static void zspage_lock_init(struct zspage *zspage)
> +{
> + atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
> +}
> +
> +/*
> + * zspage lock permits preemption on the reader-side (there can be multiple
> + * readers). Writers (exclusive zspage ownership), on the other hand, are
> + * always run in atomic context and cannot spin waiting for a (potentially
> + * preempted) reader to unlock zspage. This, basically, means that writers
> + * can only call write-try-lock and must bail out if it didn't succeed.
> + *
> + * At the same time, writers cannot reschedule under zspage write-lock,
> + * so readers can spin waiting for the writer to unlock zspage.
> + */
> +static void zspage_read_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old = atomic_read(lock);
> +
> + do {
> + if (old == ZS_PAGE_WRLOCKED) {
> + cpu_relax();
> + old = atomic_read(lock);
> + continue;
> + }
> + } while (!atomic_try_cmpxchg(lock, &old, old + 1));
> +}
> +
> +static void zspage_read_unlock(struct zspage *zspage)
> +{
> + atomic_dec(&zspage->lock);
> +}
> +
> +static bool zspage_try_write_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old = ZS_PAGE_UNLOCKED;
> +
> + preempt_disable();
> + if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
FWIW, I am usually afraid to manually implement locking like this. For
example, queued_spin_trylock() uses atomic_try_cmpxchg_acquire() not
atomic_try_cmpxchg(), and I am not quite sure what could happen without
ACQUIRE semantics here on some architectures.
We also lose some debugging capabilities as Hillf pointed out in another
patch.
Just my 2c.
> + return true;
> +
> + preempt_enable();
> + return false;
> +}
> +
> +static void zspage_write_unlock(struct zspage *zspage)
> +{
> + atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
> + preempt_enable();
> +}
> +
> /* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> static void SetZsHugePage(struct zspage *zspage)
> {
> @@ -325,12 +381,6 @@ static bool ZsHugePage(struct zspage *zspage)
> return zspage->huge;
> }
>
> -static void lock_init(struct zspage *zspage);
> -static void migrate_read_lock(struct zspage *zspage);
> -static void migrate_read_unlock(struct zspage *zspage);
> -static void migrate_write_lock(struct zspage *zspage);
> -static void migrate_write_unlock(struct zspage *zspage);
> -
> #ifdef CONFIG_COMPACTION
> static void kick_deferred_free(struct zs_pool *pool);
> static void init_deferred_free(struct zs_pool *pool);
> @@ -1026,7 +1076,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> return NULL;
>
> zspage->magic = ZSPAGE_MAGIC;
> - lock_init(zspage);
> + zspage_lock_init(zspage);
>
> for (i = 0; i < class->pages_per_zspage; i++) {
> struct zpdesc *zpdesc;
> @@ -1251,7 +1301,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> * zs_unmap_object API so delegate the locking from class to zspage
> * which is smaller granularity.
> */
> - migrate_read_lock(zspage);
> + zspage_read_lock(zspage);
> pool_read_unlock(pool);
>
> class = zspage_class(pool, zspage);
> @@ -1311,7 +1361,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
> }
> local_unlock(&zs_map_area.lock);
>
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> }
> EXPORT_SYMBOL_GPL(zs_unmap_object);
>
> @@ -1705,18 +1755,18 @@ static void lock_zspage(struct zspage *zspage)
> /*
> * Pages we haven't locked yet can be migrated off the list while we're
> * trying to lock them, so we need to be careful and only attempt to
> - * lock each page under migrate_read_lock(). Otherwise, the page we lock
> + * lock each page under zspage_read_lock(). Otherwise, the page we lock
> * may no longer belong to the zspage. This means that we may wait for
> * the wrong page to unlock, so we must take a reference to the page
> - * prior to waiting for it to unlock outside migrate_read_lock().
> + * prior to waiting for it to unlock outside zspage_read_lock().
> */
> while (1) {
> - migrate_read_lock(zspage);
> + zspage_read_lock(zspage);
> zpdesc = get_first_zpdesc(zspage);
> if (zpdesc_trylock(zpdesc))
> break;
> zpdesc_get(zpdesc);
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> zpdesc_wait_locked(zpdesc);
> zpdesc_put(zpdesc);
> }
> @@ -1727,41 +1777,16 @@ static void lock_zspage(struct zspage *zspage)
> curr_zpdesc = zpdesc;
> } else {
> zpdesc_get(zpdesc);
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> zpdesc_wait_locked(zpdesc);
> zpdesc_put(zpdesc);
> - migrate_read_lock(zspage);
> + zspage_read_lock(zspage);
> }
> }
> - migrate_read_unlock(zspage);
> + zspage_read_unlock(zspage);
> }
> #endif /* CONFIG_COMPACTION */
>
> -static void lock_init(struct zspage *zspage)
> -{
> - rwlock_init(&zspage->lock);
> -}
> -
> -static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock)
> -{
> - read_lock(&zspage->lock);
> -}
> -
> -static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
> -{
> - read_unlock(&zspage->lock);
> -}
> -
> -static void migrate_write_lock(struct zspage *zspage)
> -{
> - write_lock(&zspage->lock);
> -}
> -
> -static void migrate_write_unlock(struct zspage *zspage)
> -{
> - write_unlock(&zspage->lock);
> -}
> -
> #ifdef CONFIG_COMPACTION
>
> static const struct movable_operations zsmalloc_mops;
> @@ -1803,7 +1828,7 @@ static bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> }
>
> static int zs_page_migrate(struct page *newpage, struct page *page,
> - enum migrate_mode mode)
> + enum migrate_mode mode)
> {
> struct zs_pool *pool;
> struct size_class *class;
> @@ -1819,15 +1844,12 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
>
> VM_BUG_ON_PAGE(!zpdesc_is_isolated(zpdesc), zpdesc_page(zpdesc));
>
> - /* We're committed, tell the world that this is a Zsmalloc page. */
> - __zpdesc_set_zsmalloc(newzpdesc);
> -
> /* The page is locked, so this pointer must remain valid */
> zspage = get_zspage(zpdesc);
> pool = zspage->pool;
>
> /*
> - * The pool lock protects the race between zpage migration
> + * The pool->lock protects the race between zpage migration
> * and zs_free.
> */
> pool_write_lock(pool);
> @@ -1837,8 +1859,15 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> * the class lock protects zpage alloc/free in the zspage.
> */
> size_class_lock(class);
> - /* the migrate_write_lock protects zpage access via zs_map_object */
> - migrate_write_lock(zspage);
> + /* the zspage write_lock protects zpage access via zs_map_object */
> + if (!zspage_try_write_lock(zspage)) {
> + size_class_unlock(class);
> + pool_write_unlock(pool);
> + return -EINVAL;
> + }
> +
> + /* We're committed, tell the world that this is a Zsmalloc page. */
> + __zpdesc_set_zsmalloc(newzpdesc);
>
> offset = get_first_obj_offset(zpdesc);
> s_addr = kmap_local_zpdesc(zpdesc);
> @@ -1869,7 +1898,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> */
> pool_write_unlock(pool);
> size_class_unlock(class);
> - migrate_write_unlock(zspage);
> + zspage_write_unlock(zspage);
>
> zpdesc_get(newzpdesc);
> if (zpdesc_zone(newzpdesc) != zpdesc_zone(zpdesc)) {
> @@ -2005,9 +2034,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> if (!src_zspage)
> break;
>
> - migrate_write_lock(src_zspage);
> + if (!zspage_try_write_lock(src_zspage))
> + break;
> +
> migrate_zspage(pool, src_zspage, dst_zspage);
> - migrate_write_unlock(src_zspage);
> + zspage_write_unlock(src_zspage);
>
> fg = putback_zspage(class, src_zspage);
> if (fg == ZS_INUSE_RATIO_0) {
> --
> 2.48.1.362.g079036d154-goog
>
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-01-31 15:51 ` Yosry Ahmed
@ 2025-02-03 3:13 ` Sergey Senozhatsky
2025-02-03 4:56 ` Sergey Senozhatsky
2025-02-03 21:11 ` Yosry Ahmed
0 siblings, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 3:13 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/01/31 15:51), Yosry Ahmed wrote:
> > +static void zspage_read_lock(struct zspage *zspage)
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old = atomic_read(lock);
> > +
> > + do {
> > + if (old == ZS_PAGE_WRLOCKED) {
> > + cpu_relax();
> > + old = atomic_read(lock);
> > + continue;
> > + }
> > + } while (!atomic_try_cmpxchg(lock, &old, old + 1));
> > +}
> > +
> > +static void zspage_read_unlock(struct zspage *zspage)
> > +{
> > + atomic_dec(&zspage->lock);
> > +}
> > +
> > +static bool zspage_try_write_lock(struct zspage *zspage)
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old = ZS_PAGE_UNLOCKED;
> > +
> > + preempt_disable();
> > + if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
>
> FWIW, I am usually afraid to manually implement locking like this. For
> example, queued_spin_trylock() uses atomic_try_cmpxchg_acquire() not
> atomic_try_cmpxchg(), and I am not quite sure what could happen without
> ACQUIRE semantics here on some architectures.
I looked into it a bit, wasn't sure either. Perhaps we can switch
to acquire/release semantics, I'm not an expert on this, would highly
appreciate help.
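Something along these lines, perhaps (an untested sketch just to make the
idea concrete; it assumes the generic atomic_try_cmpxchg_acquire(),
atomic_dec_return_release() and atomic_set_release() helpers):

	static void zspage_read_lock(struct zspage *zspage)
	{
		atomic_t *lock = &zspage->lock;
		int old = atomic_read(lock);

		do {
			if (old == ZS_PAGE_WRLOCKED) {
				cpu_relax();
				old = atomic_read(lock);
				continue;
			}
			/* ACQUIRE pairs with RELEASE in the unlock paths */
		} while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
	}

	static void zspage_read_unlock(struct zspage *zspage)
	{
		atomic_dec_return_release(&zspage->lock);
	}

	static bool zspage_try_write_lock(struct zspage *zspage)
	{
		atomic_t *lock = &zspage->lock;
		int old = ZS_PAGE_UNLOCKED;

		preempt_disable();
		if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED))
			return true;

		preempt_enable();
		return false;
	}

	static void zspage_write_unlock(struct zspage *zspage)
	{
		atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
		preempt_enable();
	}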
> We also lose some debugging capabilities as Hilf pointed out in another
> patch.
So that zspage lock should not have been a lock, I think - it's a ref-counter
and it's being used as one:
map()
{
page->users++;
}
unmap()
{
page->users--;
}
migrate()
{
if (!page->users)
migrate_page();
}
> Just my 2c.
Perhaps we can sprinkle some lockdep on it. For instance:
---
mm/zsmalloc.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 956445f4d554..06b1d8ca9e89 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -308,6 +308,10 @@ struct zspage {
struct list_head list; /* fullness list */
struct zs_pool *pool;
atomic_t lock;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map lockdep_map;
+#endif
};
struct mapping_area {
@@ -319,6 +323,12 @@ struct mapping_area {
static void zspage_lock_init(struct zspage *zspage)
{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ static struct lock_class_key key;
+
+ lockdep_init_map(&zspage->lockdep_map, "zsmalloc-page", &key, 0);
+#endif
+
atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
}
@@ -344,11 +354,19 @@ static void zspage_read_lock(struct zspage *zspage)
continue;
}
} while (!atomic_try_cmpxchg(lock, &old, old + 1));
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
+#endif
}
static void zspage_read_unlock(struct zspage *zspage)
{
atomic_dec(&zspage->lock);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_release(&zspage->lockdep_map, _RET_IP_);
+#endif
}
static bool zspage_try_write_lock(struct zspage *zspage)
@@ -357,8 +375,12 @@ static bool zspage_try_write_lock(struct zspage *zspage)
int old = ZS_PAGE_UNLOCKED;
preempt_disable();
- if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
+ if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED)) {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
+#endif
return true;
+ }
preempt_enable();
return false;
--
2.48.1.362.g079036d154-goog
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-03 3:13 ` Sergey Senozhatsky
@ 2025-02-03 4:56 ` Sergey Senozhatsky
2025-02-03 21:11 ` Yosry Ahmed
1 sibling, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-03 4:56 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/02/03 12:13), Sergey Senozhatsky wrote:
> > Just my 2c.
>
> Perhaps we can sprinkle some lockdep on it.
I forgot to rwsem_release() in zspage_write_unlock() and that
has triggered lockdep :)
---
mm/zsmalloc.c | 27 ++++++++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 956445f4d554..1d4700e457d4 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -308,6 +308,10 @@ struct zspage {
struct list_head list; /* fullness list */
struct zs_pool *pool;
atomic_t lock;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map lockdep_map;
+#endif
};
struct mapping_area {
@@ -319,6 +323,12 @@ struct mapping_area {
static void zspage_lock_init(struct zspage *zspage)
{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ static struct lock_class_key key;
+
+ lockdep_init_map(&zspage->lockdep_map, "zsmalloc-page", &key, 0);
+#endif
+
atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
}
@@ -344,11 +354,19 @@ static void zspage_read_lock(struct zspage *zspage)
continue;
}
} while (!atomic_try_cmpxchg(lock, &old, old + 1));
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
+#endif
}
static void zspage_read_unlock(struct zspage *zspage)
{
atomic_dec(&zspage->lock);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_release(&zspage->lockdep_map, _RET_IP_);
+#endif
}
static bool zspage_try_write_lock(struct zspage *zspage)
@@ -357,8 +375,12 @@ static bool zspage_try_write_lock(struct zspage *zspage)
int old = ZS_PAGE_UNLOCKED;
preempt_disable();
- if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
+ if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED)) {
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
+#endif
return true;
+ }
preempt_enable();
return false;
@@ -367,6 +389,9 @@ static bool zspage_try_write_lock(struct zspage *zspage)
static void zspage_write_unlock(struct zspage *zspage)
{
atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ rwsem_release(&zspage->lockdep_map, _RET_IP_);
+#endif
preempt_enable();
}
--
2.48.1.362.g079036d154-goog
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-03 3:13 ` Sergey Senozhatsky
2025-02-03 4:56 ` Sergey Senozhatsky
@ 2025-02-03 21:11 ` Yosry Ahmed
2025-02-04 6:59 ` Sergey Senozhatsky
1 sibling, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-03 21:11 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Mon, Feb 03, 2025 at 12:13:49PM +0900, Sergey Senozhatsky wrote:
> On (25/01/31 15:51), Yosry Ahmed wrote:
> > > +static void zspage_read_lock(struct zspage *zspage)
> > > +{
> > > + atomic_t *lock = &zspage->lock;
> > > + int old = atomic_read(lock);
> > > +
> > > + do {
> > > + if (old == ZS_PAGE_WRLOCKED) {
> > > + cpu_relax();
> > > + old = atomic_read(lock);
> > > + continue;
> > > + }
> > > + } while (!atomic_try_cmpxchg(lock, &old, old + 1));
> > > +}
> > > +
> > > +static void zspage_read_unlock(struct zspage *zspage)
> > > +{
> > > + atomic_dec(&zspage->lock);
> > > +}
> > > +
> > > +static bool zspage_try_write_lock(struct zspage *zspage)
> > > +{
> > > + atomic_t *lock = &zspage->lock;
> > > + int old = ZS_PAGE_UNLOCKED;
> > > +
> > > + preempt_disable();
> > > + if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
> >
> > FWIW, I am usually afraid to manually implement locking like this. For
> > example, queued_spin_trylock() uses atomic_try_cmpxchg_acquire() not
> > atomic_try_cmpxchg(), and I am not quite sure what could happen without
> > ACQUIRE semantics here on some architectures.
>
> I looked into it a bit, wasn't sure either. Perhaps we can switch
> to acquire/release semantics, I'm not an expert on this, would highly
> appreciate help.
>
> > We also lose some debugging capabilities as Hillf pointed out in another
> > patch.
>
> So that zspage lock should have not been a lock, I think, it's a ref-counter
> and it's being used as one
>
> map()
> {
> page->users++;
> }
>
> unmap()
> {
> page->users--;
> }
>
> migrate()
> {
> if (!page->users)
> migrate_page();
> }
Hmm, but in this case we want migration to block new map/unmap
operations. So a vanilla refcount won't work.
>
> > Just my 2c.
>
> Perhaps we can sprinkle some lockdep on it. For instance:
Honestly this looks like more reason to use existing lock primitives to
me. What are the candidates? I assume rw_semaphore, anything else?
I guess the main reason you didn't use a rw_semaphore is the extra
memory usage. Seems like it uses ~32 bytes more than rwlock_t on x86_64.
That's per zspage. Depending on how many compressed pages we have
per-zspage this may not be too bad.
For example, if a zspage has a chain length of 4, and the average
compression ratio of 1/3, that's 12 compressed pages so the extra
overhead is <3 bytes per compressed page.
Given that the chain length is a function of the class size, I think we
can calculate the exact extra memory overhead per-compressed page for
each class and get a mean/median over all classes.
If the memory overhead is insignificant I'd rather use existing lock
primitives tbh.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-03 21:11 ` Yosry Ahmed
@ 2025-02-04 6:59 ` Sergey Senozhatsky
2025-02-04 17:19 ` Yosry Ahmed
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-04 6:59 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/02/03 21:11), Yosry Ahmed wrote:
> > > We also lose some debugging capabilities as Hillf pointed out in another
> > > patch.
> >
> > So that zspage lock should have not been a lock, I think, it's a ref-counter
> > and it's being used as one
> >
> > map()
> > {
> > page->users++;
> > }
> >
> > unmap()
> > {
> > page->users--;
> > }
> >
> > migrate()
> > {
> > if (!page->users)
> > migrate_page();
> > }
>
> Hmm, but in this case we want migration to block new map/unmap
> operations. So a vanilla refcount won't work.
Yeah, correct - migration needs negative values so that map would
wait until it's positive (or zero).
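Something like this minimal sketch, in other words (not the code from the
series, just an illustration; SKETCH_MIGRATING is a made-up negative sentinel):

#include <linux/atomic.h>

/* Sketch only: value >= 0 counts active mappers, negative means migration */
#define SKETCH_MIGRATING	(-1)

static bool sketch_try_start_migration(atomic_t *users)
{
	int old = 0;

	/* succeeds only when there are no active mappers at all */
	return atomic_try_cmpxchg(users, &old, SKETCH_MIGRATING);
}

static void sketch_map(atomic_t *users)
{
	int old = atomic_read(users);

	do {
		if (old < 0) {
			/* migration in progress: wait, don't bump the count */
			cpu_relax();
			old = atomic_read(users);
			continue;
		}
	} while (!atomic_try_cmpxchg(users, &old, old + 1));
}

static void sketch_unmap(atomic_t *users)
{
	atomic_dec(users);
}

Which is, essentially, what the zspage lock discussed below ends up doing,
just with read/write naming instead of users/migration.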
> > > Just my 2c.
> >
> > Perhaps we can sprinkle some lockdep on it. For instance:
>
> Honestly this looks like more reason to use existing lock primitives to
> me. What are the candidates? I assume rw_semaphore, anything else?
Right, rwsem "was" the first choice.
> I guess the main reason you didn't use a rw_semaphore is the extra
> memory usage.
sizeof(struct zspage) change is one thing. Another thing is that
zspage->lock is taken from atomic sections, pretty much everywhere.
compaction/migration write-lock it under the pool rwlock and class spinlock,
but both compaction and migration now bail out with EAGAIN if the lock is
already locked, so that is sorted out.
The remaining problem is map(), which takes the zspage read-lock under the pool
rwlock. The RFC series (which you hated with passion :P) converted all zsmalloc
locks into preemptible ones because of this - zspage->lock is a nested leaf-lock,
so it cannot schedule unless the locks it's nested under permit it (needless to
say, neither rwlock nor spinlock permit it).
> Seems like it uses ~32 bytes more than rwlock_t on x86_64.
> That's per zspage. Depending on how many compressed pages we have
> per-zspage this may not be too bad.
So on a 16GB laptop our memory pressure test at peak used approx 1M zspages.
That is 32 bytes * 1M ~ 32MB of extra memory use. Not an alarming amount,
less than what a single browser tab needs nowadays. I suppose on 4GB/8GB
devices that will be even smaller (because those devices generate fewer zspages).
Numbers are not the main issue, however.
^ permalink raw reply [flat|nested] 73+ messages in thread* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-04 6:59 ` Sergey Senozhatsky
@ 2025-02-04 17:19 ` Yosry Ahmed
2025-02-05 2:43 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-04 17:19 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Tue, Feb 04, 2025 at 03:59:42PM +0900, Sergey Senozhatsky wrote:
> On (25/02/03 21:11), Yosry Ahmed wrote:
> > > > We also lose some debugging capabilities as Hillf pointed out in another
> > > > patch.
> > >
> > > So that zspage lock should have not been a lock, I think, it's a ref-counter
> > > and it's being used as one
> > >
> > > map()
> > > {
> > > page->users++;
> > > }
> > >
> > > unmap()
> > > {
> > > page->users--;
> > > }
> > >
> > > migrate()
> > > {
> > > if (!page->users)
> > > migrate_page();
> > > }
> >
> > Hmm, but in this case we want migration to block new map/unmap
> > operations. So a vanilla refcount won't work.
>
> Yeah, correct - migration needs negative values so that map would
> wait until it's positive (or zero).
>
> > > > Just my 2c.
> > >
> > > Perhaps we can sprinkle some lockdep on it. For instance:
> >
> > Honestly this looks like more reason to use existing lock primitives to
> > me. What are the candidates? I assume rw_semaphore, anything else?
>
> Right, rwsem "was" the first choice.
>
> > I guess the main reason you didn't use a rw_semaphore is the extra
> > memory usage.
>
> sizeof(struct zs_page) change is one thing. Another thing is that
> zspage->lock is taken from atomic sections, pretty much everywhere.
> compaction/migration write-lock it under pool rwlock and class spinlock,
> but both compaction and migration now EAGAIN if the lock is locked
> already, so that is sorted out.
>
> The remaining problem is map(), which takes zspage read-lock under pool
> rwlock. RFC series (which you hated with passion :P) converted all zsmalloc
> into preemptible ones because of this - zspage->lock is a nested leaf-lock,
> so it cannot schedule unless locks it's nested under permit it (needless to
> say neither rwlock nor spinlock permit it).
Hmm, so we want the lock to be preemptible, but we don't want to use an
existing preemptible lock because it may be held from atomic context.
I think one problem here is that the lock you are introducing is a
spinning lock but the lock holder can be preempted. This is why spinning
locks do not allow preemption. Others waiting for the lock can spin
waiting for a process that is scheduled out.
For example, the compaction/migration code could be sleeping holding the
write lock, and a map() call would spin waiting for that sleeping task.
I wonder if there's a way to rework the locking instead to avoid the
nesting. It seems like sometimes we lock the zspage with the pool lock
held, sometimes with the class lock held, and sometimes with no lock
held.
What are the rules here for acquiring the zspage lock? Do we need to
hold another lock just to make sure the zspage does not go away from
under us? Can we use RCU or something similar to do that instead?
>
> > Seems like it uses ~32 bytes more than rwlock_t on x86_64.
> > That's per zspage. Depending on how many compressed pages we have
> > per-zspage this may not be too bad.
>
> So on a 16GB laptop our memory pressure test at peak used approx 1M zspages.
> That is 32 bytes * 1M ~ 32MB of extra memory use. Not alarmingly a lot,
> less than what a single browser tab needs nowadays. I suppose on 4GB/8GB
> that will be even smaller (because those device generate less zspages).
> Numbers are not the main issue, however.
>
^ permalink raw reply [flat|nested] 73+ messages in thread* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-04 17:19 ` Yosry Ahmed
@ 2025-02-05 2:43 ` Sergey Senozhatsky
2025-02-05 19:06 ` Yosry Ahmed
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-05 2:43 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/02/04 17:19), Yosry Ahmed wrote:
> > sizeof(struct zs_page) change is one thing. Another thing is that
> > zspage->lock is taken from atomic sections, pretty much everywhere.
> > compaction/migration write-lock it under pool rwlock and class spinlock,
> > but both compaction and migration now EAGAIN if the lock is locked
> > already, so that is sorted out.
> >
> > The remaining problem is map(), which takes zspage read-lock under pool
> > rwlock. RFC series (which you hated with passion :P) converted all zsmalloc
> > into preemptible ones because of this - zspage->lock is a nested leaf-lock,
> > so it cannot schedule unless locks it's nested under permit it (needless to
> > say neither rwlock nor spinlock permit it).
>
> Hmm, so we want the lock to be preemptible, but we don't want to use an
> existing preemptible lock because it may be held from atomic context.
>
> I think one problem here is that the lock you are introducing is a
> spinning lock but the lock holder can be preempted. This is why spinning
> locks do not allow preemption. Others waiting for the lock can spin
> waiting for a process that is scheduled out.
>
> For example, the compaction/migration code could be sleeping holding the
> write lock, and a map() call would spin waiting for that sleeping task.
write-lock holders cannot sleep, that's the key part.
So the rules are:
1) writer cannot sleep
- migration/compaction runs in atomic context and grabs
write-lock only from atomic context
- write-locking function disables preemption before lock(), just to be
safe, and enables it after unlock()
2) writer does not spin waiting
- that's why there is only write_try_lock function
- compaction and migration bail out when they cannot lock the
zspage
3) readers can sleep and can spin waiting for a lock
- other (even preempted) readers don't block new readers
- writers don't sleep, they always unlock
> I wonder if there's a way to rework the locking instead to avoid the
> nesting. It seems like sometimes we lock the zspage with the pool lock
> held, sometimes with the class lock held, and sometimes with no lock
> held.
>
> What are the rules here for acquiring the zspage lock?
Most of that code is not written by me, but I think the rule is to disable
"migration" be it via pool lock or class lock.
> Do we need to hold another lock just to make sure the zspage does not go
> away from under us?
Yes, the page cannot go away via "normal" path:
zs_free(last object) -> zspage becomes empty -> free zspage
so when we have active mapping() it's only migration and compaction
that can free zspage (its content is migrated and so it becomes empty).
> Can we use RCU or something similar to do that instead?
Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
patterns the clients have. I suspect we'd need to synchronize RCU every
time a zspage is freed: zs_free() [this one is complicated], or migration,
or compaction? Sounds like anti-pattern for RCU?
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-05 2:43 ` Sergey Senozhatsky
@ 2025-02-05 19:06 ` Yosry Ahmed
2025-02-06 3:05 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-05 19:06 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Wed, Feb 05, 2025 at 11:43:16AM +0900, Sergey Senozhatsky wrote:
> On (25/02/04 17:19), Yosry Ahmed wrote:
> > > sizeof(struct zs_page) change is one thing. Another thing is that
> > > zspage->lock is taken from atomic sections, pretty much everywhere.
> > > compaction/migration write-lock it under pool rwlock and class spinlock,
> > > but both compaction and migration now EAGAIN if the lock is locked
> > > already, so that is sorted out.
> > >
> > > The remaining problem is map(), which takes zspage read-lock under pool
> > > rwlock. RFC series (which you hated with passion :P) converted all zsmalloc
> > > into preemptible ones because of this - zspage->lock is a nested leaf-lock,
> > > so it cannot schedule unless locks it's nested under permit it (needless to
> > > say neither rwlock nor spinlock permit it).
> >
> > Hmm, so we want the lock to be preemptible, but we don't want to use an
> > existing preemptible lock because it may be held from atomic context.
> >
> > I think one problem here is that the lock you are introducing is a
> > spinning lock but the lock holder can be preempted. This is why spinning
> > locks do not allow preemption. Others waiting for the lock can spin
> > waiting for a process that is scheduled out.
> >
> > For example, the compaction/migration code could be sleeping holding the
> > write lock, and a map() call would spin waiting for that sleeping task.
>
> write-lock holders cannot sleep, that's the key part.
>
> So the rules are:
>
> 1) writer cannot sleep
> - migration/compaction runs in atomic context and grabs
> write-lock only from atomic context
> - write-locking function disables preemption before lock(), just to be
> safe, and enables it after unlock()
>
> 2) writer does not spin waiting
> - that's why there is only write_try_lock function
> - compaction and migration bail out when they cannot lock the
> zspage
>
> 3) readers can sleep and can spin waiting for a lock
> - other (even preempted) readers don't block new readers
> - writers don't sleep, they always unlock
That's useful, thanks. If we go with custom locking we need to document
this clearly and add debug checks where possible.
>
> > I wonder if there's a way to rework the locking instead to avoid the
> > nesting. It seems like sometimes we lock the zspage with the pool lock
> > held, sometimes with the class lock held, and sometimes with no lock
> > held.
> >
> > What are the rules here for acquiring the zspage lock?
>
> Most of that code is not written by me, but I think the rule is to disable
> "migration" be it via pool lock or class lock.
It seems like we're not holding either of these locks in
async_free_zspage() when we call lock_zspage(). Is it safe for a
different reason?
>
> > Do we need to hold another lock just to make sure the zspage does not go
> > away from under us?
>
> Yes, the page cannot go away via "normal" path:
> zs_free(last object) -> zspage becomes empty -> free zspage
>
> so when we have active mapping() it's only migration and compaction
> that can free zspage (its content is migrated and so it becomes empty).
>
> > Can we use RCU or something similar to do that instead?
>
> Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> patterns the clients have. I suspect we'd need to synchronize RCU every
> time a zspage is freed: zs_free() [this one is complicated], or migration,
> or compaction? Sounds like anti-pattern for RCU?
Can't we use kfree_rcu() instead of synchronizing? Not sure if this
would still be an antipattern tbh. It just seems like the current
locking scheme is really complicated :/
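To make that a bit more concrete, a rough sketch of the deferred-free route
(not from the series; it assumes struct zspage grows a struct rcu_head, and
since zspages come from a kmem_cache it uses call_rcu() with a tiny callback
rather than kfree_rcu() itself):

#include <linux/rcupdate.h>

static void zspage_free_rcu(struct rcu_head *head)
{
	/* assumes a new 'struct rcu_head rcu;' member in struct zspage */
	struct zspage *zspage = container_of(head, struct zspage, rcu);

	/* runs after a grace period, readers can no longer see the zspage */
	cache_free_zspage(zspage->pool, zspage);
}

static void free_zspage_deferred(struct zspage *zspage)
{
	/* called where the zspage is freed today (zs_free()/migration) */
	call_rcu(&zspage->rcu, zspage_free_rcu);
}

This avoids synchronize_rcu() on the free path, but it still queues one RCU
callback per freed zspage, which is the call-volume concern raised below.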
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-05 19:06 ` Yosry Ahmed
@ 2025-02-06 3:05 ` Sergey Senozhatsky
2025-02-06 3:28 ` Sergey Senozhatsky
2025-02-06 16:19 ` Yosry Ahmed
0 siblings, 2 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 3:05 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/02/05 19:06), Yosry Ahmed wrote:
> > > For example, the compaction/migration code could be sleeping holding the
> > > write lock, and a map() call would spin waiting for that sleeping task.
> >
> > write-lock holders cannot sleep, that's the key part.
> >
> > So the rules are:
> >
> > 1) writer cannot sleep
> > - migration/compaction runs in atomic context and grabs
> > write-lock only from atomic context
> > - write-locking function disables preemption before lock(), just to be
> > safe, and enables it after unlock()
> >
> > 2) writer does not spin waiting
> > - that's why there is only write_try_lock function
> > - compaction and migration bail out when they cannot lock the
> > zspage
> >
> > 3) readers can sleep and can spin waiting for a lock
> > - other (even preempted) readers don't block new readers
> > - writers don't sleep, they always unlock
>
> That's useful, thanks. If we go with custom locking we need to document
> this clearly and add debug checks where possible.
Sure. That's what it currently looks like (can always improve)
---
/*
* zspage lock permits preemption on the reader-side (there can be multiple
* readers). Writers (exclusive zspage ownership), on the other hand, are
* always run in atomic context and cannot spin waiting for a (potentially
* preempted) reader to unlock zspage. This, basically, means that writers
* can only call write-try-lock and must bail out if it didn't succeed.
*
* At the same time, writers cannot reschedule under zspage write-lock,
* so readers can spin waiting for the writer to unlock zspage.
*/
static void zspage_read_lock(struct zspage *zspage)
{
atomic_t *lock = &zspage->lock;
int old = atomic_read_acquire(lock);
do {
if (old == ZS_PAGE_WRLOCKED) {
cpu_relax();
old = atomic_read_acquire(lock);
continue;
}
} while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
}
static void zspage_read_unlock(struct zspage *zspage)
{
atomic_dec_return_release(&zspage->lock);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
}
static bool zspage_try_write_lock(struct zspage *zspage)
{
atomic_t *lock = &zspage->lock;
int old = ZS_PAGE_UNLOCKED;
preempt_disable();
if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
return true;
}
preempt_enable();
return false;
}
static void zspage_write_unlock(struct zspage *zspage)
{
atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
preempt_enable();
}
---
Maybe I'll just copy-paste the locking rules list, a list is always cleaner.
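And a usage sketch, to show how the two sides are expected to call it (the
caller names are made up, the bail-out mirrors what migration/compaction do):

/* reader side, e.g. the map()/object-read path: may sleep under the lock */
static void sketch_reader_path(struct zspage *zspage)
{
	zspage_read_lock(zspage);
	/* ... access object memory, potentially rescheduling ... */
	zspage_read_unlock(zspage);
}

/* writer side (migration/compaction): atomic context, never spins waiting */
static int sketch_writer_path(struct zspage *zspage)
{
	if (!zspage_try_write_lock(zspage))
		return -EAGAIN;	/* bail out, the caller retries later */

	/* ... move objects around, must not sleep here ... */

	zspage_write_unlock(zspage);
	return 0;
}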
> > > I wonder if there's a way to rework the locking instead to avoid the
> > > nesting. It seems like sometimes we lock the zspage with the pool lock
> > > held, sometimes with the class lock held, and sometimes with no lock
> > > held.
> > >
> > > What are the rules here for acquiring the zspage lock?
> >
> > Most of that code is not written by me, but I think the rule is to disable
> > "migration" be it via pool lock or class lock.
>
> It seems like we're not holding either of these locks in
> async_free_zspage() when we call lock_zspage(). Is it safe for a
> different reason?
I think we hold size class lock there. async-free is only for pages that
reached 0 usage ratio (empty fullness group), so they don't hold any
objects any more and from here such zspages either get freed or
find_get_zspage() recovers them from fullness 0 and allocates an object.
Both are synchronized by size class lock.
> > Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> > patterns the clients have. I suspect we'd need to synchronize RCU every
> > time a zspage is freed: zs_free() [this one is complicated], or migration,
> > or compaction? Sounds like anti-pattern for RCU?
>
> Can't we use kfree_rcu() instead of synchronizing? Not sure if this
> would still be an antipattern tbh.
Yeah, I don't know. The last time I wrongly used kfree_rcu() it caused a
27% performance drop (some internal code). This zspage thingy may turn out
better, but it still has the potential to generate a high number of RCU calls,
depending on the clients; the chances of that are probably too high. Apart from
that, kvfree_rcu() can sleep, as far as I understand, so zram might have
some extra things to deal with, namely slot-free notifications, which can
be called from softirq and are always called under a spinlock:
mm slot-free -> zram slot-free -> zs_free -> empty zspage -> kfree_rcu
> It just seems like the current locking scheme is really complicated :/
That's very true.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-06 3:05 ` Sergey Senozhatsky
@ 2025-02-06 3:28 ` Sergey Senozhatsky
2025-02-06 16:19 ` Yosry Ahmed
1 sibling, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-06 3:28 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
On (25/02/06 12:05), Sergey Senozhatsky wrote:
>
> Sure. That's what it currently looks like (can always improve)
>
- added must-check
- added preemptible() check // just in case
- added locking rules list
Oh, and also switched to acquire/release semantics, like you suggested
a couple of days ago.
---
/*
* zspage locking rules:
*
* 1) writer-lock is exclusive
*
* 2) writer-lock owner cannot sleep
*
* 3) writer-lock owner cannot spin waiting for the lock
* - caller (e.g. compaction and migration) must check return value and
* handle locking failures
* - there is only TRY variant of writer-lock function
*
* 4) reader-lock owners (multiple) can sleep
*
* 5) reader-lock owners can spin waiting for the lock, in any context
* - existing readers (even preempted ones) don't block new readers
* - writer-lock owners never sleep, always unlock at some point
*/
static void zspage_read_lock(struct zspage *zspage)
{
atomic_t *lock = &zspage->lock;
int old = atomic_read_acquire(lock);
do {
if (old == ZS_PAGE_WRLOCKED) {
cpu_relax();
old = atomic_read_acquire(lock);
continue;
}
} while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
}
static void zspage_read_unlock(struct zspage *zspage)
{
atomic_dec_return_release(&zspage->lock);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
}
static __must_check bool zspage_try_write_lock(struct zspage *zspage)
{
atomic_t *lock = &zspage->lock;
int old = ZS_PAGE_UNLOCKED;
WARN_ON_ONCE(preemptible());
preempt_disable();
if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
#endif
return true;
}
preempt_enable();
return false;
}
static void zspage_write_unlock(struct zspage *zspage)
{
atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
rwsem_release(&zspage->lockdep_map, _RET_IP_);
#endif
preempt_enable();
}
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-06 3:05 ` Sergey Senozhatsky
2025-02-06 3:28 ` Sergey Senozhatsky
@ 2025-02-06 16:19 ` Yosry Ahmed
2025-02-07 2:48 ` Sergey Senozhatsky
1 sibling, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-06 16:19 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On Thu, Feb 06, 2025 at 12:05:55PM +0900, Sergey Senozhatsky wrote:
> On (25/02/05 19:06), Yosry Ahmed wrote:
> > > > For example, the compaction/migration code could be sleeping holding the
> > > > write lock, and a map() call would spin waiting for that sleeping task.
> > >
> > > write-lock holders cannot sleep, that's the key part.
> > >
> > > So the rules are:
> > >
> > > 1) writer cannot sleep
> > > - migration/compaction runs in atomic context and grabs
> > > write-lock only from atomic context
> > > - write-locking function disables preemption before lock(), just to be
> > > safe, and enables it after unlock()
> > >
> > > 2) writer does not spin waiting
> > > - that's why there is only write_try_lock function
> > > - compaction and migration bail out when they cannot lock the
> > > zspage
> > >
> > > 3) readers can sleep and can spin waiting for a lock
> > > - other (even preempted) readers don't block new readers
> > > - writers don't sleep, they always unlock
> >
> > That's useful, thanks. If we go with custom locking we need to document
> > this clearly and add debug checks where possible.
>
> Sure. That's what it currently looks like (can always improve)
>
> ---
> /*
> * zspage lock permits preemption on the reader-side (there can be multiple
> * readers). Writers (exclusive zspage ownership), on the other hand, are
> * always run in atomic context and cannot spin waiting for a (potentially
> * preempted) reader to unlock zspage. This, basically, means that writers
> * can only call write-try-lock and must bail out if it didn't succeed.
> *
> * At the same time, writers cannot reschedule under zspage write-lock,
> * so readers can spin waiting for the writer to unlock zspage.
> */
> static void zspage_read_lock(struct zspage *zspage)
> {
> atomic_t *lock = &zspage->lock;
> int old = atomic_read_acquire(lock);
>
> do {
> if (old == ZS_PAGE_WRLOCKED) {
> cpu_relax();
> old = atomic_read_acquire(lock);
> continue;
> }
> } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> #endif
> }
>
> static void zspage_read_unlock(struct zspage *zspage)
> {
> atomic_dec_return_release(&zspage->lock);
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> rwsem_release(&zspage->lockdep_map, _RET_IP_);
> #endif
> }
>
> static bool zspage_try_write_lock(struct zspage *zspage)
> {
> atomic_t *lock = &zspage->lock;
> int old = ZS_PAGE_UNLOCKED;
>
> preempt_disable();
> if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
> #endif
> return true;
> }
>
> preempt_enable();
> return false;
> }
>
> static void zspage_write_unlock(struct zspage *zspage)
> {
> atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> rwsem_release(&zspage->lockdep_map, _RET_IP_);
> #endif
> preempt_enable();
> }
> ---
>
> Maybe I'll just copy-paste the locking rules list, a list is always cleaner.
Thanks. I think it would be nice if we could also get someone with
locking expertise to take a look at this.
>
> > > > I wonder if there's a way to rework the locking instead to avoid the
> > > > nesting. It seems like sometimes we lock the zspage with the pool lock
> > > > held, sometimes with the class lock held, and sometimes with no lock
> > > > held.
> > > >
> > > > What are the rules here for acquiring the zspage lock?
> > >
> > > Most of that code is not written by me, but I think the rule is to disable
> > > "migration" be it via pool lock or class lock.
> >
> > It seems like we're not holding either of these locks in
> > async_free_zspage() when we call lock_zspage(). Is it safe for a
> > different reason?
>
> I think we hold size class lock there. async-free is only for pages that
> reached 0 usage ratio (empty fullness group), so they don't hold any
> objects any more and from her such zspages either get freed or
> find_get_zspage() recovers them from fullness 0 and allocates an object.
> Both are synchronized by size class lock.
>
> > > Hmm, I don't know... zsmalloc is not "read-mostly", it's whatever data
> > > patterns the clients have. I suspect we'd need to synchronize RCU every
> > > time a zspage is freed: zs_free() [this one is complicated], or migration,
> > > or compaction? Sounds like anti-pattern for RCU?
> >
> > Can't we use kfree_rcu() instead of synchronizing? Not sure if this
> > would still be an antipattern tbh.
>
> Yeah, I don't know. The last time I wrongly used kfree_rcu() it caused a
> 27% performance drop (some internal code). This zspage thingy maybe will
> be better, but still has a potential to generate high numbers of RCU calls,
> depends on the clients. Probably the chances are too high. Apart from
> that, kvfree_rcu() can sleep, as far as I understand, so zram might have
> some extra things to deal with, namely slot-free notifications which can
> be called from softirq, and always called under spinlock:
>
> mm slot-free -> zram slot-free -> zs_free -> empty zspage -> kfree_rcu
>
> > It just seems like the current locking scheme is really complicated :/
>
> That's very true.
Seems like we have to compromise either way, custom locking or we enter
into a new complexity realm with RCU freeing.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-06 16:19 ` Yosry Ahmed
@ 2025-02-07 2:48 ` Sergey Senozhatsky
2025-02-07 21:09 ` Yosry Ahmed
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-07 2:48 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/02/06 16:19), Yosry Ahmed wrote:
> > static void zspage_read_lock(struct zspage *zspage)
> > {
> > atomic_t *lock = &zspage->lock;
> > int old = atomic_read_acquire(lock);
> >
> > do {
> > if (old == ZS_PAGE_WRLOCKED) {
> > cpu_relax();
> > old = atomic_read_acquire(lock);
> > continue;
> > }
> > } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
> >
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> > #endif
> > }
> >
> > static void zspage_read_unlock(struct zspage *zspage)
> > {
> > atomic_dec_return_release(&zspage->lock);
> >
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > rwsem_release(&zspage->lockdep_map, _RET_IP_);
> > #endif
> > }
> >
> > static bool zspage_try_write_lock(struct zspage *zspage)
> > {
> > atomic_t *lock = &zspage->lock;
> > int old = ZS_PAGE_UNLOCKED;
> >
> > preempt_disable();
> > if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
> > #endif
> > return true;
> > }
> >
> > preempt_enable();
> > return false;
> > }
> >
> > static void zspage_write_unlock(struct zspage *zspage)
> > {
> > atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > rwsem_release(&zspage->lockdep_map, _RET_IP_);
> > #endif
> > preempt_enable();
> > }
> > ---
> >
> > Maybe I'll just copy-paste the locking rules list, a list is always cleaner.
>
> Thanks. I think it would be nice if we could also get someone with
> locking expertise to take a look at this.
Sure.
I moved the lockdep acquire/release before atomic ops (except for try),
as was suggested by Sebastian in the zram sub-thread.
[..]
> Seems like we have to compromise either way, custom locking or we enter
> into a new complexity realm with RCU freeing.
Let's take the blue pill? :)
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-07 2:48 ` Sergey Senozhatsky
@ 2025-02-07 21:09 ` Yosry Ahmed
2025-02-12 5:00 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-07 21:09 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Kairui Song
On Fri, Feb 07, 2025 at 11:48:55AM +0900, Sergey Senozhatsky wrote:
> On (25/02/06 16:19), Yosry Ahmed wrote:
> > > static void zspage_read_lock(struct zspage *zspage)
> > > {
> > > atomic_t *lock = &zspage->lock;
> > > int old = atomic_read_acquire(lock);
> > >
> > > do {
> > > if (old == ZS_PAGE_WRLOCKED) {
> > > cpu_relax();
> > > old = atomic_read_acquire(lock);
> > > continue;
> > > }
> > > } while (!atomic_try_cmpxchg_acquire(lock, &old, old + 1));
> > >
> > > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > rwsem_acquire_read(&zspage->lockdep_map, 0, 0, _RET_IP_);
> > > #endif
> > > }
> > >
> > > static void zspage_read_unlock(struct zspage *zspage)
> > > {
> > > atomic_dec_return_release(&zspage->lock);
> > >
> > > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > rwsem_release(&zspage->lockdep_map, _RET_IP_);
> > > #endif
> > > }
> > >
> > > static bool zspage_try_write_lock(struct zspage *zspage)
> > > {
> > > atomic_t *lock = &zspage->lock;
> > > int old = ZS_PAGE_UNLOCKED;
> > >
> > > preempt_disable();
> > > if (atomic_try_cmpxchg_acquire(lock, &old, ZS_PAGE_WRLOCKED)) {
> > > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > rwsem_acquire(&zspage->lockdep_map, 0, 0, _RET_IP_);
> > > #endif
> > > return true;
> > > }
> > >
> > > preempt_enable();
> > > return false;
> > > }
> > >
> > > static void zspage_write_unlock(struct zspage *zspage)
> > > {
> > > atomic_set_release(&zspage->lock, ZS_PAGE_UNLOCKED);
> > > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > rwsem_release(&zspage->lockdep_map, _RET_IP_);
> > > #endif
> > > preempt_enable();
> > > }
> > > ---
> > >
> > > Maybe I'll just copy-paste the locking rules list, a list is always cleaner.
> >
> > Thanks. I think it would be nice if we could also get someone with
> > locking expertise to take a look at this.
>
> Sure.
>
> I moved the lockdep acquire/release before atomic ops (except for try),
> as was suggested by Sebastian in zram sub-thread.
>
> [..]
> > Seems like we have to compromise either way, custom locking or we enter
> > into a new complexity realm with RCU freeing.
>
> Let's take the blue pill? :)
Can we do some perf testing to make sure this custom locking is not
regressing performance (selfishly I'd like some zswap testing too)?
Perhaps Kairui can help with that since he was already testing this
series.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-07 21:09 ` Yosry Ahmed
@ 2025-02-12 5:00 ` Sergey Senozhatsky
2025-02-12 15:35 ` Yosry Ahmed
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-12 5:00 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Kairui Song
On (25/02/07 21:09), Yosry Ahmed wrote:
> Can we do some perf testing to make sure this custom locking is not
> regressing performance (selfishly I'd like some zswap testing too)?
So for zsmalloc I (usually) write some simple testing code which is
triggered via sysfs (device attr) and that is completely reproducible,
so that I compare apples to apples. In this particular case I just
have a loop that creates objects (we don't need to compress or decompress
anything, zsmalloc doesn't really care)
- echo 1 > /sys/ ... / test_prepare
for (sz = 32; sz < PAGE_SIZE; sz += 64) {
for (i = 0; i < 4096; i++) {
ent->handle = zs_malloc(zram->mem_pool, sz)
list_add(ent)
}
}
And now I just `perf stat` writes:
- perf stat echo 1 > /sys/ ... / test_exec_old
list_for_each_entry
zs_map_object(ent->handle, ZS_MM_RO);
zs_unmap_object(ent->handle)
list_for_each_entry
dst = zs_map_object(ent->handle, ZS_MM_WO);
memcpy(dst, tmpbuf, ent->sz)
zs_unmap_object(ent->handle)
- perf stat echo 1 > /sys/ ... / test_exec_new
list_for_each_entry
dst = zs_obj_read_begin(ent->handle, loc);
zs_obj_read_end(ent->handle, dst);
list_for_each_entry
zs_obj_write(ent->handle, tmpbuf, ent->sz);
- echo 1 > /sys/ ... / test_finish
free all handles and ent-s
The nice part is that we don't depend on any of the upper layers, we
don't even need to compress/decompress anything; we allocate objects
of required sizes and memcpy static data there (zsmalloc doesn't have
any opinion on that) and that's pretty much it.
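For completeness, roughly what those two loops look like in C. This is a
sketch: struct test_ent is a made-up bookkeeping struct, the pool argument
is spelled out, and the zs_obj_*() signatures are the ones proposed by this
series.

struct test_ent {
	struct list_head entry;
	unsigned long handle;
	size_t sz;
};

static void test_exec_old(struct zram *zram, struct list_head *ents,
			  void *tmpbuf)
{
	struct test_ent *ent;
	void *dst;

	/* read path: only the map/unmap cost is measured */
	list_for_each_entry(ent, ents, entry) {
		zs_map_object(zram->mem_pool, ent->handle, ZS_MM_RO);
		zs_unmap_object(zram->mem_pool, ent->handle);
	}

	/* write path */
	list_for_each_entry(ent, ents, entry) {
		dst = zs_map_object(zram->mem_pool, ent->handle, ZS_MM_WO);
		memcpy(dst, tmpbuf, ent->sz);
		zs_unmap_object(zram->mem_pool, ent->handle);
	}
}

static void test_exec_new(struct zram *zram, struct list_head *ents,
			  void *tmpbuf, void *loc)
{
	struct test_ent *ent;
	void *dst;

	list_for_each_entry(ent, ents, entry) {
		dst = zs_obj_read_begin(zram->mem_pool, ent->handle, loc);
		zs_obj_read_end(zram->mem_pool, ent->handle, dst);
	}

	list_for_each_entry(ent, ents, entry)
		zs_obj_write(zram->mem_pool, ent->handle, tmpbuf, ent->sz);
}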
OLD API
=======
10 runs
369,205,778 instructions # 0.80 insn per cycle
40,467,926 branches # 113.732 M/sec
369,002,122 instructions # 0.62 insn per cycle
40,426,145 branches # 189.361 M/sec
369,051,170 instructions # 0.45 insn per cycle
40,434,677 branches # 157.574 M/sec
369,014,522 instructions # 0.63 insn per cycle
40,427,754 branches # 201.464 M/sec
369,019,179 instructions # 0.64 insn per cycle
40,429,327 branches # 198.321 M/sec
368,973,095 instructions # 0.64 insn per cycle
40,419,245 branches # 234.210 M/sec
368,950,705 instructions # 0.64 insn per cycle
40,414,305 branches # 231.460 M/sec
369,041,288 instructions # 0.46 insn per cycle
40,432,599 branches # 155.576 M/sec
368,964,080 instructions # 0.67 insn per cycle
40,417,025 branches # 245.665 M/sec
369,036,706 instructions # 0.63 insn per cycle
40,430,860 branches # 204.105 M/sec
NEW API
=======
10 runs
265,799,293 instructions # 0.51 insn per cycle
29,834,567 branches # 170.281 M/sec
265,765,970 instructions # 0.55 insn per cycle
29,829,019 branches # 161.602 M/sec
265,764,702 instructions # 0.51 insn per cycle
29,828,015 branches # 189.677 M/sec
265,836,506 instructions # 0.38 insn per cycle
29,840,650 branches # 124.237 M/sec
265,836,061 instructions # 0.36 insn per cycle
29,842,285 branches # 137.670 M/sec
265,887,080 instructions # 0.37 insn per cycle
29,852,881 branches # 126.060 M/sec
265,769,869 instructions # 0.57 insn per cycle
29,829,873 branches # 210.157 M/sec
265,803,732 instructions # 0.58 insn per cycle
29,835,391 branches # 186.940 M/sec
265,766,624 instructions # 0.58 insn per cycle
29,827,537 branches # 212.609 M/sec
265,843,597 instructions # 0.57 insn per cycle
29,843,650 branches # 171.877 M/sec
x old-api-insn
+ new-api-insn
+-------------------------------------------------------------------------------------+
|+ x|
|+ x|
|+ x|
|+ x|
|+ x|
|+ x|
|+ x|
|+ x|
|+ x|
|+ x|
|A A|
+-------------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 3.689507e+08 3.6920578e+08 3.6901918e+08 3.6902586e+08 71765.519
+ 10 2.657647e+08 2.6588708e+08 2.6580373e+08 2.6580734e+08 42187.024
Difference at 95.0% confidence
-1.03219e+08 +/- 55308.7
-27.9705% +/- 0.0149878%
(Student's t, pooled s = 58864.4)
> Perhaps Kairui can help with that since he was already testing this
> series.
Yeah, would be great.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-12 5:00 ` Sergey Senozhatsky
@ 2025-02-12 15:35 ` Yosry Ahmed
2025-02-13 2:18 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-12 15:35 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Kairui Song
On Wed, Feb 12, 2025 at 02:00:26PM +0900, Sergey Senozhatsky wrote:
> On (25/02/07 21:09), Yosry Ahmed wrote:
> > Can we do some perf testing to make sure this custom locking is not
> > regressing performance (selfishly I'd like some zswap testing too)?
>
> So for zsmalloc I (usually) write some simple testing code which is
> triggered via sysfs (device attr) and that is completely reproducible,
> so that I compares apples to apples. In this particular case I just
> have a loop that creates objects (we don't need to compress or decompress
> anything, zsmalloc doesn't really care)
>
> - echo 1 > /sys/ ... / test_prepare
>
> for (sz = 32; sz < PAGE_SIZE; sz += 64) {
> for (i = 0; i < 4096; i++) {
> ent->handle = zs_malloc(zram->mem_pool, sz)
> list_add(ent)
> }
> }
>
>
> And now I just `perf stat` writes:
>
> - perf stat echo 1 > /sys/ ... / test_exec_old
>
> list_for_each_entry
> zs_map_object(ent->handle, ZS_MM_RO);
> zs_unmap_object(ent->handle)
>
> list_for_each_entry
> dst = zs_map_object(ent->handle, ZS_MM_WO);
> memcpy(dst, tmpbuf, ent->sz)
> zs_unmap_object(ent->handle)
>
>
>
> - perf stat echo 1 > /sys/ ... / test_exec_new
>
> list_for_each_entry
> dst = zs_obj_read_begin(ent->handle, loc);
> zs_obj_read_end(ent->handle, dst);
>
> list_for_each_entry
> zs_obj_write(ent->handle, tmpbuf, ent->sz);
>
>
> - echo 1 > /sys/ ... / test_finish
>
> free all handles and ent-s
>
>
> The nice part is that we don't depend on any of the upper layers, we
> don't even need to compress/decompress anything; we allocate objects
> of required sizes and memcpy static data there (zsmalloc doesn't have
> any opinion on that) and that's pretty much it.
>
>
> OLD API
> =======
>
> 10 runs
>
> 369,205,778 instructions # 0.80 insn per cycle
> 40,467,926 branches # 113.732 M/sec
>
> 369,002,122 instructions # 0.62 insn per cycle
> 40,426,145 branches # 189.361 M/sec
>
> 369,051,170 instructions # 0.45 insn per cycle
> 40,434,677 branches # 157.574 M/sec
>
> 369,014,522 instructions # 0.63 insn per cycle
> 40,427,754 branches # 201.464 M/sec
>
> 369,019,179 instructions # 0.64 insn per cycle
> 40,429,327 branches # 198.321 M/sec
>
> 368,973,095 instructions # 0.64 insn per cycle
> 40,419,245 branches # 234.210 M/sec
>
> 368,950,705 instructions # 0.64 insn per cycle
> 40,414,305 branches # 231.460 M/sec
>
> 369,041,288 instructions # 0.46 insn per cycle
> 40,432,599 branches # 155.576 M/sec
>
> 368,964,080 instructions # 0.67 insn per cycle
> 40,417,025 branches # 245.665 M/sec
>
> 369,036,706 instructions # 0.63 insn per cycle
> 40,430,860 branches # 204.105 M/sec
>
>
> NEW API
> =======
>
> 10 runs
>
> 265,799,293 instructions # 0.51 insn per cycle
> 29,834,567 branches # 170.281 M/sec
>
> 265,765,970 instructions # 0.55 insn per cycle
> 29,829,019 branches # 161.602 M/sec
>
> 265,764,702 instructions # 0.51 insn per cycle
> 29,828,015 branches # 189.677 M/sec
>
> 265,836,506 instructions # 0.38 insn per cycle
> 29,840,650 branches # 124.237 M/sec
>
> 265,836,061 instructions # 0.36 insn per cycle
> 29,842,285 branches # 137.670 M/sec
>
> 265,887,080 instructions # 0.37 insn per cycle
> 29,852,881 branches # 126.060 M/sec
>
> 265,769,869 instructions # 0.57 insn per cycle
> 29,829,873 branches # 210.157 M/sec
>
> 265,803,732 instructions # 0.58 insn per cycle
> 29,835,391 branches # 186.940 M/sec
>
> 265,766,624 instructions # 0.58 insn per cycle
> 29,827,537 branches # 212.609 M/sec
>
> 265,843,597 instructions # 0.57 insn per cycle
> 29,843,650 branches # 171.877 M/sec
>
>
> x old-api-insn
> + new-api-insn
> +-------------------------------------------------------------------------------------+
> |+ x|
> |+ x|
> |+ x|
> |+ x|
> |+ x|
> |+ x|
> |+ x|
> |+ x|
> |+ x|
> |+ x|
> |A A|
> +-------------------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 10 3.689507e+08 3.6920578e+08 3.6901918e+08 3.6902586e+08 71765.519
> + 10 2.657647e+08 2.6588708e+08 2.6580373e+08 2.6580734e+08 42187.024
> Difference at 95.0% confidence
> -1.03219e+08 +/- 55308.7
> -27.9705% +/- 0.0149878%
> (Student's t, pooled s = 58864.4)
Thanks for sharing these results, but I wonder if this will capture
regressions from locking changes (e.g. a lock being preemptible)? IIUC
this is counting the instructions executed in these paths, and that
won't change if the task gets preempted. Lock contention may be captured
as extra instructions, but I am not sure we'll directly see its effect
in terms of serialization and delays.
I think we also need some high level testing (e.g. concurrent
swapins/swapouts) to find that out. I think that's what Kairui's testing
covers.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-12 15:35 ` Yosry Ahmed
@ 2025-02-13 2:18 ` Sergey Senozhatsky
2025-02-13 2:57 ` Yosry Ahmed
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 2:18 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Kairui Song
On (25/02/12 15:35), Yosry Ahmed wrote:
> > Difference at 95.0% confidence
> > -1.03219e+08 +/- 55308.7
> > -27.9705% +/- 0.0149878%
> > (Student's t, pooled s = 58864.4)
>
> Thanks for sharing these results, but I wonder if this will capture
> regressions from locking changes (e.g. a lock being preemptible)? IIUC
> this is counting the instructions executed in these paths, and that
> won't change if the task gets preempted. Lock contention may be captured
> as extra instructions, but I am not sure we'll directly see its effect
> in terms of serialization and delays.
Yeah..
> I think we also need some high level testing (e.g. concurrent
> swapins/swapouts) to find that out. I think that's what Kairui's testing
> covers.
I do a fair amount of high-level testing: heavy parallel (make -j36 and
parallel dd) workloads (multiple zram devices configuration - zram0 ext4,
zram1 writeback device, zram2 swap) w/ and w/o lockdep. In addition I also
run these workloads under heavy memory pressure (a 4GB VM), when oom-killer
starts to run around with a pair of scissors. But it's mostly regression
testing.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-13 2:18 ` Sergey Senozhatsky
@ 2025-02-13 2:57 ` Yosry Ahmed
2025-02-13 7:21 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-13 2:57 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Kairui Song
February 12, 2025 at 6:18 PM, "Sergey Senozhatsky" <senozhatsky@chromium.org> wrote:
>
> On (25/02/12 15:35), Yosry Ahmed wrote:
>
> >
> > Difference at 95.0% confidence
> >
> > -1.03219e+08 +/- 55308.7
> >
> > -27.9705% +/- 0.0149878%
> >
> > (Student's t, pooled s = 58864.4)
> >
> >
> >
> > Thanks for sharing these results, but I wonder if this will capture
> >
> > regressions from locking changes (e.g. a lock being preemptible)? IIUC
> >
> > this is counting the instructions executed in these paths, and that
> >
> > won't change if the task gets preempted. Lock contention may be captured
> >
> > as extra instructions, but I am not sure we'll directly see its effect
> >
> > in terms of serialization and delays.
> >
>
> Yeah..
>
> >
> > I think we also need some high level testing (e.g. concurrent
> >
> > swapins/swapouts) to find that out. I think that's what Kairui's testing
> >
> > covers.
> >
>
> I do a fair amount of high-level testing: heavy parallel (make -j36 and
>
> parallel dd) workloads (multiple zram devices configuration - zram0 ext4,
>
> zram1 writeback device, zram2 swap) w/ and w/o lockdep. In addition I also
>
> run these workloads under heavy memory pressure (a 4GB VM), when oom-killer
>
> starts to run around with a pair of scissors. But it's mostly regression
>
> testing.
>
If we can get some numbers from these parallel workloads that would be better than the perf stats imo.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-13 2:57 ` Yosry Ahmed
@ 2025-02-13 7:21 ` Sergey Senozhatsky
2025-02-13 8:22 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 7:21 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Kairui Song
On (25/02/13 02:57), Yosry Ahmed wrote:
> > > I think we also need some high level testing (e.g. concurrent
> > >
> > > swapins/swapouts) to find that out. I think that's what Kairui's testing
> > >
> > > covers.
> > >
> >
> > I do a fair amount of high-level testing: heavy parallel (make -j36 and
> >
> > parallel dd) workloads (multiple zram devices configuration - zram0 ext4,
> >
> > zram1 writeback device, zram2 swap) w/ and w/o lockdep. In addition I also
> >
> > run these workloads under heavy memory pressure (a 4GB VM), when oom-killer
> >
> > starts to run around with a pair of scissors. But it's mostly regression
> >
> > testing.
> >
// JFI it seems your email client/service for some reason injects a lot
// of empty lines
> If we can get some numbers from these parallel workloads that would be better than the perf stats imo.
make -j24 CONFIG_PREEMPT
BASE
====
1363.64user 157.08system 1:30.89elapsed 1673%CPU (0avgtext+0avgdata 825692maxresident)k
lock stats
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&pool->migrate_lock-R: 0 0 0.00 0.00 0.00 0.00 10001 702081 0.14 104.74 125571.64 0.18
&class->lock: 1 1 0.25 0.25 0.25 0.25 6320 840542 0.06 809.72 191214.87 0.23
&zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 6452 664129 0.12 660.24 201888.61 0.30
&zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1716362 3096466 0.07 811.10 365551.24 0.12
&zstrm->lock: 0 0 0.00 0.00 0.00 0.00 0 664129 1.68 1004.80 14853571.32 22.37
PATCHED
=======
1366.50user 154.89system 1:30.33elapsed 1684%CPU (0avgtext+0avgdata 825692maxresident)k
lock stats
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&pool->lock#3-R: 0 0 0.00 0.00 0.00 0.00 3648 701979 0.12 44.09 107333.02 0.15
&class->lock: 0 0 0.00 0.00 0.00 0.00 5038 840434 0.06 1245.90 211814.60 0.25
zsmalloc-page-R: 0 0 0.00 0.00 0.00 0.00 0 664078 0.05 699.35 236641.75 0.36
zram-entry->lock: 0 0 0.00 0.00 0.00 0.00 0 3098328 0.06 2987.02 313339.11 0.10
&per_cpu_ptr(comp->stream, cpu)->lock: 0 0 0.00 0.00 0.00 0.00 23 664078 1.77 7071.30 14838397.61 22.34
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-13 7:21 ` Sergey Senozhatsky
@ 2025-02-13 8:22 ` Sergey Senozhatsky
2025-02-13 15:25 ` Yosry Ahmed
0 siblings, 1 reply; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-13 8:22 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Kairui Song,
Sergey Senozhatsky
On (25/02/13 16:21), Sergey Senozhatsky wrote:
> BASE
> ====
>
> 1363.64user 157.08system 1:30.89elapsed 1673%CPU (0avgtext+0avgdata 825692maxresident)k
>
> lock stats
>
> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> &pool->migrate_lock-R: 0 0 0.00 0.00 0.00 0.00 10001 702081 0.14 104.74 125571.64 0.18
> &class->lock: 1 1 0.25 0.25 0.25 0.25 6320 840542 0.06 809.72 191214.87 0.23
> &zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 6452 664129 0.12 660.24 201888.61 0.30
> &zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1716362 3096466 0.07 811.10 365551.24 0.12
> &zstrm->lock: 0 0 0.00 0.00 0.00 0.00 0 664129 1.68 1004.80 14853571.32 22.37
>
> PATCHED
> =======
>
> 1366.50user 154.89system 1:30.33elapsed 1684%CPU (0avgtext+0avgdata 825692maxresident)k
>
> lock stats
>
> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> &pool->lock#3-R: 0 0 0.00 0.00 0.00 0.00 3648 701979 0.12 44.09 107333.02 0.15
> &class->lock: 0 0 0.00 0.00 0.00 0.00 5038 840434 0.06 1245.90 211814.60 0.25
> zsmalloc-page-R: 0 0 0.00 0.00 0.00 0.00 0 664078 0.05 699.35 236641.75 0.36
> zram-entry->lock: 0 0 0.00 0.00 0.00 0.00 0 3098328 0.06 2987.02 313339.11 0.10
> &per_cpu_ptr(comp->stream, cpu)->lock: 0 0 0.00 0.00 0.00 0.00 23 664078 1.77 7071.30 14838397.61 22.34
So...
I added lock-stat handling to zspage->lock and to zram (in zram it's only
trylock that we can track, but it doesn't really bother me). I also
renamed zsmalloc-page-R to old zspage->lock-R and zram-entry->lock to
old zram->table[index].lock, just in case anyone cares.
Now bounces stats for zspage->lock and zram->table[index].lock look
pretty much like in BASE case.
PATCHED
=======
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&pool->lock#3-R: 0 0 0.00 0.00 0.00 0.00 2702 703841 0.22 873.90 197110.49 0.28
&class->lock: 0 0 0.00 0.00 0.00 0.00 4590 842336 0.10 3329.63 256595.70 0.30
zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 4750 665011 0.08 3360.60 258402.21 0.39
zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1722291 3099346 0.12 6943.09 721282.34 0.23
&per_cpu_ptr(comp->stream, cpu)->lock: 0 0 0.00 0.00 0.00 0.00 23 665011 2.84 7062.18 14896206.16 22.40
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-13 8:22 ` Sergey Senozhatsky
@ 2025-02-13 15:25 ` Yosry Ahmed
2025-02-14 3:33 ` Sergey Senozhatsky
0 siblings, 1 reply; 73+ messages in thread
From: Yosry Ahmed @ 2025-02-13 15:25 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Kairui Song
On Thu, Feb 13, 2025 at 05:22:20PM +0900, Sergey Senozhatsky wrote:
> On (25/02/13 16:21), Sergey Senozhatsky wrote:
> > BASE
> > ====
> >
> > 1363.64user 157.08system 1:30.89elapsed 1673%CPU (0avgtext+0avgdata 825692maxresident)k
> >
> > lock stats
> >
> > class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> > &pool->migrate_lock-R: 0 0 0.00 0.00 0.00 0.00 10001 702081 0.14 104.74 125571.64 0.18
> > &class->lock: 1 1 0.25 0.25 0.25 0.25 6320 840542 0.06 809.72 191214.87 0.23
> > &zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 6452 664129 0.12 660.24 201888.61 0.30
> > &zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1716362 3096466 0.07 811.10 365551.24 0.12
> > &zstrm->lock: 0 0 0.00 0.00 0.00 0.00 0 664129 1.68 1004.80 14853571.32 22.37
> >
> > PATCHED
> > =======
> >
> > 1366.50user 154.89system 1:30.33elapsed 1684%CPU (0avgtext+0avgdata 825692maxresident)k
> >
> > lock stats
> >
> > class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> > &pool->lock#3-R: 0 0 0.00 0.00 0.00 0.00 3648 701979 0.12 44.09 107333.02 0.15
> > &class->lock: 0 0 0.00 0.00 0.00 0.00 5038 840434 0.06 1245.90 211814.60 0.25
> > zsmalloc-page-R: 0 0 0.00 0.00 0.00 0.00 0 664078 0.05 699.35 236641.75 0.36
> > zram-entry->lock: 0 0 0.00 0.00 0.00 0.00 0 3098328 0.06 2987.02 313339.11 0.10
> > &per_cpu_ptr(comp->stream, cpu)->lock: 0 0 0.00 0.00 0.00 0.00 23 664078 1.77 7071.30 14838397.61 22.34
>
> So...
>
> I added lock-stat handling to zspage->lock and to zram (in zram it's only
> trylock that we can track, but it doesn't really bother me). I also
> renamed zsmalloc-page-R to old zspage->lock-R and zram-entry->lock to
> old zram->table[index].lock, just in case if anyone cares.
>
> Now bounces stats for zspage->lock and zram->table[index].lock look
> pretty much like in BASE case.
>
> PATCHED
> =======
>
> class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> &pool->lock#3-R: 0 0 0.00 0.00 0.00 0.00 2702 703841 0.22 873.90 197110.49 0.28
> &class->lock: 0 0 0.00 0.00 0.00 0.00 4590 842336 0.10 3329.63 256595.70 0.30
> zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 4750 665011 0.08 3360.60 258402.21 0.39
> zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1722291 3099346 0.12 6943.09 721282.34 0.23
> &per_cpu_ptr(comp->stream, cpu)->lock: 0 0 0.00 0.00 0.00 0.00 23 665011 2.84 7062.18 14896206.16 22.40
>
holdtime-max and holdtime-total are higher in the patched kernel. Not
sure if this is just an artifact of lock holders being preemptible.
* Re: [PATCHv4 14/17] zsmalloc: make zspage lock preemptible
2025-02-13 15:25 ` Yosry Ahmed
@ 2025-02-14 3:33 ` Sergey Senozhatsky
0 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-02-14 3:33 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, linux-mm,
linux-kernel, Kairui Song
On (25/02/13 15:25), Yosry Ahmed wrote:
> On Thu, Feb 13, 2025 at 05:22:20PM +0900, Sergey Senozhatsky wrote:
> > On (25/02/13 16:21), Sergey Senozhatsky wrote:
> > > BASE
> > > ====
> > >
> > > 1363.64user 157.08system 1:30.89elapsed 1673%CPU (0avgtext+0avgdata 825692maxresident)k
> > >
> > > lock stats
> > >
> > > class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> > > &pool->migrate_lock-R: 0 0 0.00 0.00 0.00 0.00 10001 702081 0.14 104.74 125571.64 0.18
> > > &class->lock: 1 1 0.25 0.25 0.25 0.25 6320 840542 0.06 809.72 191214.87 0.23
> > > &zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 6452 664129 0.12 660.24 201888.61 0.30
> > > &zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1716362 3096466 0.07 811.10 365551.24 0.12
> > > &zstrm->lock: 0 0 0.00 0.00 0.00 0.00 0 664129 1.68 1004.80 14853571.32 22.37
> > >
> > > PATCHED
> > > =======
> > >
> > > 1366.50user 154.89system 1:30.33elapsed 1684%CPU (0avgtext+0avgdata 825692maxresident)k
> > >
> > > lock stats
> > >
> > > class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> > > &pool->lock#3-R: 0 0 0.00 0.00 0.00 0.00 3648 701979 0.12 44.09 107333.02 0.15
> > > &class->lock: 0 0 0.00 0.00 0.00 0.00 5038 840434 0.06 1245.90 211814.60 0.25
> > > zsmalloc-page-R: 0 0 0.00 0.00 0.00 0.00 0 664078 0.05 699.35 236641.75 0.36
> > > zram-entry->lock: 0 0 0.00 0.00 0.00 0.00 0 3098328 0.06 2987.02 313339.11 0.10
> > > &per_cpu_ptr(comp->stream, cpu)->lock: 0 0 0.00 0.00 0.00 0.00 23 664078 1.77 7071.30 14838397.61 22.34
> >
> > So...
> >
> > I added lock-stat handling to zspage->lock and to zram (in zram it's only
> > trylock that we can track, but it doesn't really bother me). I also
> > renamed zsmalloc-page-R to the old zspage->lock-R and zram-entry->lock to
> > the old zram->table[index].lock, just in case anyone cares.
> >
> > Now the bounce stats for zspage->lock and zram->table[index].lock look
> > pretty much like in the BASE case.
> >
> > PATCHED
> > =======
> >
> > class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
> > &pool->lock#3-R: 0 0 0.00 0.00 0.00 0.00 2702 703841 0.22 873.90 197110.49 0.28
> > &class->lock: 0 0 0.00 0.00 0.00 0.00 4590 842336 0.10 3329.63 256595.70 0.30
> > zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 4750 665011 0.08 3360.60 258402.21 0.39
> > zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1722291 3099346 0.12 6943.09 721282.34 0.23
> > &per_cpu_ptr(comp->stream, cpu)->lock: 0 0 0.00 0.00 0.00 0.00 23 665011 2.84 7062.18 14896206.16 22.40
> >
>
> holdtime-max and holdtime-total are higher in the patched kernel. Not
> sure if this is just an artifact of lock holders being preemptible.
Hmm, pool->lock shouldn't be affected at all; however, BASE holds it much
longer than PATCHED:
              holdtime-max    holdtime-total
BASE          104.74          125571.64
PATCHED       44.09           107333.02
Doesn't make sense. I can understand zspage->lock and
zram->table[index].lock, but for zram->table[index].lock things look
strange when comparing run #1 and #2:
              holdtime-total
BASE          365551.24
PATCHED       313339.11
And run #3 is in its own league.
Very likely just a very very bad way to test things.
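(Sanity check: holdtime-avg looks like holdtime-total / acquisitions
-- e.g. for zram->table[index].lock in BASE: 365551.24 / 3096466 ~= 0.12,
which matches the printed avg -- so the per-acquisition hold times are
at least self-consistent; the oddities are in the -max and -total columns.)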
Re-based on 6.14.0-rc2-next-20250213.
BASE
====
PREEMPT_NONE
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&pool->migrate_lock-R: 0 0 0.00 0.00 0.00 0.00 3624 702276 0.15 35.96 126562.90 0.18
&class->lock: 0 0 0.00 0.00 0.00 0.00 5084 840733 0.06 795.26 183238.22 0.22
&zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 5358 664228 0.12 43.71 192732.71 0.29
&zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1528645 3095862 0.07 764.76 370881.23 0.12
&zstrm->lock: 0 0 0.00 0.00 0.00 0.00 0 664228 2.52 2033.81 14605911.45 21.99
PREEMPT_VOLUNTARY
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&pool->migrate_lock-R: 0 0 0.00 0.00 0.00 0.00 3039 699556 0.14 50.78 125553.59 0.18
&class->lock: 0 0 0.00 0.00 0.00 0.00 5259 838005 0.06 943.43 177108.05 0.21
&zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 5581 664096 0.12 81.56 190235.48 0.29
&zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1731706 3098570 0.07 796.87 366934.54 0.12
&zstrm->lock: 0 0 0.00 0.00 0.00 0.00 0 664096 3.38 5074.72 14472697.91 21.79
PREEMPT
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
&pool->migrate_lock-R: 0 0 0.00 0.00 0.00 0.00 2545 701827 0.14 773.56 125463.37 0.18
&class->lock: 0 0 0.00 0.00 0.00 0.00 4697 840281 0.06 1701.18 231657.38 0.28
&zspage->lock-R: 0 0 0.00 0.00 0.00 0.00 4778 664002 0.12 755.62 181215.17 0.27
&zram->table[index].lock: 0 0 0.00 0.00 0.00 0.00 1731737 3096937 0.07 1703.92 384633.29 0.12
&zstrm->lock: 0 0 0.00 0.00 0.00 0.00 0 664002 2.85 3603.20 14586900.58 21.97
So somehow holdtime-max for the per-CPU stream lock is 2.5x higher for
PREEMPT_VOLUNTARY than for PREEMPT_NONE. And class->lock holdtime-total is
much, much higher for PREEMPT than for any other preemption model. And that's
the BASE kernel, which runs fully atomic zsmalloc and zram. I call this rubbish.
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 15/17] zsmalloc: introduce new object mapping API
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (13 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 14/17] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 16/17] zram: switch to new zsmalloc " Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 17/17] zram: add might_sleep to zcomp API Sergey Senozhatsky
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky, Yosry Ahmed
The current object mapping API is a little cumbersome. First, it's
inconsistent: sometimes it returns with page-faults disabled and
sometimes with page-faults enabled. Second, and most importantly,
it enforces atomicity restrictions on its users. zs_map_object()
has to return a linear object address, which is not always possible
because some objects span multiple physical (non-contiguous) pages.
For such objects zsmalloc uses a per-CPU buffer to which the
object's data is copied before a pointer to that per-CPU buffer is
returned to the caller. This leads to another, final, issue: an
extra memcpy(). Since the caller gets a pointer to the per-CPU
buffer, it can memcpy() data only to that buffer, and during
zs_unmap_object() zsmalloc will memcpy() from that per-CPU buffer
to the physical pages that the object in question spans.
The new API splits functions by access mode (see the usage sketch below):
- zs_obj_read_begin(handle, local_copy)
  Returns a pointer to handle memory. For objects that span two
  physical pages, a local_copy buffer is used to store the object's
  data before the address is returned to the caller. Otherwise the
  object's page is kmap_local mapped directly.
- zs_obj_read_end(handle, buf)
  Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
- zs_obj_write(handle, buf, len)
  Copies len bytes from the compression buffer to handle memory
  (takes care of objects that span two pages). This does not need
  any additional (e.g. per-CPU) buffers and writes the data directly
  to the zsmalloc pool pages.
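A minimal usage sketch (illustration only, not part of this patch; the
example_rw() wrapper and its arguments are made up, only the zs_obj_*()
calls come from this series):

	/* hypothetical caller, for illustration only */
	static void example_rw(struct zs_pool *pool, unsigned long handle,
			       void *buf, size_t len, void *local_copy)
	{
		void *src;

		/* write: data is copied straight into the pool pages */
		zs_obj_write(pool, handle, buf, len);

		/*
		 * read: returns either a kmap_local address or local_copy,
		 * depending on whether the object spans two physical pages
		 */
		src = zs_obj_read_begin(pool, handle, local_copy);
		memcpy(buf, src, len);
		zs_obj_read_end(pool, handle, src);
	}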
The old API will stay around until the remaining users switch
to the new one. After that we'll also remove zsmalloc per-CPU
buffer and CPU hotplug handling.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
---
include/linux/zsmalloc.h | 8 +++
mm/zsmalloc.c | 129 +++++++++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+)
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..7d70983cf398 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -58,4 +58,12 @@ unsigned long zs_compact(struct zs_pool *pool);
unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+ void *local_copy);
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem);
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem, size_t mem_len);
+
#endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index f5b5fe732e50..f9d840f77b18 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1365,6 +1365,135 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+ void *local_copy)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+ void *addr;
+
+ WARN_ON(in_interrupt());
+
+ /* Guarantee we can get zspage from handle safely */
+ pool_read_lock(pool);
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+
+ /* Make sure migration doesn't move any pages in this zspage */
+ zspage_read_lock(zspage);
+ pool_read_unlock(pool);
+
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ /* this object is contained entirely within a page */
+ addr = kmap_local_zpdesc(zpdesc);
+ addr += off;
+ } else {
+ size_t sizes[2];
+
+ /* this object spans two pages */
+ sizes[0] = PAGE_SIZE - off;
+ sizes[1] = class->size - sizes[0];
+ addr = local_copy;
+
+ memcpy_from_page(addr, zpdesc_page(zpdesc),
+ off, sizes[0]);
+ zpdesc = get_next_zpdesc(zpdesc);
+ memcpy_from_page(addr + sizes[0],
+ zpdesc_page(zpdesc),
+ 0, sizes[1]);
+ }
+
+ if (!ZsHugePage(zspage))
+ addr += ZS_HANDLE_SIZE;
+
+ return addr;
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_begin);
+
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ if (!ZsHugePage(zspage))
+ off += ZS_HANDLE_SIZE;
+ handle_mem -= off;
+ kunmap_local(handle_mem);
+ }
+
+ zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_end);
+
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem, size_t mem_len)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+
+ WARN_ON(in_interrupt());
+
+ /* Guarantee we can get zspage from handle safely */
+ pool_read_lock(pool);
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+
+ /* Make sure migration doesn't move any pages in this zspage */
+ zspage_read_lock(zspage);
+ pool_read_unlock(pool);
+
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ /* this object is contained entirely within a page */
+ void *dst = kmap_local_zpdesc(zpdesc);
+
+ if (!ZsHugePage(zspage))
+ off += ZS_HANDLE_SIZE;
+ memcpy(dst + off, handle_mem, mem_len);
+ kunmap_local(dst);
+ } else {
+ /* this object spans two pages */
+ size_t sizes[2];
+
+ off += ZS_HANDLE_SIZE;
+ sizes[0] = PAGE_SIZE - off;
+ sizes[1] = mem_len - sizes[0];
+
+ memcpy_to_page(zpdesc_page(zpdesc), off,
+ handle_mem, sizes[0]);
+ zpdesc = get_next_zpdesc(zpdesc);
+ memcpy_to_page(zpdesc_page(zpdesc), 0,
+ handle_mem + sizes[0], sizes[1]);
+ }
+
+ zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_write);
+
/**
* zs_huge_class_size() - Returns the size (in bytes) of the first huge
* zsmalloc &size_class.
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 16/17] zram: switch to new zsmalloc object mapping API
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (14 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 15/17] zsmalloc: introduce new object mapping API Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
2025-01-31 9:06 ` [PATCHv4 17/17] zram: add might_sleep to zcomp API Sergey Senozhatsky
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Use the new read/write zsmalloc object API. For cases when an RO-mapped
object spans two physical pages (and thus requires a temp buffer),
compression streams now carry around one extra physical page.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 4 +++-
drivers/block/zram/zcomp.h | 2 ++
drivers/block/zram/zram_drv.c | 28 ++++++++++------------------
3 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index efd5919808d9..675f2a51ad5f 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -45,6 +45,7 @@ static const struct zcomp_ops *backends[] = {
static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *strm)
{
comp->ops->destroy_ctx(&strm->ctx);
+ vfree(strm->local_copy);
vfree(strm->buffer);
kfree(strm);
}
@@ -66,12 +67,13 @@ static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp)
return NULL;
}
+ strm->local_copy = vzalloc(PAGE_SIZE);
/*
* allocate 2 pages. 1 for compressed data, plus 1 extra in case if
* compressed data is larger than the original one.
*/
strm->buffer = vzalloc(2 * PAGE_SIZE);
- if (!strm->buffer) {
+ if (!strm->buffer || !strm->local_copy) {
zcomp_strm_free(comp, strm);
return NULL;
}
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index 62330829db3f..9683d4aa822d 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -34,6 +34,8 @@ struct zcomp_strm {
struct list_head entry;
/* compression buffer */
void *buffer;
+ /* local copy of handle memory */
+ void *local_copy;
struct zcomp_ctx ctx;
};
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cfbb3072ee9e..f85502ae7dce 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1553,11 +1553,11 @@ static int read_incompressible_page(struct zram *zram, struct page *page,
void *src, *dst;
handle = zram_get_handle(zram, index);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_obj_read_begin(zram->mem_pool, handle, NULL);
dst = kmap_local_page(page);
copy_page(dst, src);
kunmap_local(dst);
- zs_unmap_object(zram->mem_pool, handle);
+ zs_obj_read_end(zram->mem_pool, handle, src);
return 0;
}
@@ -1575,11 +1575,11 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
prio = zram_get_priority(zram, index);
zstrm = zcomp_stream_get(zram->comps[prio]);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
dst = kmap_local_page(page);
ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
kunmap_local(dst);
- zs_unmap_object(zram->mem_pool, handle);
+ zs_obj_read_end(zram->mem_pool, handle, src);
zcomp_stream_put(zram->comps[prio], zstrm);
return ret;
@@ -1675,7 +1675,7 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
u32 index)
{
unsigned long handle;
- void *src, *dst;
+ void *src;
/*
* This function is called from preemptible context so we don't need
@@ -1692,11 +1692,9 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
return -ENOMEM;
}
- dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
src = kmap_local_page(page);
- memcpy(dst, src, PAGE_SIZE);
+ zs_obj_write(zram->mem_pool, handle, src, PAGE_SIZE);
kunmap_local(src);
- zs_unmap_object(zram->mem_pool, handle);
zram_slot_write_lock(zram, index);
zram_set_flag(zram, index, ZRAM_HUGE);
@@ -1717,7 +1715,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
int ret = 0;
unsigned long handle;
unsigned int comp_len;
- void *dst, *mem;
+ void *mem;
struct zcomp_strm *zstrm;
unsigned long element;
bool same_filled;
@@ -1760,11 +1758,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
return -ENOMEM;
}
- dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
-
- memcpy(dst, zstrm->buffer, comp_len);
+ zs_obj_write(zram->mem_pool, handle, zstrm->buffer, comp_len);
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
- zs_unmap_object(zram->mem_pool, handle);
zram_slot_write_lock(zram, index);
zram_set_handle(zram, index, handle);
@@ -1876,7 +1871,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
unsigned int comp_len_new;
unsigned int class_index_old;
unsigned int class_index_new;
- void *src, *dst;
+ void *src;
int ret;
handle_old = zram_get_handle(zram, index);
@@ -2000,12 +1995,9 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
return 0;
}
- dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
- memcpy(dst, zstrm->buffer, comp_len_new);
+ zs_obj_write(zram->mem_pool, handle_new, zstrm->buffer, comp_len_new);
zcomp_stream_put(zram->comps[prio], zstrm);
- zs_unmap_object(zram->mem_pool, handle_new);
-
zram_free_page(zram, index);
zram_set_handle(zram, index, handle_new);
zram_set_obj_size(zram, index, comp_len_new);
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread
* [PATCHv4 17/17] zram: add might_sleep to zcomp API
2025-01-31 9:05 [PATCHv4 00/17] zsmalloc/zram: there be preemption Sergey Senozhatsky
` (15 preceding siblings ...)
2025-01-31 9:06 ` [PATCHv4 16/17] zram: switch to new zsmalloc " Sergey Senozhatsky
@ 2025-01-31 9:06 ` Sergey Senozhatsky
16 siblings, 0 replies; 73+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Minchan Kim, linux-mm, linux-kernel, Sergey Senozhatsky
Explicitly state that zcomp compress/decompress must be
called from non-atomic context.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 675f2a51ad5f..f4235735787b 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -185,6 +185,7 @@ int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
};
int ret;
+ might_sleep();
ret = comp->ops->compress(comp->params, &zstrm->ctx, &req);
if (!ret)
*dst_len = req.dst_len;
@@ -201,6 +202,7 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
.dst_len = PAGE_SIZE,
};
+ might_sleep();
return comp->ops->decompress(comp->params, &zstrm->ctx, &req);
}
--
2.48.1.362.g079036d154-goog
^ permalink raw reply [flat|nested] 73+ messages in thread