* [PATCH 1/7] zram: switch to non-atomic entry locking
2025-01-22 5:57 [PATCH 0/7] zram: preemptible writes and occasionally preemptible reads Sergey Senozhatsky
@ 2025-01-22 5:57 ` Sergey Senozhatsky
2025-01-24 5:06 ` Sergey Senozhatsky
2025-01-24 10:30 ` Hillf Danton
2025-01-22 5:57 ` [PATCH 2/7] zram: do not use per-CPU compression streams Sergey Senozhatsky
` (5 subsequent siblings)
6 siblings, 2 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-22 5:57 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky
Concurrent modifications of meta table entries are currently handled
by a per-entry spin-lock. This has a number of shortcomings.
First, it imposes atomic requirements on compression backends.
zram can call both zcomp_compress() and zcomp_decompress() under
the entry spin-lock, which implies that we can only use compression
algorithms that don't schedule/sleep/wait during compression and
decompression. This, for instance, makes it impossible to use some
asynchronous compression algorithm implementations (H/W compression,
etc.).
Second, it can potentially trigger watchdogs. For example,
entry re-compression with secondary algorithms is performed
under the entry spin-lock. Given that we chain secondary
compression algorithms, and that some of them can be configured
for best compression ratio (and worst compression speed), zram
can stay under the spin-lock for quite some time.
Do not use per-entry spin-locks and instead switch to a bucket
rwsem locking scheme. Each rwsem controls access to a number of
zram entries (a bucket). Each bucket currently protects 12 pages
of the zram disk, in order to minimize the memory footprint of
bucket locks - we cannot afford a mutex or rwsem per entry, which
can use hundreds of megabytes on a relatively common setup, yet we
still want some degree of parallelism with respect to entry access.
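(A rough back-of-the-envelope illustration, assuming roughly 40 bytes
per rw_semaphore with lock debugging disabled: a 64GB zram device has
16 million 4K entries, so per-entry rwsems would cost on the order of
640MB, while one rwsem per 12-entry bucket needs roughly 53MB. Exact
sizes are arch- and config-dependent; this is only meant to show the
order of magnitude.)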
The bucket lock is taken in write mode only to modify a zram entry;
compression and zsmalloc handle allocation are performed outside of
the bucket lock scope. Decompression is performed under the bucket
read-lock. At a glance there doesn't seem to be any immediate
performance difference:
time make -j$(nproc) ARCH=x86 all modules # defconfig, 24-vCPU VM
BEFORE
------
Kernel: arch/x86/boot/bzImage is ready (#1)
1371.43user 158.84system 1:30.70elapsed 1687%CPU (0avgtext+0avgdata 825620maxresident)k
32504inputs+1768352outputs (259major+51536895minor)pagefaults 0swaps
AFTER
-----
Kernel: arch/x86/boot/bzImage is ready (#1)
1366.79user 158.82system 1:31.17elapsed 1673%CPU (0avgtext+0avgdata 825676maxresident)k
32504inputs+1768352outputs (263major+51538123minor)pagefaults 0swaps
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 151 ++++++++++++++++++++--------------
drivers/block/zram/zram_drv.h | 8 +-
2 files changed, 98 insertions(+), 61 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9f5020b077c5..7eb7feba3cac 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -58,19 +58,44 @@ static void zram_free_page(struct zram *zram, size_t index);
static int zram_read_from_zspool(struct zram *zram, struct page *page,
u32 index);
-static int zram_slot_trylock(struct zram *zram, u32 index)
+static u32 zram_bucket_idx(u32 index)
{
- return spin_trylock(&zram->table[index].lock);
+ return index / ZRAM_PAGES_PER_BUCKET_LOCK;
}
-static void zram_slot_lock(struct zram *zram, u32 index)
+static int zram_slot_write_trylock(struct zram *zram, u32 index)
{
- spin_lock(&zram->table[index].lock);
+ u32 idx = zram_bucket_idx(index);
+
+ return down_write_trylock(&zram->locks[idx].lock);
+}
+
+static void zram_slot_write_lock(struct zram *zram, u32 index)
+{
+ u32 idx = zram_bucket_idx(index);
+
+ down_write(&zram->locks[idx].lock);
+}
+
+static void zram_slot_write_unlock(struct zram *zram, u32 index)
+{
+ u32 idx = zram_bucket_idx(index);
+
+ up_write(&zram->locks[idx].lock);
+}
+
+static void zram_slot_read_lock(struct zram *zram, u32 index)
+{
+ u32 idx = zram_bucket_idx(index);
+
+ down_read(&zram->locks[idx].lock);
}
-static void zram_slot_unlock(struct zram *zram, u32 index)
+static void zram_slot_read_unlock(struct zram *zram, u32 index)
{
- spin_unlock(&zram->table[index].lock);
+ u32 idx = zram_bucket_idx(index);
+
+ up_read(&zram->locks[idx].lock);
}
static inline bool init_done(struct zram *zram)
@@ -93,7 +118,6 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
zram->table[index].handle = handle;
}
-/* flag operations require table entry bit_spin_lock() being held */
static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
{
@@ -229,9 +253,9 @@ static void release_pp_slot(struct zram *zram, struct zram_pp_slot *pps)
{
list_del_init(&pps->entry);
- zram_slot_lock(zram, pps->index);
+ zram_slot_write_lock(zram, pps->index);
zram_clear_flag(zram, pps->index, ZRAM_PP_SLOT);
- zram_slot_unlock(zram, pps->index);
+ zram_slot_write_unlock(zram, pps->index);
kfree(pps);
}
@@ -394,11 +418,11 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
*
* And ZRAM_WB slots simply cannot be ZRAM_IDLE.
*/
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
if (!zram_allocated(zram, index) ||
zram_test_flag(zram, index, ZRAM_WB) ||
zram_test_flag(zram, index, ZRAM_SAME)) {
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
continue;
}
@@ -410,7 +434,7 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
zram_set_flag(zram, index, ZRAM_IDLE);
else
zram_clear_flag(zram, index, ZRAM_IDLE);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
}
@@ -709,7 +733,7 @@ static int scan_slots_for_writeback(struct zram *zram, u32 mode,
INIT_LIST_HEAD(&pps->entry);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
@@ -731,7 +755,7 @@ static int scan_slots_for_writeback(struct zram *zram, u32 mode,
place_pp_slot(zram, ctl, pps);
pps = NULL;
next:
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
kfree(pps);
@@ -822,7 +846,7 @@ static ssize_t writeback_store(struct device *dev,
}
index = pps->index;
- zram_slot_lock(zram, index);
+ zram_slot_read_lock(zram, index);
/*
* scan_slots() sets ZRAM_PP_SLOT and relases slot lock, so
* slots can change in the meantime. If slots are accessed or
@@ -833,7 +857,7 @@ static ssize_t writeback_store(struct device *dev,
goto next;
if (zram_read_from_zspool(zram, page, index))
goto next;
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
bio_init(&bio, zram->bdev, &bio_vec, 1,
REQ_OP_WRITE | REQ_SYNC);
@@ -860,7 +884,7 @@ static ssize_t writeback_store(struct device *dev,
}
atomic64_inc(&zram->stats.bd_writes);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
/*
* Same as above, we release slot lock during writeback so
* slot can change under us: slot_free() or slot_free() and
@@ -882,7 +906,7 @@ static ssize_t writeback_store(struct device *dev,
zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
spin_unlock(&zram->wb_limit_lock);
next:
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
release_pp_slot(zram, pps);
cond_resched();
@@ -1001,7 +1025,7 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
for (index = *ppos; index < nr_pages; index++) {
int copied;
- zram_slot_lock(zram, index);
+ zram_slot_read_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
@@ -1019,13 +1043,13 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
ZRAM_INCOMPRESSIBLE) ? 'n' : '.');
if (count <= copied) {
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
break;
}
written += copied;
count -= copied;
next:
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
*ppos += 1;
}
@@ -1451,37 +1475,44 @@ static void zram_meta_free(struct zram *zram, u64 disksize)
zs_destroy_pool(zram->mem_pool);
vfree(zram->table);
zram->table = NULL;
+ vfree(zram->locks);
+ zram->locks = NULL;
}
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
{
- size_t num_pages, index;
+ size_t num_ents, index;
- num_pages = disksize >> PAGE_SHIFT;
- zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
+ num_ents = disksize >> PAGE_SHIFT;
+ zram->table = vzalloc(array_size(num_ents, sizeof(*zram->table)));
if (!zram->table)
- return false;
+ goto error;
+
+ num_ents /= ZRAM_PAGES_PER_BUCKET_LOCK;
+ zram->locks = vzalloc(array_size(num_ents, sizeof(*zram->locks)));
+ if (!zram->locks)
+ goto error;
zram->mem_pool = zs_create_pool(zram->disk->disk_name);
- if (!zram->mem_pool) {
- vfree(zram->table);
- zram->table = NULL;
- return false;
- }
+ if (!zram->mem_pool)
+ goto error;
if (!huge_class_size)
huge_class_size = zs_huge_class_size(zram->mem_pool);
- for (index = 0; index < num_pages; index++)
- spin_lock_init(&zram->table[index].lock);
+ for (index = 0; index < num_ents; index++)
+ init_rwsem(&zram->locks[index].lock);
+
return true;
+
+error:
+ vfree(zram->table);
+ zram->table = NULL;
+ vfree(zram->locks);
+ zram->locks = NULL;
+ return false;
}
-/*
- * To protect concurrent access to the same index entry,
- * caller should hold this table index entry's bit_spinlock to
- * indicate this index entry is accessing.
- */
static void zram_free_page(struct zram *zram, size_t index)
{
unsigned long handle;
@@ -1602,17 +1633,17 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
{
int ret;
- zram_slot_lock(zram, index);
+ zram_slot_read_lock(zram, index);
if (!zram_test_flag(zram, index, ZRAM_WB)) {
/* Slot should be locked through out the function call */
ret = zram_read_from_zspool(zram, page, index);
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
} else {
/*
* The slot should be unlocked before reading from the backing
* device.
*/
- zram_slot_unlock(zram, index);
+ zram_slot_read_unlock(zram, index);
ret = read_from_bdev(zram, page, zram_get_handle(zram, index),
parent);
@@ -1655,10 +1686,10 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
static int write_same_filled_page(struct zram *zram, unsigned long fill,
u32 index)
{
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_set_flag(zram, index, ZRAM_SAME);
zram_set_handle(zram, index, fill);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
atomic64_inc(&zram->stats.same_pages);
atomic64_inc(&zram->stats.pages_stored);
@@ -1693,11 +1724,11 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
kunmap_local(src);
zs_unmap_object(zram->mem_pool, handle);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_set_flag(zram, index, ZRAM_HUGE);
zram_set_handle(zram, index, handle);
zram_set_obj_size(zram, index, PAGE_SIZE);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
atomic64_add(PAGE_SIZE, &zram->stats.compr_data_size);
atomic64_inc(&zram->stats.huge_pages);
@@ -1718,9 +1749,9 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
bool same_filled;
/* First, free memory allocated to this slot (if any) */
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_free_page(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
mem = kmap_local_page(page);
same_filled = page_same_filled(mem, &element);
@@ -1790,10 +1821,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
zs_unmap_object(zram->mem_pool, handle);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_set_handle(zram, index, handle);
zram_set_obj_size(zram, index, comp_len);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
/* Update stats */
atomic64_inc(&zram->stats.pages_stored);
@@ -1850,7 +1881,7 @@ static int scan_slots_for_recompress(struct zram *zram, u32 mode,
INIT_LIST_HEAD(&pps->entry);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
if (!zram_allocated(zram, index))
goto next;
@@ -1871,7 +1902,7 @@ static int scan_slots_for_recompress(struct zram *zram, u32 mode,
place_pp_slot(zram, ctl, pps);
pps = NULL;
next:
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
kfree(pps);
@@ -2162,7 +2193,7 @@ static ssize_t recompress_store(struct device *dev,
if (!num_recomp_pages)
break;
- zram_slot_lock(zram, pps->index);
+ zram_slot_write_lock(zram, pps->index);
if (!zram_test_flag(zram, pps->index, ZRAM_PP_SLOT))
goto next;
@@ -2170,7 +2201,7 @@ static ssize_t recompress_store(struct device *dev,
&num_recomp_pages, threshold,
prio, prio_max);
next:
- zram_slot_unlock(zram, pps->index);
+ zram_slot_write_unlock(zram, pps->index);
release_pp_slot(zram, pps);
if (err) {
@@ -2217,9 +2248,9 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio)
}
while (n >= PAGE_SIZE) {
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_free_page(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
atomic64_inc(&zram->stats.notify_free);
index++;
n -= PAGE_SIZE;
@@ -2248,9 +2279,9 @@ static void zram_bio_read(struct zram *zram, struct bio *bio)
}
flush_dcache_page(bv.bv_page);
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_accessed(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
bio_advance_iter_single(bio, &iter, bv.bv_len);
} while (iter.bi_size);
@@ -2278,9 +2309,9 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
break;
}
- zram_slot_lock(zram, index);
+ zram_slot_write_lock(zram, index);
zram_accessed(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
bio_advance_iter_single(bio, &iter, bv.bv_len);
} while (iter.bi_size);
@@ -2321,13 +2352,13 @@ static void zram_slot_free_notify(struct block_device *bdev,
zram = bdev->bd_disk->private_data;
atomic64_inc(&zram->stats.notify_free);
- if (!zram_slot_trylock(zram, index)) {
+ if (!zram_slot_write_trylock(zram, index)) {
atomic64_inc(&zram->stats.miss_free);
return;
}
zram_free_page(zram, index);
- zram_slot_unlock(zram, index);
+ zram_slot_write_unlock(zram, index);
}
static void zram_comp_params_reset(struct zram *zram)
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index db78d7c01b9a..b272ede404b0 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -28,6 +28,7 @@
#define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
+#define ZRAM_PAGES_PER_BUCKET_LOCK 12
/*
* ZRAM is mainly used for memory efficiency so we want to keep memory
@@ -63,13 +64,17 @@ enum zram_pageflags {
/* Allocated for each disk page */
struct zram_table_entry {
unsigned long handle;
+ /* u32 suffice for flags + u32 padding */
unsigned int flags;
- spinlock_t lock;
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
ktime_t ac_time;
#endif
};
+struct zram_bucket_lock {
+ struct rw_semaphore lock;
+};
+
struct zram_stats {
atomic64_t compr_data_size; /* compressed size of pages stored */
atomic64_t failed_reads; /* can happen when memory is too low */
@@ -101,6 +106,7 @@ struct zram_stats {
struct zram {
struct zram_table_entry *table;
+ struct zram_bucket_lock *locks;
struct zs_pool *mem_pool;
struct zcomp *comps[ZRAM_MAX_COMPS];
struct zcomp_params params[ZRAM_MAX_COMPS];
--
2.48.0.rc2.279.g1de40edade-goog
* Re: [PATCH 1/7] zram: switch to non-atomic entry locking
2025-01-22 5:57 ` [PATCH 1/7] zram: switch to non-atomic entry locking Sergey Senozhatsky
@ 2025-01-24 5:06 ` Sergey Senozhatsky
2025-01-24 10:30 ` Hillf Danton
1 sibling, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-24 5:06 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel
On (25/01/22 14:57), Sergey Senozhatsky wrote:
[..]
> static bool zram_meta_alloc(struct zram *zram, u64 disksize)
> {
> - size_t num_pages, index;
> + size_t num_ents, index;
>
> - num_pages = disksize >> PAGE_SHIFT;
> - zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
> + num_ents = disksize >> PAGE_SHIFT;
> + zram->table = vzalloc(array_size(num_ents, sizeof(*zram->table)));
> if (!zram->table)
> - return false;
> + goto error;
> +
> + num_ents /= ZRAM_PAGES_PER_BUCKET_LOCK;
> + zram->locks = vzalloc(array_size(num_ents, sizeof(*zram->locks)));
This had better use ceiling division, otherwise, when the number of
pages is not a multiple of ZRAM_PAGES_PER_BUCKET_LOCK, the last
entries map to a bucket index past the end of the locks array.
I'm working on v2.
---
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0413438e4500..098e86fe70a5 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1486,7 +1486,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
if (!zram->table)
goto error;
- num_ents /= ZRAM_PAGES_PER_BUCKET_LOCK;
+ num_ents = DIV_ROUND_UP(num_ents, ZRAM_PAGES_PER_BUCKET_LOCK);
zram->locks = vzalloc(array_size(num_ents, sizeof(*zram->locks)));
if (!zram->locks)
goto error;
* Re: [PATCH 1/7] zram: switch to non-atomic entry locking
2025-01-22 5:57 ` [PATCH 1/7] zram: switch to non-atomic entry locking Sergey Senozhatsky
2025-01-24 5:06 ` Sergey Senozhatsky
@ 2025-01-24 10:30 ` Hillf Danton
2025-01-24 11:03 ` Sergey Senozhatsky
1 sibling, 1 reply; 12+ messages in thread
From: Hillf Danton @ 2025-01-24 10:30 UTC (permalink / raw)
To: Sergey Senozhatsky; +Cc: Minchan Kim, linux-mm, linux-kernel
On (25/01/22 14:57), Sergey Senozhatsky wrote:
>
> - for (index = 0; index < num_pages; index++)
> - spin_lock_init(&zram->table[index].lock);
> + for (index = 0; index < num_ents; index++)
> + init_rwsem(&zram->locks[index].lock);
Curious if lockdep trick [1] is needed here.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/kernfs/file.c?id=16b52bbee482
* Re: [PATCH 1/7] zram: switch to non-atomic entry locking
2025-01-24 10:30 ` Hillf Danton
@ 2025-01-24 11:03 ` Sergey Senozhatsky
2025-01-24 11:09 ` Sergey Senozhatsky
0 siblings, 1 reply; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-24 11:03 UTC (permalink / raw)
To: Hillf Danton; +Cc: Sergey Senozhatsky, Minchan Kim, linux-mm, linux-kernel
On (25/01/24 18:30), Hillf Danton wrote:
> On (25/01/22 14:57), Sergey Senozhatsky wrote:
> >
> > - for (index = 0; index < num_pages; index++)
> > - spin_lock_init(&zram->table[index].lock);
> > + for (index = 0; index < num_ents; index++)
> > + init_rwsem(&zram->locks[index].lock);
>
> Curious if lockdep trick [1] is needed here.
These bucket locks are not part of the v2 which I'm currently
working on.
* Re: [PATCH 1/7] zram: switch to non-atomic entry locking
2025-01-24 11:03 ` Sergey Senozhatsky
@ 2025-01-24 11:09 ` Sergey Senozhatsky
0 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-24 11:09 UTC (permalink / raw)
To: Hillf Danton
Cc: Minchan Kim, linux-mm, linux-kernel, Andrew Morton, Sergey Senozhatsky
On (25/01/24 20:03), Sergey Senozhatsky wrote:
> On (25/01/24 18:30), Hillf Danton wrote:
> > On (25/01/22 14:57), Sergey Senozhatsky wrote:
> > >
> > > - for (index = 0; index < num_pages; index++)
> > > - spin_lock_init(&zram->table[index].lock);
> > > + for (index = 0; index < num_ents; index++)
> > > + init_rwsem(&zram->locks[index].lock);
> >
> > Curious if lockdep trick [1] is needed here.
>
> These bucket locks are not part of the v2 which I'm currnetly
> working on.
v2 will also come with a draft version of a new zsmalloc API that
does not impose atomicity restrictions in zs_map_object() (no local
CPU lock and no migration rwlock involved).
* [PATCH 2/7] zram: do not use per-CPU compression streams
2025-01-22 5:57 [PATCH 0/7] zram: preemptible writes and occasionally preemptible reads Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 1/7] zram: switch to non-atomic entry locking Sergey Senozhatsky
@ 2025-01-22 5:57 ` Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 3/7] zram: remove two-staged handle allocation Sergey Senozhatsky
` (4 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-22 5:57 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky
Similarly to the per-entry spin-lock, per-CPU compression streams
also have a number of shortcomings.
First, per-CPU stream access has to be done from a non-preemptible
(atomic) section, which imposes the same atomicity requirements on
compression backends as the entry spin-lock does and makes it
impossible to use algorithms that can schedule/wait/sleep during
compression and decompression.
Second, per-CPU streams noticeably increase memory usage (actually
more like wastage) of secondary compression streams. The problem
is that secondary compression streams are allocated per-CPU, just
like the primary streams are. Yet we never use more than one
secondary stream at a time, because recompression is a
single-threaded action. This means that the remaining
num_online_cpus() - 1 streams are allocated for nothing, and this
happens per priority list (we can have several secondary compression
algorithms). Depending on the algorithm this may lead to significant
memory wastage; in addition, each stream also carries a workmem
buffer (2 physical pages).
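(Rough illustration, assuming 4K pages: on a 32-CPU machine with one
secondary compression algorithm, 31 of the 32 secondary streams sit
idle, which is 62 pages (~248KB) of workmem buffers alone, not
counting per-stream algorithm context, which for dictionary-based
algorithms can be considerably larger.)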
Instead of per-CPU streams, maintain a list of idle compression
streams and allocate new streams on-demand (something that we
used to do many years ago). This makes zram read() and write()
non-atomic and eases the requirements on compression algorithm
implementations. It also means that we now need only one
secondary stream per priority list.
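For reference, a minimal sketch of the stream get/put pattern after
this change (it mirrors the zram_write_page() hunk below; error
handling and the zsmalloc copy are omitted, "zram" and "page" are
assumed to be set up by the caller):

        struct zcomp_strm *zstrm;
        unsigned int comp_len;
        void *src;
        int ret;

        /* may sleep: picks up an idle stream or allocates a new one */
        zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
        src = kmap_local_page(page);
        ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
                             src, &comp_len);
        kunmap_local(src);
        /* ... store zstrm->buffer (comp_len bytes) into zsmalloc ... */
        /* put returns the stream to the idle list or frees a surplus one */
        zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);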
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 162 +++++++++++++++++++---------------
drivers/block/zram/zcomp.h | 17 ++--
drivers/block/zram/zram_drv.c | 29 +++---
include/linux/cpuhotplug.h | 1 -
4 files changed, 108 insertions(+), 101 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index bb514403e305..5d8298fc2616 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -43,31 +43,40 @@ static const struct zcomp_ops *backends[] = {
NULL
};
-static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm)
+static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *strm)
{
- comp->ops->destroy_ctx(&zstrm->ctx);
- vfree(zstrm->buffer);
- zstrm->buffer = NULL;
+ comp->ops->destroy_ctx(&strm->ctx);
+ vfree(strm->buffer);
+ kfree(strm);
}
-static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
+static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp)
{
+ struct zcomp_strm *strm;
int ret;
- ret = comp->ops->create_ctx(comp->params, &zstrm->ctx);
- if (ret)
- return ret;
+ strm = kzalloc(sizeof(*strm), GFP_KERNEL);
+ if (!strm)
+ return NULL;
+
+ INIT_LIST_HEAD(&strm->entry);
+
+ ret = comp->ops->create_ctx(comp->params, &strm->ctx);
+ if (ret) {
+ kfree(strm);
+ return NULL;
+ }
/*
- * allocate 2 pages. 1 for compressed data, plus 1 extra for the
- * case when compressed size is larger than the original one
+ * allocate 2 pages. 1 for compressed data, plus 1 extra in case if
+ * compressed data is larger than the original one.
*/
- zstrm->buffer = vzalloc(2 * PAGE_SIZE);
- if (!zstrm->buffer) {
- zcomp_strm_free(comp, zstrm);
- return -ENOMEM;
+ strm->buffer = vzalloc(2 * PAGE_SIZE);
+ if (!strm->buffer) {
+ zcomp_strm_free(comp, strm);
+ return NULL;
}
- return 0;
+ return strm;
}
static const struct zcomp_ops *lookup_backend_ops(const char *comp)
@@ -109,13 +118,59 @@ ssize_t zcomp_available_show(const char *comp, char *buf)
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
{
- local_lock(&comp->stream->lock);
- return this_cpu_ptr(comp->stream);
+ struct zcomp_strm *strm;
+
+ might_sleep();
+
+ while (1) {
+ spin_lock(&comp->strm_lock);
+ if (!list_empty(&comp->idle_strm)) {
+ strm = list_first_entry(&comp->idle_strm,
+ struct zcomp_strm,
+ entry);
+ list_del(&strm->entry);
+ spin_unlock(&comp->strm_lock);
+ return strm;
+ }
+
+ /* cannot allocate new stream, wait for an idle one */
+ if (comp->avail_strm >= num_online_cpus()) {
+ spin_unlock(&comp->strm_lock);
+ wait_event(comp->strm_wait,
+ !list_empty(&comp->idle_strm));
+ continue;
+ }
+
+ /* allocate new stream */
+ comp->avail_strm++;
+ spin_unlock(&comp->strm_lock);
+
+ strm = zcomp_strm_alloc(comp);
+ if (strm)
+ break;
+
+ spin_lock(&comp->strm_lock);
+ comp->avail_strm--;
+ spin_unlock(&comp->strm_lock);
+ wait_event(comp->strm_wait, !list_empty(&comp->idle_strm));
+ }
+
+ return strm;
}
-void zcomp_stream_put(struct zcomp *comp)
+void zcomp_stream_put(struct zcomp *comp, struct zcomp_strm *strm)
{
- local_unlock(&comp->stream->lock);
+ spin_lock(&comp->strm_lock);
+ if (comp->avail_strm <= num_online_cpus()) {
+ list_add(&strm->entry, &comp->idle_strm);
+ spin_unlock(&comp->strm_lock);
+ wake_up(&comp->strm_wait);
+ return;
+ }
+
+ comp->avail_strm--;
+ spin_unlock(&comp->strm_lock);
+ zcomp_strm_free(comp, strm);
}
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
@@ -148,61 +203,19 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
return comp->ops->decompress(comp->params, &zstrm->ctx, &req);
}
-int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node)
-{
- struct zcomp *comp = hlist_entry(node, struct zcomp, node);
- struct zcomp_strm *zstrm;
- int ret;
-
- zstrm = per_cpu_ptr(comp->stream, cpu);
- local_lock_init(&zstrm->lock);
-
- ret = zcomp_strm_init(comp, zstrm);
- if (ret)
- pr_err("Can't allocate a compression stream\n");
- return ret;
-}
-
-int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
-{
- struct zcomp *comp = hlist_entry(node, struct zcomp, node);
- struct zcomp_strm *zstrm;
-
- zstrm = per_cpu_ptr(comp->stream, cpu);
- zcomp_strm_free(comp, zstrm);
- return 0;
-}
-
-static int zcomp_init(struct zcomp *comp, struct zcomp_params *params)
-{
- int ret;
-
- comp->stream = alloc_percpu(struct zcomp_strm);
- if (!comp->stream)
- return -ENOMEM;
-
- comp->params = params;
- ret = comp->ops->setup_params(comp->params);
- if (ret)
- goto cleanup;
-
- ret = cpuhp_state_add_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
- if (ret < 0)
- goto cleanup;
-
- return 0;
-
-cleanup:
- comp->ops->release_params(comp->params);
- free_percpu(comp->stream);
- return ret;
-}
-
void zcomp_destroy(struct zcomp *comp)
{
- cpuhp_state_remove_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
+ struct zcomp_strm *strm;
+
+ while (!list_empty(&comp->idle_strm)) {
+ strm = list_first_entry(&comp->idle_strm,
+ struct zcomp_strm,
+ entry);
+ list_del(&strm->entry);
+ zcomp_strm_free(comp, strm);
+ }
+
comp->ops->release_params(comp->params);
- free_percpu(comp->stream);
kfree(comp);
}
@@ -229,7 +242,12 @@ struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params)
return ERR_PTR(-EINVAL);
}
- error = zcomp_init(comp, params);
+ INIT_LIST_HEAD(&comp->idle_strm);
+ init_waitqueue_head(&comp->strm_wait);
+ spin_lock_init(&comp->strm_lock);
+
+ comp->params = params;
+ error = comp->ops->setup_params(comp->params);
if (error) {
kfree(comp);
return ERR_PTR(error);
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index ad5762813842..62330829db3f 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -3,10 +3,10 @@
#ifndef _ZCOMP_H_
#define _ZCOMP_H_
-#include <linux/local_lock.h>
-
#define ZCOMP_PARAM_NO_LEVEL INT_MIN
+#include <linux/wait.h>
+
/*
* Immutable driver (backend) parameters. The driver may attach private
* data to it (e.g. driver representation of the dictionary, etc.).
@@ -31,7 +31,7 @@ struct zcomp_ctx {
};
struct zcomp_strm {
- local_lock_t lock;
+ struct list_head entry;
/* compression buffer */
void *buffer;
struct zcomp_ctx ctx;
@@ -60,16 +60,15 @@ struct zcomp_ops {
const char *name;
};
-/* dynamic per-device compression frontend */
struct zcomp {
- struct zcomp_strm __percpu *stream;
+ struct list_head idle_strm;
+ spinlock_t strm_lock;
+ u32 avail_strm;
+ wait_queue_head_t strm_wait;
const struct zcomp_ops *ops;
struct zcomp_params *params;
- struct hlist_node node;
};
-int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node);
-int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node);
ssize_t zcomp_available_show(const char *comp, char *buf);
bool zcomp_available_algorithm(const char *comp);
@@ -77,7 +76,7 @@ struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params);
void zcomp_destroy(struct zcomp *comp);
struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
-void zcomp_stream_put(struct zcomp *comp);
+void zcomp_stream_put(struct zcomp *comp, struct zcomp_strm *strm);
int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
const void *src, unsigned int *dst_len);
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7eb7feba3cac..b217c29448ce 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -31,7 +31,6 @@
#include <linux/idr.h>
#include <linux/sysfs.h>
#include <linux/debugfs.h>
-#include <linux/cpuhotplug.h>
#include <linux/part_stat.h>
#include <linux/kernel_read_file.h>
@@ -1606,7 +1605,7 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
kunmap_local(dst);
zs_unmap_object(zram->mem_pool, handle);
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
return ret;
}
@@ -1767,14 +1766,14 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
kunmap_local(mem);
if (unlikely(ret)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
pr_err("Compression failed! err=%d\n", ret);
zs_free(zram->mem_pool, handle);
return ret;
}
if (comp_len >= huge_class_size) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
return write_incompressible_page(zram, page, index);
}
@@ -1798,7 +1797,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
@@ -1810,7 +1809,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
}
if (!zram_can_store_page(zram)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
zs_free(zram->mem_pool, handle);
return -ENOMEM;
}
@@ -1818,7 +1817,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len);
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
+ zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
zs_unmap_object(zram->mem_pool, handle);
zram_slot_write_lock(zram, index);
@@ -1977,7 +1976,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
kunmap_local(src);
if (ret) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
return ret;
}
@@ -1987,7 +1986,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
/* Continue until we make progress */
if (class_index_new >= class_index_old ||
(threshold && comp_len_new >= threshold)) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
continue;
}
@@ -2045,13 +2044,13 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
__GFP_HIGHMEM |
__GFP_MOVABLE);
if (IS_ERR_VALUE(handle_new)) {
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
return PTR_ERR((void *)handle_new);
}
dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len_new);
- zcomp_stream_put(zram->comps[prio]);
+ zcomp_stream_put(zram->comps[prio], zstrm);
zs_unmap_object(zram->mem_pool, handle_new);
@@ -2799,7 +2798,6 @@ static void destroy_devices(void)
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
- cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
}
static int __init zram_init(void)
@@ -2809,15 +2807,9 @@ static int __init zram_init(void)
BUILD_BUG_ON(__NR_ZRAM_PAGEFLAGS > sizeof(zram_te.flags) * 8);
- ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
- zcomp_cpu_up_prepare, zcomp_cpu_dead);
- if (ret < 0)
- return ret;
-
ret = class_register(&zram_control_class);
if (ret) {
pr_err("Unable to register zram-control class\n");
- cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
return ret;
}
@@ -2826,7 +2818,6 @@ static int __init zram_init(void)
if (zram_major <= 0) {
pr_err("Unable to get major number\n");
class_unregister(&zram_control_class);
- cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
return -EBUSY;
}
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6cc5e484547c..092ace7db8ee 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -119,7 +119,6 @@ enum cpuhp_state {
CPUHP_MM_ZS_PREPARE,
CPUHP_MM_ZSWP_POOL_PREPARE,
CPUHP_KVM_PPC_BOOK3S_PREPARE,
- CPUHP_ZCOMP_PREPARE,
CPUHP_TIMERS_PREPARE,
CPUHP_TMIGR_PREPARE,
CPUHP_MIPS_SOC_PREPARE,
--
2.48.0.rc2.279.g1de40edade-goog
* [PATCH 3/7] zram: remove two-staged handle allocation
2025-01-22 5:57 [PATCH 0/7] zram: preemptible writes and occasionally preemptible reads Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 1/7] zram: switch to non-atomic entry locking Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 2/7] zram: do not use per-CPU compression streams Sergey Senozhatsky
@ 2025-01-22 5:57 ` Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 4/7] zram: permit reclaim in zstd custom allocator Sergey Senozhatsky
` (3 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-22 5:57 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky
Previously zram write() was atomic, which required us to pass
__GFP_KSWAPD_RECLAIM to the zsmalloc handle allocation on the fast
path and attempt a slow-path allocation (with re-compression) when
the fast path failed.
Since write() is not atomic anymore we can permit direct reclaim
during handle allocation, remove the fast allocation path and also
drop the re-compression path (which should reduce CPU/battery
usage).
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 39 ++++++-----------------------------
1 file changed, 6 insertions(+), 33 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b217c29448ce..8029e0fe864a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1740,11 +1740,11 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
static int zram_write_page(struct zram *zram, struct page *page, u32 index)
{
int ret = 0;
- unsigned long handle = -ENOMEM;
- unsigned int comp_len = 0;
+ unsigned long handle;
+ unsigned int comp_len;
void *dst, *mem;
struct zcomp_strm *zstrm;
- unsigned long element = 0;
+ unsigned long element;
bool same_filled;
/* First, free memory allocated to this slot (if any) */
@@ -1758,7 +1758,6 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
if (same_filled)
return write_same_filled_page(zram, element, index);
-compress_again:
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
mem = kmap_local_page(page);
ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
@@ -1777,36 +1776,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
return write_incompressible_page(zram, page, index);
}
- /*
- * handle allocation has 2 paths:
- * a) fast path is executed with preemption disabled (for
- * per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
- * since we can't sleep;
- * b) slow path enables preemption and attempts to allocate
- * the page with __GFP_DIRECT_RECLAIM bit set. we have to
- * put per-cpu compression stream and, thus, to re-do
- * the compression once handle is allocated.
- *
- * if we have a 'non-null' handle here then we are coming
- * from the slow path and handle has already been allocated.
- */
+ handle = zs_malloc(zram->mem_pool, comp_len,
+ GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
if (IS_ERR_VALUE(handle))
- handle = zs_malloc(zram->mem_pool, comp_len,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
- if (IS_ERR_VALUE(handle)) {
- zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
- atomic64_inc(&zram->stats.writestall);
- handle = zs_malloc(zram->mem_pool, comp_len,
- GFP_NOIO | __GFP_HIGHMEM |
- __GFP_MOVABLE);
- if (IS_ERR_VALUE(handle))
- return PTR_ERR((void *)handle);
-
- goto compress_again;
- }
+ return PTR_ERR((void *)handle);
if (!zram_can_store_page(zram)) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
--
2.48.0.rc2.279.g1de40edade-goog
* [PATCH 4/7] zram: permit reclaim in zstd custom allocator
2025-01-22 5:57 [PATCH 0/7] zram: preemptible writes and occasionally preemptible reads Sergey Senozhatsky
` (2 preceding siblings ...)
2025-01-22 5:57 ` [PATCH 3/7] zram: remove two-staged handle allocation Sergey Senozhatsky
@ 2025-01-22 5:57 ` Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 5/7] zram: permit reclaim in recompression handle allocation Sergey Senozhatsky
` (2 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-22 5:57 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky
When configured with pre-trained compression/decompression
dictionary support, zstd requires a custom memory allocator,
which it calls internally from its compression/decompression
routines. This was a tad problematic, because it meant
allocation from atomic context (either under the entry
spin-lock, or the per-CPU local lock, or both). Now, with
non-atomic zram write(), those limitations are relaxed and
we can allow direct and indirect reclaim during allocations.
The tricky part is the zram read() path, which is still atomic in
one particular case (read_compressed_page()), due to zsmalloc's
handling of object mapping. However, in zram, in order to read()
something one has to write() it first, and write() is when zstd
allocates the required internal state memory, and the write() path
is non-atomic. Because of this write()-time allocation, zstd should,
in theory, not call its allocator from the atomic read() path. Keep
the non-preemptible branch just in case zstd allocates memory from
read(), but WARN_ON_ONCE() if that happens.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/backend_zstd.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c
index 1184c0036f44..53431251ea62 100644
--- a/drivers/block/zram/backend_zstd.c
+++ b/drivers/block/zram/backend_zstd.c
@@ -24,19 +24,14 @@ struct zstd_params {
/*
* For C/D dictionaries we need to provide zstd with zstd_custom_mem,
* which zstd uses internally to allocate/free memory when needed.
- *
- * This means that allocator.customAlloc() can be called from zcomp_compress()
- * under local-lock (per-CPU compression stream), in which case we must use
- * GFP_ATOMIC.
- *
- * Another complication here is that we can be configured as a swap device.
*/
static void *zstd_custom_alloc(void *opaque, size_t size)
{
- if (!preemptible())
+ /* Technically this should not happen */
+ if (WARN_ON_ONCE(!preemptible()))
return kvzalloc(size, GFP_ATOMIC);
- return kvzalloc(size, __GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
+ return kvzalloc(size, GFP_NOIO | __GFP_NOWARN);
}
static void zstd_custom_free(void *opaque, void *address)
--
2.48.0.rc2.279.g1de40edade-goog
* [PATCH 5/7] zram: permit reclaim in recompression handle allocation
2025-01-22 5:57 [PATCH 0/7] zram: preemptible writes and occasionally preemptible reads Sergey Senozhatsky
` (3 preceding siblings ...)
2025-01-22 5:57 ` [PATCH 4/7] zram: permit reclaim in zstd custom allocator Sergey Senozhatsky
@ 2025-01-22 5:57 ` Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 6/7] zram: remove writestall zram_stats member Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 7/7] zram: unlock slot bucket during recompression Sergey Senozhatsky
6 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-22 5:57 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky
The recompression path can now permit direct reclaim during new
zs_handle allocation, because it is not atomic anymore.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8029e0fe864a..faccf9923391 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2005,17 +2005,11 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
return 0;
/*
- * No direct reclaim (slow path) for handle allocation and no
- * re-compression attempt (unlike in zram_write_bvec()) since
- * we already have stored that object in zsmalloc. If we cannot
- * alloc memory for recompressed object then we bail out and
- * simply keep the old (existing) object in zsmalloc.
+ * If we cannot alloc memory for recompressed object then we bail out
+ * and simply keep the old (existing) object in zsmalloc.
*/
handle_new = zs_malloc(zram->mem_pool, comp_len_new,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
+ GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
if (IS_ERR_VALUE(handle_new)) {
zcomp_stream_put(zram->comps[prio], zstrm);
return PTR_ERR((void *)handle_new);
--
2.48.0.rc2.279.g1de40edade-goog
* [PATCH 6/7] zram: remove writestall zram_stats member
2025-01-22 5:57 [PATCH 0/7] zram: preemptible writes and occasionally preemptible reads Sergey Senozhatsky
` (4 preceding siblings ...)
2025-01-22 5:57 ` [PATCH 5/7] zram: permit reclaim in recompression handle allocation Sergey Senozhatsky
@ 2025-01-22 5:57 ` Sergey Senozhatsky
2025-01-22 5:57 ` [PATCH 7/7] zram: unlock slot bucket during recompression Sergey Senozhatsky
6 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-22 5:57 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky
There is no zsmalloc handle allocation slow path anymore, so
writestall is no longer possible. Remove it from zram_stats.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 3 +--
drivers/block/zram/zram_drv.h | 1 -
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index faccf9923391..d516f968321e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1443,9 +1443,8 @@ static ssize_t debug_stat_show(struct device *dev,
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
- "version: %d\n%8llu %8llu\n",
+ "version: %d\n0 %8llu\n",
version,
- (u64)atomic64_read(&zram->stats.writestall),
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index b272ede404b0..4f707dabed12 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -85,7 +85,6 @@ struct zram_stats {
atomic64_t huge_pages_since; /* no. of huge pages since zram set up */
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
- atomic64_t writestall; /* no. of write slow paths */
atomic64_t miss_free; /* no. of missed free */
#ifdef CONFIG_ZRAM_WRITEBACK
atomic64_t bd_count; /* no. of pages in backing device */
--
2.48.0.rc2.279.g1de40edade-goog
* [PATCH 7/7] zram: unlock slot bucket during recompression
2025-01-22 5:57 [PATCH 0/7] zram: preemptible writes and occasionally preemptible reads Sergey Senozhatsky
` (5 preceding siblings ...)
2025-01-22 5:57 ` [PATCH 6/7] zram: remove writestall zram_stats member Sergey Senozhatsky
@ 2025-01-22 5:57 ` Sergey Senozhatsky
6 siblings, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2025-01-22 5:57 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky
As of now, recompress_slot() is called under the slot bucket
write-lock, which is suboptimal as it blocks access to a huge number
of entries.
The good news is that recompression, like writeback, makes a local
copy of the slot data (we need to decompress it anyway) before
post-processing, so we can unlock the slot bucket once we have that
local copy.
Unlock the bucket write-lock before the recompression loop (secondary
algorithms can be tried out one by one, in order of priority) and
re-acquire it right after the loop.
There is one more potentially costly operation recompress_slot()
does - new zs_handle allocation, which can schedule(). Release the
bucket write-lock before the zsmalloc allocation and grab it again
after the allocation.
In both cases, once the bucket lock is re-acquired we examine the
slot's ZRAM_PP_SLOT flag to make sure that the slot has not been
modified by a concurrent operation.
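Condensed shape of the re-validation around the handle allocation
(sketch only; zs_malloc() error handling and the zcomp stream put are
trimmed, see the hunk below for the complete version):

        /* zsmalloc handle allocation can schedule, drop the bucket lock */
        zram_slot_write_unlock(zram, index);
        handle_new = zs_malloc(zram->mem_pool, comp_len_new,
                               GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
        zram_slot_write_lock(zram, index);

        /* the slot may have been freed or rewritten while unlocked */
        if (!zram_test_flag(zram, index, ZRAM_PP_SLOT)) {
                zs_free(zram->mem_pool, handle_new);
                return 0;
        }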
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zram_drv.c | 53 +++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 15 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d516f968321e..0413438e4500 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1925,6 +1925,14 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
zram_clear_flag(zram, index, ZRAM_IDLE);
class_index_old = zs_lookup_class_index(zram->mem_pool, comp_len_old);
+
+ /*
+ * Set prio to one past current slot's compression prio, so that
+ * we automatically skip lower priority algorithms.
+ */
+ prio = zram_get_priority(zram, index) + 1;
+ /* Slot data copied out - unlock its bucket */
+ zram_slot_write_unlock(zram, index);
/*
* Iterate the secondary comp algorithms list (in order of priority)
* and try to recompress the page.
@@ -1933,13 +1941,6 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (!zram->comps[prio])
continue;
- /*
- * Skip if the object is already re-compressed with a higher
- * priority algorithm (or same algorithm).
- */
- if (prio <= zram_get_priority(zram, index))
- continue;
-
num_recomps++;
zstrm = zcomp_stream_get(zram->comps[prio]);
src = kmap_local_page(page);
@@ -1947,10 +1948,8 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
src, &comp_len_new);
kunmap_local(src);
- if (ret) {
- zcomp_stream_put(zram->comps[prio], zstrm);
- return ret;
- }
+ if (ret)
+ break;
class_index_new = zs_lookup_class_index(zram->mem_pool,
comp_len_new);
@@ -1966,6 +1965,19 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
break;
}
+ zram_slot_write_lock(zram, index);
+ /* Compression error */
+ if (ret) {
+ zcomp_stream_put(zram->comps[prio], zstrm);
+ return ret;
+ }
+
+ /* Slot has been modified concurrently */
+ if (!zram_test_flag(zram, index, ZRAM_PP_SLOT)) {
+ zcomp_stream_put(zram->comps[prio], zstrm);
+ return 0;
+ }
+
/*
* We did not try to recompress, e.g. when we have only one
* secondary algorithm and the page is already recompressed
@@ -2003,17 +2015,28 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
if (threshold && comp_len_new >= threshold)
return 0;
- /*
- * If we cannot alloc memory for recompressed object then we bail out
- * and simply keep the old (existing) object in zsmalloc.
- */
+ /* zsmalloc handle allocation can schedule, unlock slot's bucket */
+ zram_slot_write_unlock(zram, index);
handle_new = zs_malloc(zram->mem_pool, comp_len_new,
GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
+ zram_slot_write_lock(zram, index);
+
+ /*
+ * If we couldn't allocate memory for recompressed object then bail
+ * out and simply keep the old (existing) object in mempool.
+ */
if (IS_ERR_VALUE(handle_new)) {
zcomp_stream_put(zram->comps[prio], zstrm);
return PTR_ERR((void *)handle_new);
}
+ /* Slot has been modified concurrently */
+ if (!zram_test_flag(zram, index, ZRAM_PP_SLOT)) {
+ zcomp_stream_put(zram->comps[prio], zstrm);
+ zs_free(zram->mem_pool, handle_new);
+ return 0;
+ }
+
dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
memcpy(dst, zstrm->buffer, comp_len_new);
zcomp_stream_put(zram->comps[prio], zstrm);
--
2.48.0.rc2.279.g1de40edade-goog