* [PATCHv1 0/6] zsmalloc: preemptible object mapping
@ 2025-01-29 6:43 Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 1/6] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
` (6 more replies)
0 siblings, 7 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 6:43 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel, Sergey Senozhatsky
This is Part II of the series [1] that makes zram read() and write()
preemptible. This part focuses only on zsmalloc, because zsmalloc
imposes atomicity restrictions on its users. One notable example is
the object mapping API, which returns with:
a) local CPU lock held
b) zspage rwlock held
First, zsmalloc is converted to use a sleepable RW-"lock" (in fact an
atomic_t) for zspage migration protection. Second, a new handle mapping
API is introduced which doesn't use per-CPU buffers (and hence no local
CPU lock) and does fewer memcpy() calls, but requires users to provide
a pointer to a temporary buffer for object copy-in (when needed).
Third, zram is converted to the new zsmalloc mapping API and thus zram
read() becomes preemptible.
[1] https://lore.kernel.org/linux-mm/20250127072932.1289973-1-senozhatsky@chromium.org
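For illustration, a caller-side read with the new API boils down to
something like this (a sketch, not code from the series; the scratch
buffer is caller-provided and, as in zram, PAGE_SIZE bytes):
	static int example_read(struct zs_pool *pool, unsigned long handle,
				void *dst, size_t obj_len, void *scratch)
	{
		void *src;
		src = zs_obj_read_begin(pool, handle, scratch);
		/* src is a linear view of the object, preemption is fine here */
		memcpy(dst, src, obj_len);
		zs_obj_read_end(pool, handle, src);
		return 0;
	}
No per-CPU buffer is involved, so no local CPU lock is taken and the
whole read section is preemptible.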
RFC -> v1:
- Only zspage->lock (leaf-lock for zs_map_object()) is converted
to a preemptible lock. The rest of the zspool locks remain the
same (Yosry hated with passion the fact that in the RFC series all
zspool locks would become preemptible).
- New zs object mapping API (Yosry hated RFC API with passion).
We now have obj_read_begin()/obj_read_end() and obj_write().
- obj_write() saves extra memcpy() calls for objects that span two
physical pages.
- Dropped zram deferred slot-free-notification handling (I hated
it with passion)
Sergey Senozhatsky (6):
zsmalloc: factor out pool locking helpers
zsmalloc: factor out size-class locking helpers
zsmalloc: make zspage lock preemptible
zsmalloc: introduce new object mapping API
zram: switch to new zsmalloc object mapping API
zram: add might_sleep to zcomp API
drivers/block/zram/zcomp.c | 6 +-
drivers/block/zram/zcomp.h | 2 +
drivers/block/zram/zram_drv.c | 28 +--
include/linux/zsmalloc.h | 8 +
mm/zsmalloc.c | 372 ++++++++++++++++++++++++++--------
5 files changed, 311 insertions(+), 105 deletions(-)
--
2.48.1.262.g85cc9f2d1e-goog
* [PATCHv1 1/6] zsmalloc: factor out pool locking helpers
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
@ 2025-01-29 6:43 ` Sergey Senozhatsky
2025-01-29 16:59 ` Yosry Ahmed
2025-01-29 6:43 ` [PATCHv1 2/6] zsmalloc: factor out size-class " Sergey Senozhatsky
` (5 subsequent siblings)
6 siblings, 1 reply; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 6:43 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel, Sergey Senozhatsky
We currently have a mix of migrate_{read,write}_lock() helpers
that lock zspages, but it's zs_pool that actually has a ->migrate_lock,
access to which is open-coded. Factor out pool migrate locking into
helpers; the zspage migration locking API will be renamed later to
reduce confusion.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 56 +++++++++++++++++++++++++++++++++++++--------------
1 file changed, 41 insertions(+), 15 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 817626a351f8..2f8a2b139919 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -204,7 +204,8 @@ struct link_free {
};
struct zs_pool {
- const char *name;
+ /* protect page/zspage migration */
+ rwlock_t migrate_lock;
struct size_class *size_class[ZS_SIZE_CLASSES];
struct kmem_cache *handle_cachep;
@@ -213,6 +214,7 @@ struct zs_pool {
atomic_long_t pages_allocated;
struct zs_pool_stats stats;
+ atomic_t compaction_in_progress;
/* Compact classes */
struct shrinker *shrinker;
@@ -223,11 +225,35 @@ struct zs_pool {
#ifdef CONFIG_COMPACTION
struct work_struct free_work;
#endif
- /* protect page/zspage migration */
- rwlock_t migrate_lock;
- atomic_t compaction_in_progress;
+
+ const char *name;
};
+static void pool_write_unlock(struct zs_pool *pool)
+{
+ write_unlock(&pool->migrate_lock);
+}
+
+static void pool_write_lock(struct zs_pool *pool)
+{
+ write_lock(&pool->migrate_lock);
+}
+
+static void pool_read_unlock(struct zs_pool *pool)
+{
+ read_unlock(&pool->migrate_lock);
+}
+
+static void pool_read_lock(struct zs_pool *pool)
+{
+ read_lock(&pool->migrate_lock);
+}
+
+static bool pool_lock_is_contended(struct zs_pool *pool)
+{
+ return rwlock_is_contended(&pool->migrate_lock);
+}
+
static inline void zpdesc_set_first(struct zpdesc *zpdesc)
{
SetPagePrivate(zpdesc_page(zpdesc));
@@ -1206,7 +1232,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
BUG_ON(in_interrupt());
/* It guarantees it can get zspage from handle safely */
- read_lock(&pool->migrate_lock);
+ pool_read_lock(pool);
obj = handle_to_obj(handle);
obj_to_location(obj, &zpdesc, &obj_idx);
zspage = get_zspage(zpdesc);
@@ -1218,7 +1244,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
* which is smaller granularity.
*/
migrate_read_lock(zspage);
- read_unlock(&pool->migrate_lock);
+ pool_read_unlock(pool);
class = zspage_class(pool, zspage);
off = offset_in_page(class->size * obj_idx);
@@ -1453,13 +1479,13 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
* The pool->migrate_lock protects the race with zpage's migration
* so it's safe to get the page from handle.
*/
- read_lock(&pool->migrate_lock);
+ pool_read_lock(pool);
obj = handle_to_obj(handle);
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
spin_lock(&class->lock);
- read_unlock(&pool->migrate_lock);
+ pool_read_unlock(pool);
class_stat_sub(class, ZS_OBJS_INUSE, 1);
obj_free(class->size, obj);
@@ -1796,7 +1822,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* The pool migrate_lock protects the race between zpage migration
* and zs_free.
*/
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
class = zspage_class(pool, zspage);
/*
@@ -1833,7 +1859,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* Since we complete the data copy and set up new zspage structure,
* it's okay to release migration_lock.
*/
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
spin_unlock(&class->lock);
migrate_write_unlock(zspage);
@@ -1956,7 +1982,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* protect the race between zpage migration and zs_free
* as well as zpage allocation/free
*/
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
spin_lock(&class->lock);
while (zs_can_compact(class)) {
int fg;
@@ -1983,14 +2009,14 @@ static unsigned long __zs_compact(struct zs_pool *pool,
src_zspage = NULL;
if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
- || rwlock_is_contended(&pool->migrate_lock)) {
+ || pool_lock_is_contended(pool)) {
putback_zspage(class, dst_zspage);
dst_zspage = NULL;
spin_unlock(&class->lock);
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
cond_resched();
- write_lock(&pool->migrate_lock);
+ pool_write_lock(pool);
spin_lock(&class->lock);
}
}
@@ -2002,7 +2028,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
spin_unlock(&class->lock);
- write_unlock(&pool->migrate_lock);
+ pool_write_unlock(pool);
return pages_freed;
}
--
2.48.1.262.g85cc9f2d1e-goog
* [PATCHv1 2/6] zsmalloc: factor out size-class locking helpers
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 1/6] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
@ 2025-01-29 6:43 ` Sergey Senozhatsky
2025-01-29 17:01 ` Yosry Ahmed
2025-01-29 6:43 ` [PATCHv1 3/6] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
` (4 subsequent siblings)
6 siblings, 1 reply; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 6:43 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel, Sergey Senozhatsky
Move open-coded size-class locking to dedicated helpers.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 47 ++++++++++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 2f8a2b139919..0f575307675d 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -254,6 +254,16 @@ static bool pool_lock_is_contended(struct zs_pool *pool)
return rwlock_is_contended(&pool->migrate_lock);
}
+static void size_class_lock(struct size_class *class)
+{
+ spin_lock(&class->lock);
+}
+
+static void size_class_unlock(struct size_class *class)
+{
+ spin_unlock(&class->lock);
+}
+
static inline void zpdesc_set_first(struct zpdesc *zpdesc)
{
SetPagePrivate(zpdesc_page(zpdesc));
@@ -614,8 +624,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
if (class->index != i)
continue;
- spin_lock(&class->lock);
-
+ size_class_lock(class);
seq_printf(s, " %5u %5u ", i, class->size);
for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
inuse_totals[fg] += class_stat_read(class, fg);
@@ -625,7 +634,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
obj_allocated = class_stat_read(class, ZS_OBJS_ALLOCATED);
obj_used = class_stat_read(class, ZS_OBJS_INUSE);
freeable = zs_can_compact(class);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
objs_per_zspage = class->objs_per_zspage;
pages_used = obj_allocated / objs_per_zspage *
@@ -1400,7 +1409,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
class = pool->size_class[get_size_class_index(size)];
/* class->lock effectively protects the zpage migration */
- spin_lock(&class->lock);
+ size_class_lock(class);
zspage = find_get_zspage(class);
if (likely(zspage)) {
obj_malloc(pool, zspage, handle);
@@ -1411,7 +1420,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
goto out;
}
- spin_unlock(&class->lock);
+ size_class_unlock(class);
zspage = alloc_zspage(pool, class, gfp);
if (!zspage) {
@@ -1419,7 +1428,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
return (unsigned long)ERR_PTR(-ENOMEM);
}
- spin_lock(&class->lock);
+ size_class_lock(class);
obj_malloc(pool, zspage, handle);
newfg = get_fullness_group(class, zspage);
insert_zspage(class, zspage, newfg);
@@ -1430,7 +1439,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
/* We completely set up zspage so mark them as movable */
SetZsPageMovable(pool, zspage);
out:
- spin_unlock(&class->lock);
+ size_class_unlock(class);
return handle;
}
@@ -1484,7 +1493,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
- spin_lock(&class->lock);
+ size_class_lock(class);
pool_read_unlock(pool);
class_stat_sub(class, ZS_OBJS_INUSE, 1);
@@ -1494,7 +1503,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
if (fullness == ZS_INUSE_RATIO_0)
free_zspage(pool, class, zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
cache_free_handle(pool, handle);
}
EXPORT_SYMBOL_GPL(zs_free);
@@ -1828,7 +1837,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
/*
* the class lock protects zpage alloc/free in the zspage.
*/
- spin_lock(&class->lock);
+ size_class_lock(class);
/* the migrate_write_lock protects zpage access via zs_map_object */
migrate_write_lock(zspage);
@@ -1860,7 +1869,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* it's okay to release migration_lock.
*/
pool_write_unlock(pool);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
migrate_write_unlock(zspage);
zpdesc_get(newzpdesc);
@@ -1904,10 +1913,10 @@ static void async_free_zspage(struct work_struct *work)
if (class->index != i)
continue;
- spin_lock(&class->lock);
+ size_class_lock(class);
list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0],
&free_pages);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
}
list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
@@ -1915,10 +1924,10 @@ static void async_free_zspage(struct work_struct *work)
lock_zspage(zspage);
class = zspage_class(pool, zspage);
- spin_lock(&class->lock);
+ size_class_lock(class);
class_stat_sub(class, ZS_INUSE_RATIO_0, 1);
__free_zspage(pool, class, zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
}
};
@@ -1983,7 +1992,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* as well as zpage allocation/free
*/
pool_write_lock(pool);
- spin_lock(&class->lock);
+ size_class_lock(class);
while (zs_can_compact(class)) {
int fg;
@@ -2013,11 +2022,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
dst_zspage = NULL;
- spin_unlock(&class->lock);
+ size_class_unlock(class);
pool_write_unlock(pool);
cond_resched();
pool_write_lock(pool);
- spin_lock(&class->lock);
+ size_class_lock(class);
}
}
@@ -2027,7 +2036,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
if (dst_zspage)
putback_zspage(class, dst_zspage);
- spin_unlock(&class->lock);
+ size_class_unlock(class);
pool_write_unlock(pool);
return pages_freed;
--
2.48.1.262.g85cc9f2d1e-goog
* [PATCHv1 3/6] zsmalloc: make zspage lock preemptible
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 1/6] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 2/6] zsmalloc: factor out size-class " Sergey Senozhatsky
@ 2025-01-29 6:43 ` Sergey Senozhatsky
2025-01-29 11:25 ` Sergey Senozhatsky
2025-01-29 15:22 ` Uros Bizjak
2025-01-29 6:43 ` [PATCHv1 4/6] zsmalloc: introduce new object mapping API Sergey Senozhatsky
` (3 subsequent siblings)
6 siblings, 2 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 6:43 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel, Sergey Senozhatsky
Switch over from rwlock_t to an atomic_t variable that takes a
negative value when the page is under migration, or positive values
when the page is used by zsmalloc users (object map, etc.). Using an
rwsem per-zspage is a little too memory heavy; a simple atomic_t
should suffice.
The zspage lock is a leaf lock for zs_map_object(), where it's
read-acquired. Since this lock now permits preemption, extra care
needs to be taken when it is write-acquired - all writers grab it in
atomic context, so they cannot spin and wait for a (potentially
preempted) reader to unlock the zspage. There are only two writers at
the moment - migration and compaction. In both cases we use
write-try-lock and bail out if the zspage is read-locked. Writers, on
the other hand, never get preempted, so readers can spin waiting for
the writer to unlock the zspage.
With this we can implement a preemptible object mapping.
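Schematically, the writer side (see the zs_page_migrate() hunk below)
becomes a try-lock-or-bail pattern:
	if (!zspage_try_write_lock(zspage)) {
		size_class_unlock(class);
		pool_write_unlock(pool);
		return -EINVAL;
	}
	/* preemption is disabled here: copy object data, fix up links */
	zspage_write_unlock(zspage);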
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
mm/zsmalloc.c | 140 +++++++++++++++++++++++++++++++-------------------
1 file changed, 88 insertions(+), 52 deletions(-)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0f575307675d..8f4011713bc8 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -293,6 +293,9 @@ static inline void free_zpdesc(struct zpdesc *zpdesc)
__free_page(page);
}
+#define ZS_PAGE_UNLOCKED 0
+#define ZS_PAGE_WRLOCKED -1
+
struct zspage {
struct {
unsigned int huge:HUGE_BITS;
@@ -305,7 +308,7 @@ struct zspage {
struct zpdesc *first_zpdesc;
struct list_head list; /* fullness list */
struct zs_pool *pool;
- rwlock_t lock;
+ atomic_t lock;
};
struct mapping_area {
@@ -315,6 +318,64 @@ struct mapping_area {
enum zs_mapmode vm_mm; /* mapping mode */
};
+static void zspage_lock_init(struct zspage *zspage)
+{
+ atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
+}
+
+/*
+ * zspage lock permits preemption on the reader-side (there can be multiple
+ * readers). Writers (exclusive zspage ownership), on the other hand, are
+ * always run in atomic context and cannot spin waiting for a (potentially
+ * preempted) reader to unlock zspage. This, basically, means that writers
+ * can only call write-try-lock and must bail out if it didn't succeed.
+ *
+ * At the same time, writers cannot reschedule under zspage write-lock,
+ * so readers can spin waiting for the writer to unlock zspage.
+ */
+static void zspage_read_lock(struct zspage *zspage)
+{
+ atomic_t *lock = &zspage->lock;
+ int old;
+
+ while (1) {
+ old = atomic_read(lock);
+ if (old == ZS_PAGE_WRLOCKED) {
+ cpu_relax();
+ continue;
+ }
+
+ if (atomic_try_cmpxchg(lock, &old, old + 1))
+ return;
+
+ cpu_relax();
+ }
+}
+
+static void zspage_read_unlock(struct zspage *zspage)
+{
+ atomic_dec(&zspage->lock);
+}
+
+static int zspage_try_write_lock(struct zspage *zspage)
+{
+ atomic_t *lock = &zspage->lock;
+ int old = ZS_PAGE_UNLOCKED;
+
+ preempt_disable();
+ if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
+ return 1;
+
+ preempt_enable();
+ return 0;
+}
+
+static void zspage_write_unlock(struct zspage *zspage)
+{
+ atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
+ preempt_enable();
+}
+
/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
static void SetZsHugePage(struct zspage *zspage)
{
@@ -326,12 +387,6 @@ static bool ZsHugePage(struct zspage *zspage)
return zspage->huge;
}
-static void migrate_lock_init(struct zspage *zspage);
-static void migrate_read_lock(struct zspage *zspage);
-static void migrate_read_unlock(struct zspage *zspage);
-static void migrate_write_lock(struct zspage *zspage);
-static void migrate_write_unlock(struct zspage *zspage);
-
#ifdef CONFIG_COMPACTION
static void kick_deferred_free(struct zs_pool *pool);
static void init_deferred_free(struct zs_pool *pool);
@@ -1027,7 +1082,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
return NULL;
zspage->magic = ZSPAGE_MAGIC;
- migrate_lock_init(zspage);
+ zspage_lock_init(zspage);
for (i = 0; i < class->pages_per_zspage; i++) {
struct zpdesc *zpdesc;
@@ -1252,7 +1307,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
* zs_unmap_object API so delegate the locking from class to zspage
* which is smaller granularity.
*/
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
pool_read_unlock(pool);
class = zspage_class(pool, zspage);
@@ -1312,7 +1367,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
}
local_unlock(&zs_map_area.lock);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
@@ -1706,18 +1761,18 @@ static void lock_zspage(struct zspage *zspage)
/*
* Pages we haven't locked yet can be migrated off the list while we're
* trying to lock them, so we need to be careful and only attempt to
- * lock each page under migrate_read_lock(). Otherwise, the page we lock
+ * lock each page under zspage_read_lock(). Otherwise, the page we lock
* may no longer belong to the zspage. This means that we may wait for
* the wrong page to unlock, so we must take a reference to the page
- * prior to waiting for it to unlock outside migrate_read_lock().
+ * prior to waiting for it to unlock outside zspage_read_lock().
*/
while (1) {
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
zpdesc = get_first_zpdesc(zspage);
if (zpdesc_trylock(zpdesc))
break;
zpdesc_get(zpdesc);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
zpdesc_wait_locked(zpdesc);
zpdesc_put(zpdesc);
}
@@ -1728,41 +1783,16 @@ static void lock_zspage(struct zspage *zspage)
curr_zpdesc = zpdesc;
} else {
zpdesc_get(zpdesc);
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
zpdesc_wait_locked(zpdesc);
zpdesc_put(zpdesc);
- migrate_read_lock(zspage);
+ zspage_read_lock(zspage);
}
}
- migrate_read_unlock(zspage);
+ zspage_read_unlock(zspage);
}
#endif /* CONFIG_COMPACTION */
-static void migrate_lock_init(struct zspage *zspage)
-{
- rwlock_init(&zspage->lock);
-}
-
-static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock)
-{
- read_lock(&zspage->lock);
-}
-
-static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
-{
- read_unlock(&zspage->lock);
-}
-
-static void migrate_write_lock(struct zspage *zspage)
-{
- write_lock(&zspage->lock);
-}
-
-static void migrate_write_unlock(struct zspage *zspage)
-{
- write_unlock(&zspage->lock);
-}
-
#ifdef CONFIG_COMPACTION
static const struct movable_operations zsmalloc_mops;
@@ -1804,7 +1834,7 @@ static bool zs_page_isolate(struct page *page, isolate_mode_t mode)
}
static int zs_page_migrate(struct page *newpage, struct page *page,
- enum migrate_mode mode)
+ enum migrate_mode mode)
{
struct zs_pool *pool;
struct size_class *class;
@@ -1820,15 +1850,12 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
VM_BUG_ON_PAGE(!zpdesc_is_isolated(zpdesc), zpdesc_page(zpdesc));
- /* We're committed, tell the world that this is a Zsmalloc page. */
- __zpdesc_set_zsmalloc(newzpdesc);
-
/* The page is locked, so this pointer must remain valid */
zspage = get_zspage(zpdesc);
pool = zspage->pool;
/*
- * The pool migrate_lock protects the race between zpage migration
+ * The pool->migrate_lock protects the race between zpage migration
* and zs_free.
*/
pool_write_lock(pool);
@@ -1838,8 +1865,15 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
* the class lock protects zpage alloc/free in the zspage.
*/
size_class_lock(class);
- /* the migrate_write_lock protects zpage access via zs_map_object */
- migrate_write_lock(zspage);
+ /* the zspage write_lock protects zpage access via zs_map_object */
+ if (!zspage_try_write_lock(zspage)) {
+ size_class_unlock(class);
+ pool_write_unlock(pool);
+ return -EINVAL;
+ }
+
+ /* We're committed, tell the world that this is a Zsmalloc page. */
+ __zpdesc_set_zsmalloc(newzpdesc);
offset = get_first_obj_offset(zpdesc);
s_addr = kmap_local_zpdesc(zpdesc);
@@ -1870,7 +1904,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
*/
pool_write_unlock(pool);
size_class_unlock(class);
- migrate_write_unlock(zspage);
+ zspage_write_unlock(zspage);
zpdesc_get(newzpdesc);
if (zpdesc_zone(newzpdesc) != zpdesc_zone(zpdesc)) {
@@ -2006,9 +2040,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
if (!src_zspage)
break;
- migrate_write_lock(src_zspage);
+ if (!zspage_try_write_lock(src_zspage))
+ break;
+
migrate_zspage(pool, src_zspage, dst_zspage);
- migrate_write_unlock(src_zspage);
+ zspage_write_unlock(src_zspage);
fg = putback_zspage(class, src_zspage);
if (fg == ZS_INUSE_RATIO_0) {
--
2.48.1.262.g85cc9f2d1e-goog
* [PATCHv1 4/6] zsmalloc: introduce new object mapping API
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
` (2 preceding siblings ...)
2025-01-29 6:43 ` [PATCHv1 3/6] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
@ 2025-01-29 6:43 ` Sergey Senozhatsky
2025-01-29 17:31 ` Yosry Ahmed
2025-01-29 6:43 ` [PATCHv1 5/6] zram: switch to new zsmalloc " Sergey Senozhatsky
` (2 subsequent siblings)
6 siblings, 1 reply; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 6:43 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel, Sergey Senozhatsky
The current object mapping API is a little cumbersome. First, it's
inconsistent: sometimes it returns with page-faults disabled and
sometimes with page-faults enabled. Second, and most importantly,
it enforces atomicity restrictions on its users. zs_map_object()
has to return a linear object address, which is not always possible
because some objects span multiple physical (non-contiguous)
pages. For such objects zsmalloc uses a per-CPU buffer into which
the object's data is copied before a pointer to that per-CPU buffer
is returned to the caller. This leads to another, final,
issue - an extra memcpy(). Since the caller gets a pointer to the
per-CPU buffer it can memcpy() data only to that buffer, and during
zs_unmap_object() zsmalloc will memcpy() from that per-CPU buffer to
the physical pages that the object in question spans.
New API splits functions by access mode:
- zs_obj_read_begin(handle, local_copy)
Returns a pointer to handle memory. For objects that span two
physical pages a local_copy buffer is used to store object's
data before the address is returned to the caller. Otherwise
the object's page is kmap_local mapped directly.
- zs_obj_read_end(handle, buf)
Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
- zs_obj_write(handle, buf, len)
Copies len-bytes from compression buffer to handle memory
(takes care of objects that span two pages). This does not
need any additional (e.g. per-CPU) buffers and writes the data
directly to zsmalloc pool pages.
The old API will stay around until the remaining users switch
to the new one. After that we'll also remove zsmalloc per-CPU
buffer and CPU hotplug handling.
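For reference, on the zram side (see patch 5) a write then becomes a
single call, with no mapping and no extra memcpy():
	zs_obj_write(zram->mem_pool, handle, zstrm->buffer, comp_len);
replacing the previous zs_map_object() + memcpy() + zs_unmap_object()
sequence.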
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
include/linux/zsmalloc.h | 8 +++
mm/zsmalloc.c | 129 +++++++++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+)
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..625adae8e547 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -58,4 +58,12 @@ unsigned long zs_compact(struct zs_pool *pool);
unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
+
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem);
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+ void *local_copy);
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem, size_t mem_len);
+
#endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 8f4011713bc8..0e21bc57470b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1371,6 +1371,135 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
}
EXPORT_SYMBOL_GPL(zs_unmap_object);
+void zs_obj_write(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem, size_t mem_len)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+
+ WARN_ON(in_interrupt());
+
+ /* Guarantee we can get zspage from handle safely */
+ pool_read_lock(pool);
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+
+ /* Make sure migration doesn't move any pages in this zspage */
+ zspage_read_lock(zspage);
+ pool_read_unlock(pool);
+
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ /* this object is contained entirely within a page */
+ void *dst = kmap_local_zpdesc(zpdesc);
+
+ if (!ZsHugePage(zspage))
+ off += ZS_HANDLE_SIZE;
+ memcpy(dst + off, handle_mem, mem_len);
+ kunmap_local(dst);
+ } else {
+ size_t sizes[2];
+
+ /* this object spans two pages */
+ off += ZS_HANDLE_SIZE;
+ sizes[0] = PAGE_SIZE - off;
+ sizes[1] = mem_len - sizes[0];
+
+ memcpy_to_page(zpdesc_page(zpdesc), off,
+ handle_mem, sizes[0]);
+ zpdesc = get_next_zpdesc(zpdesc);
+ memcpy_to_page(zpdesc_page(zpdesc), 0,
+ handle_mem + sizes[0], sizes[1]);
+ }
+
+ zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_write);
+
+void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
+ void *handle_mem)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ if (!ZsHugePage(zspage))
+ off += ZS_HANDLE_SIZE;
+ handle_mem -= off;
+ kunmap_local(handle_mem);
+ }
+
+ zspage_read_unlock(zspage);
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_end);
+
+void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
+ void *local_copy)
+{
+ struct zspage *zspage;
+ struct zpdesc *zpdesc;
+ unsigned long obj, off;
+ unsigned int obj_idx;
+ struct size_class *class;
+ void *addr;
+
+ WARN_ON(in_interrupt());
+
+ /* Guarantee we can get zspage from handle safely */
+ pool_read_lock(pool);
+ obj = handle_to_obj(handle);
+ obj_to_location(obj, &zpdesc, &obj_idx);
+ zspage = get_zspage(zpdesc);
+
+ /* Make sure migration doesn't move any pages in this zspage */
+ zspage_read_lock(zspage);
+ pool_read_unlock(pool);
+
+ class = zspage_class(pool, zspage);
+ off = offset_in_page(class->size * obj_idx);
+
+ if (off + class->size <= PAGE_SIZE) {
+ /* this object is contained entirely within a page */
+ addr = kmap_local_zpdesc(zpdesc);
+ addr += off;
+ } else {
+ size_t sizes[2];
+
+ /* this object spans two pages */
+ sizes[0] = PAGE_SIZE - off;
+ sizes[1] = class->size - sizes[0];
+ addr = local_copy;
+
+ memcpy_from_page(addr, zpdesc_page(zpdesc),
+ off, sizes[0]);
+ zpdesc = get_next_zpdesc(zpdesc);
+ memcpy_from_page(addr + sizes[0],
+ zpdesc_page(zpdesc),
+ 0, sizes[1]);
+ }
+
+ if (!ZsHugePage(zspage))
+ addr += ZS_HANDLE_SIZE;
+
+ return addr;
+}
+EXPORT_SYMBOL_GPL(zs_obj_read_begin);
+
/**
* zs_huge_class_size() - Returns the size (in bytes) of the first huge
* zsmalloc &size_class.
--
2.48.1.262.g85cc9f2d1e-goog
* [PATCHv1 5/6] zram: switch to new zsmalloc object mapping API
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
` (3 preceding siblings ...)
2025-01-29 6:43 ` [PATCHv1 4/6] zsmalloc: introduce new object mapping API Sergey Senozhatsky
@ 2025-01-29 6:43 ` Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 6/6] zram: add might_sleep to zcomp API Sergey Senozhatsky
2025-01-29 15:53 ` [PATCHv1 0/6] zsmalloc: preemptible object mapping Yosry Ahmed
6 siblings, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 6:43 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel, Sergey Senozhatsky
Use the new read/write zsmalloc object API. For the case when an
RO-mapped object spans two physical pages (which requires a temporary
buffer), each compression stream now carries around one extra page
(the ->local_copy buffer).
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 4 +++-
drivers/block/zram/zcomp.h | 2 ++
drivers/block/zram/zram_drv.c | 28 ++++++++++------------------
3 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index efd5919808d9..675f2a51ad5f 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -45,6 +45,7 @@ static const struct zcomp_ops *backends[] = {
static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *strm)
{
comp->ops->destroy_ctx(&strm->ctx);
+ vfree(strm->local_copy);
vfree(strm->buffer);
kfree(strm);
}
@@ -66,12 +67,13 @@ static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp)
return NULL;
}
+ strm->local_copy = vzalloc(PAGE_SIZE);
/*
* allocate 2 pages. 1 for compressed data, plus 1 extra in case if
* compressed data is larger than the original one.
*/
strm->buffer = vzalloc(2 * PAGE_SIZE);
- if (!strm->buffer) {
+ if (!strm->buffer || !strm->local_copy) {
zcomp_strm_free(comp, strm);
return NULL;
}
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index 62330829db3f..9683d4aa822d 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -34,6 +34,8 @@ struct zcomp_strm {
struct list_head entry;
/* compression buffer */
void *buffer;
+ /* local copy of handle memory */
+ void *local_copy;
struct zcomp_ctx ctx;
};
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index ad3e8885b0d2..d73e8374e9cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1562,11 +1562,11 @@ static int read_incompressible_page(struct zram *zram, struct page *page,
void *src, *dst;
handle = zram_get_handle(zram, index);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_obj_read_begin(zram->mem_pool, handle, NULL);
dst = kmap_local_page(page);
copy_page(dst, src);
kunmap_local(dst);
- zs_unmap_object(zram->mem_pool, handle);
+ zs_obj_read_end(zram->mem_pool, handle, src);
return 0;
}
@@ -1584,11 +1584,11 @@ static int read_compressed_page(struct zram *zram, struct page *page, u32 index)
prio = zram_get_priority(zram, index);
zstrm = zcomp_stream_get(zram->comps[prio]);
- src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+ src = zs_obj_read_begin(zram->mem_pool, handle, zstrm->local_copy);
dst = kmap_local_page(page);
ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, dst);
kunmap_local(dst);
- zs_unmap_object(zram->mem_pool, handle);
+ zs_obj_read_end(zram->mem_pool, handle, src);
zcomp_stream_put(zram->comps[prio], zstrm);
return ret;
@@ -1684,7 +1684,7 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
u32 index)
{
unsigned long handle;
- void *src, *dst;
+ void *src;
/*
* This function is called from preemptible context so we don't need
@@ -1701,11 +1701,9 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
return -ENOMEM;
}
- dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
src = kmap_local_page(page);
- memcpy(dst, src, PAGE_SIZE);
+ zs_obj_write(zram->mem_pool, handle, src, PAGE_SIZE);
kunmap_local(src);
- zs_unmap_object(zram->mem_pool, handle);
zram_slot_write_lock(zram, index);
zram_set_flag(zram, index, ZRAM_HUGE);
@@ -1726,7 +1724,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
int ret = 0;
unsigned long handle;
unsigned int comp_len;
- void *dst, *mem;
+ void *mem;
struct zcomp_strm *zstrm;
unsigned long element;
bool same_filled;
@@ -1769,11 +1767,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
return -ENOMEM;
}
- dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
-
- memcpy(dst, zstrm->buffer, comp_len);
+ zs_obj_write(zram->mem_pool, handle, zstrm->buffer, comp_len);
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP], zstrm);
- zs_unmap_object(zram->mem_pool, handle);
zram_slot_write_lock(zram, index);
zram_set_handle(zram, index, handle);
@@ -1882,7 +1877,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
unsigned int class_index_old;
unsigned int class_index_new;
u32 num_recomps = 0;
- void *src, *dst;
+ void *src;
int ret;
handle_old = zram_get_handle(zram, index);
@@ -2020,12 +2015,9 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
return 0;
}
- dst = zs_map_object(zram->mem_pool, handle_new, ZS_MM_WO);
- memcpy(dst, zstrm->buffer, comp_len_new);
+ zs_obj_write(zram->mem_pool, handle_new, zstrm->buffer, comp_len_new);
zcomp_stream_put(zram->comps[prio], zstrm);
- zs_unmap_object(zram->mem_pool, handle_new);
-
zram_free_page(zram, index);
zram_set_handle(zram, index, handle_new);
zram_set_obj_size(zram, index, comp_len_new);
--
2.48.1.262.g85cc9f2d1e-goog
* [PATCHv1 6/6] zram: add might_sleep to zcomp API
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
` (4 preceding siblings ...)
2025-01-29 6:43 ` [PATCHv1 5/6] zram: switch to new zsmalloc " Sergey Senozhatsky
@ 2025-01-29 6:43 ` Sergey Senozhatsky
2025-01-29 15:53 ` [PATCHv1 0/6] zsmalloc: preemptible object mapping Yosry Ahmed
6 siblings, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 6:43 UTC (permalink / raw)
To: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel, Sergey Senozhatsky
Explicitly state that zcomp compress/decompress must be
called from non-atomic context.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
drivers/block/zram/zcomp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 675f2a51ad5f..f4235735787b 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -185,6 +185,7 @@ int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
};
int ret;
+ might_sleep();
ret = comp->ops->compress(comp->params, &zstrm->ctx, &req);
if (!ret)
*dst_len = req.dst_len;
@@ -201,6 +202,7 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
.dst_len = PAGE_SIZE,
};
+ might_sleep();
return comp->ops->decompress(comp->params, &zstrm->ctx, &req);
}
--
2.48.1.262.g85cc9f2d1e-goog
* Re: [PATCHv1 3/6] zsmalloc: make zspage lock preemptible
2025-01-29 6:43 ` [PATCHv1 3/6] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
@ 2025-01-29 11:25 ` Sergey Senozhatsky
2025-01-29 15:22 ` Uros Bizjak
1 sibling, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-29 11:25 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Yosry Ahmed,
Nhat Pham, linux-mm, linux-kernel, Uros Bizjak
JFI
On (25/01/29 15:43), Sergey Senozhatsky wrote:
> +static void zspage_lock_init(struct zspage *zspage)
> +{
> + atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
> +}
> +
> +/*
> + * zspage lock permits preemption on the reader-side (there can be multiple
> + * readers). Writers (exclusive zspage ownership), on the other hand, are
> + * always run in atomic context and cannot spin waiting for a (potentially
> + * preempted) reader to unlock zspage. This, basically, means that writers
> + * can only call write-try-lock and must bail out if it didn't succeed.
> + *
> + * At the same time, writers cannot reschedule under zspage write-lock,
> + * so readers can spin waiting for the writer to unlock zspage.
> + */
> +static void zspage_read_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old;
> +
> + while (1) {
> + old = atomic_read(lock);
> + if (old == ZS_PAGE_WRLOCKED) {
> + cpu_relax();
> + continue;
> + }
> +
> + if (atomic_try_cmpxchg(lock, &old, old + 1))
> + return;
> +
> + cpu_relax();
> + }
> +}
> +
> +static void zspage_read_unlock(struct zspage *zspage)
> +{
> + atomic_dec(&zspage->lock);
> +}
> +
> +static int zspage_try_write_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old = ZS_PAGE_UNLOCKED;
> +
> + preempt_disable();
> + if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
> + return 1;
> +
> + preempt_enable();
> + return 0;
> +}
> +
> +static void zspage_write_unlock(struct zspage *zspage)
> +{
> + atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
> + preempt_enable();
> +}
Below is what I currently have based on (private) feedback from
Uros. No functional changes.
I think I'll wait before sending out v2, given that this is not a fix
or a functional change (the WR-locked case is relatively uncommon; it
only happens when we map an object from a page that is currently
either under migration or compaction).
---
static void zspage_lock_init(struct zspage *zspage)
{
atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
}
static void zspage_read_lock(struct zspage *zspage)
{
atomic_t *lock = &zspage->lock;
int old = atomic_read(lock);
do {
if (old == ZS_PAGE_WRLOCKED) {
cpu_relax();
old = atomic_read(lock);
continue;
}
} while (!atomic_try_cmpxchg(lock, &old, old + 1));
}
static void zspage_read_unlock(struct zspage *zspage)
{
atomic_dec(&zspage->lock);
}
static bool zspage_try_write_lock(struct zspage *zspage)
{
atomic_t *lock = &zspage->lock;
int old = ZS_PAGE_UNLOCKED;
preempt_disable();
if (atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED))
return true;
preempt_enable();
return false;
}
static void zspage_write_unlock(struct zspage *zspage)
{
atomic_set(&zspage->lock, ZS_PAGE_UNLOCKED);
preempt_enable();
}
* Re: [PATCHv1 3/6] zsmalloc: make zspage lock preemptible
2025-01-29 6:43 ` [PATCHv1 3/6] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
2025-01-29 11:25 ` Sergey Senozhatsky
@ 2025-01-29 15:22 ` Uros Bizjak
2025-01-30 3:22 ` Sergey Senozhatsky
1 sibling, 1 reply; 21+ messages in thread
From: Uros Bizjak @ 2025-01-29 15:22 UTC (permalink / raw)
To: Sergey Senozhatsky, Andrew Morton, Minchan Kim, Johannes Weiner,
Yosry Ahmed, Nhat Pham
Cc: linux-mm, linux-kernel
On 29. 01. 25 07:43, Sergey Senozhatsky wrote:
> +/*
> + * zspage lock permits preemption on the reader-side (there can be multiple
> + * readers). Writers (exclusive zspage ownership), on the other hand, are
> + * always run in atomic context and cannot spin waiting for a (potentially
> + * preempted) reader to unlock zspage. This, basically, means that writers
> + * can only call write-try-lock and must bail out if it didn't succeed.
> + *
> + * At the same time, writers cannot reschedule under zspage write-lock,
> + * so readers can spin waiting for the writer to unlock zspage.
> + */
> +static void zspage_read_lock(struct zspage *zspage)
> +{
> + atomic_t *lock = &zspage->lock;
> + int old;
> +
> + while (1) {
> + old = atomic_read(lock);
> + if (old == ZS_PAGE_WRLOCKED) {
> + cpu_relax();
> + continue;
> + }
> +
> + if (atomic_try_cmpxchg(lock, &old, old + 1))
> + return;
> +
> + cpu_relax();
> + }
> +}
Please note that atomic_try_cmpxchg() updates the old variable on
failure, so the whole loop can be rewritten as:
{
atomic_t *lock = &zspage->lock;
int old = atomic_read(lock);
while (1) {
if (old == ZS_PAGE_WRLOCKED) {
cpu_relax();
old = atomic_read(lock);
continue;
}
if (atomic_try_cmpxchg(lock, &old, old + 1))
return;
cpu_relax();
}
}
Please note that cpu_relax() in the cmpxchg() loop is actually harmful
[1] because:
--q--
On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.
--/q--
[1]
https://lore.kernel.org/all/20230113184447.1707316-1-mjguzik@gmail.com/
Based on the above, cpu_relax() should be removed from the loop, which
becomes:
{
atomic_t *lock = &zspage->lock;
int old = atomic_read(lock);
do {
if (old == ZS_PAGE_WRLOCKED) {
cpu_relax();
old = atomic_read(lock);
continue;
}
} while (!atomic_try_cmpxchg(lock, &old, old + 1));
}
> +static int zspage_try_write_lock(struct zspage *zspage)
This function can be declared as bool, returning true/false.
Uros.
* Re: [PATCHv1 0/6] zsmalloc: preemptible object mapping
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
` (5 preceding siblings ...)
2025-01-29 6:43 ` [PATCHv1 6/6] zram: add might_sleep to zcomp API Sergey Senozhatsky
@ 2025-01-29 15:53 ` Yosry Ahmed
2025-01-30 3:13 ` Sergey Senozhatsky
6 siblings, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-29 15:53 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Nhat Pham, linux-mm,
linux-kernel
On Wed, Jan 29, 2025 at 03:43:46PM +0900, Sergey Senozhatsky wrote:
> This is Part II of the series [1] that makes zram read() and write()
> preemptible. This part focuses only on zsmalloc, because zsmalloc
> imposes atomicity restrictions on its users. One notable example is
> the object mapping API, which returns with:
> a) local CPU lock held
> b) zspage rwlock held
>
> First, zsmalloc is converted to use a sleepable RW-"lock" (in fact an
> atomic_t) for zspage migration protection. Second, a new handle mapping
> API is introduced which doesn't use per-CPU buffers (and hence no local
> CPU lock) and does fewer memcpy() calls, but requires users to provide
> a pointer to a temporary buffer for object copy-in (when needed).
> Third, zram is converted to the new zsmalloc mapping API and thus zram
> read() becomes preemptible.
>
> [1] https://lore.kernel.org/linux-mm/20250127072932.1289973-1-senozhatsky@chromium.org
>
> RFC -> v1:
> - Only zspage->lock (leaf-lock for zs_map_object()) is converted
> to a preemptible lock. The rest of the zspool locks remain the
> same (Yosry hated with passion the fact that in the RFC series all
> zspool locks would become preemptible).
Hated is a big word here, I was merely concerned about how the locking
changes would affect performance :P
> - New zs object mapping API (Yosry hated RFC API with passion).
> We now have obj_read_begin()/obj_read_end() and obj_write().
> - obj_write() saves extra memcpy() calls for objects that span two
> physical pages.
> - Dropped zram deferred slot-free-notification handling (I hated
> it with passion)
>
> Sergey Senozhatsky (6):
> zsmalloc: factor out pool locking helpers
> zsmalloc: factor out size-class locking helpers
> zsmalloc: make zspage lock preemptible
> zsmalloc: introduce new object mapping API
> zram: switch to new zsmalloc object mapping API
> zram: add might_sleep to zcomp API
>
> drivers/block/zram/zcomp.c | 6 +-
> drivers/block/zram/zcomp.h | 2 +
> drivers/block/zram/zram_drv.c | 28 +--
> include/linux/zsmalloc.h | 8 +
> mm/zsmalloc.c | 372 ++++++++++++++++++++++++++--------
> 5 files changed, 311 insertions(+), 105 deletions(-)
>
> --
> 2.48.1.262.g85cc9f2d1e-goog
>
* Re: [PATCHv1 1/6] zsmalloc: factor out pool locking helpers
2025-01-29 6:43 ` [PATCHv1 1/6] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
@ 2025-01-29 16:59 ` Yosry Ahmed
2025-01-30 4:01 ` Sergey Senozhatsky
0 siblings, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-29 16:59 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Nhat Pham, linux-mm,
linux-kernel
On Wed, Jan 29, 2025 at 03:43:47PM +0900, Sergey Senozhatsky wrote:
> We currently have a mix of migrate_{read,write}_lock() helpers
> that lock zspages, but it's zs_pool that actually has a ->migrate_lock,
> access to which is open-coded. Factor out pool migrate locking into
> helpers; the zspage migration locking API will be renamed later to
> reduce confusion.
But why did we remove "migrate" from the helpers' names if this is the
real migrate lock? It seems like the real problem is how the zspage lock
helpers are named, which is addressed later.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
> mm/zsmalloc.c | 56 +++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 41 insertions(+), 15 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 817626a351f8..2f8a2b139919 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -204,7 +204,8 @@ struct link_free {
> };
>
> struct zs_pool {
> - const char *name;
> + /* protect page/zspage migration */
> + rwlock_t migrate_lock;
>
> struct size_class *size_class[ZS_SIZE_CLASSES];
> struct kmem_cache *handle_cachep;
> @@ -213,6 +214,7 @@ struct zs_pool {
> atomic_long_t pages_allocated;
>
> struct zs_pool_stats stats;
> + atomic_t compaction_in_progress;
>
> /* Compact classes */
> struct shrinker *shrinker;
> @@ -223,11 +225,35 @@ struct zs_pool {
> #ifdef CONFIG_COMPACTION
> struct work_struct free_work;
> #endif
> - /* protect page/zspage migration */
> - rwlock_t migrate_lock;
> - atomic_t compaction_in_progress;
> +
> + const char *name;
Are the struct moves relevant to the patch or just to improve
readability? I am generally scared to move hot members around
unnecessarily due to cache-line sharing problems -- although I don't
know if these are really hot.
> };
>
> +static void pool_write_unlock(struct zs_pool *pool)
> +{
> + write_unlock(&pool->migrate_lock);
> +}
> +
> +static void pool_write_lock(struct zs_pool *pool)
> +{
> + write_lock(&pool->migrate_lock);
> +}
> +
> +static void pool_read_unlock(struct zs_pool *pool)
> +{
> + read_unlock(&pool->migrate_lock);
> +}
> +
> +static void pool_read_lock(struct zs_pool *pool)
> +{
> + read_lock(&pool->migrate_lock);
> +}
> +
> +static bool pool_lock_is_contended(struct zs_pool *pool)
> +{
> + return rwlock_is_contended(&pool->migrate_lock);
> +}
> +
> static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> {
> SetPagePrivate(zpdesc_page(zpdesc));
> @@ -1206,7 +1232,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> BUG_ON(in_interrupt());
>
> /* It guarantees it can get zspage from handle safely */
> - read_lock(&pool->migrate_lock);
> + pool_read_lock(pool);
> obj = handle_to_obj(handle);
> obj_to_location(obj, &zpdesc, &obj_idx);
> zspage = get_zspage(zpdesc);
> @@ -1218,7 +1244,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> * which is smaller granularity.
> */
> migrate_read_lock(zspage);
> - read_unlock(&pool->migrate_lock);
> + pool_read_unlock(pool);
>
> class = zspage_class(pool, zspage);
> off = offset_in_page(class->size * obj_idx);
> @@ -1453,13 +1479,13 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> * The pool->migrate_lock protects the race with zpage's migration
> * so it's safe to get the page from handle.
> */
> - read_lock(&pool->migrate_lock);
> + pool_read_lock(pool);
> obj = handle_to_obj(handle);
> obj_to_zpdesc(obj, &f_zpdesc);
> zspage = get_zspage(f_zpdesc);
> class = zspage_class(pool, zspage);
> spin_lock(&class->lock);
> - read_unlock(&pool->migrate_lock);
> + pool_read_unlock(pool);
>
> class_stat_sub(class, ZS_OBJS_INUSE, 1);
> obj_free(class->size, obj);
> @@ -1796,7 +1822,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> * The pool migrate_lock protects the race between zpage migration
> * and zs_free.
> */
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> class = zspage_class(pool, zspage);
>
> /*
> @@ -1833,7 +1859,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> * Since we complete the data copy and set up new zspage structure,
> * it's okay to release migration_lock.
> */
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
> spin_unlock(&class->lock);
> migrate_write_unlock(zspage);
>
> @@ -1956,7 +1982,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * protect the race between zpage migration and zs_free
> * as well as zpage allocation/free
> */
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> spin_lock(&class->lock);
> while (zs_can_compact(class)) {
> int fg;
> @@ -1983,14 +2009,14 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> src_zspage = NULL;
>
> if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> - || rwlock_is_contended(&pool->migrate_lock)) {
> + || pool_lock_is_contended(pool)) {
> putback_zspage(class, dst_zspage);
> dst_zspage = NULL;
>
> spin_unlock(&class->lock);
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
> cond_resched();
> - write_lock(&pool->migrate_lock);
> + pool_write_lock(pool);
> spin_lock(&class->lock);
> }
> }
> @@ -2002,7 +2028,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> putback_zspage(class, dst_zspage);
>
> spin_unlock(&class->lock);
> - write_unlock(&pool->migrate_lock);
> + pool_write_unlock(pool);
>
> return pages_freed;
> }
> --
> 2.48.1.262.g85cc9f2d1e-goog
>
* Re: [PATCHv1 2/6] zsmalloc: factor out size-class locking helpers
2025-01-29 6:43 ` [PATCHv1 2/6] zsmalloc: factor out size-class " Sergey Senozhatsky
@ 2025-01-29 17:01 ` Yosry Ahmed
0 siblings, 0 replies; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-29 17:01 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Nhat Pham, linux-mm,
linux-kernel
On Wed, Jan 29, 2025 at 03:43:48PM +0900, Sergey Senozhatsky wrote:
> Move open-coded size-class locking to dedicated helpers.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> mm/zsmalloc.c | 47 ++++++++++++++++++++++++++++-------------------
> 1 file changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 2f8a2b139919..0f575307675d 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -254,6 +254,16 @@ static bool pool_lock_is_contended(struct zs_pool *pool)
> return rwlock_is_contended(&pool->migrate_lock);
> }
>
> +static void size_class_lock(struct size_class *class)
> +{
> + spin_lock(&class->lock);
> +}
> +
> +static void size_class_unlock(struct size_class *class)
> +{
> + spin_unlock(&class->lock);
> +}
> +
> static inline void zpdesc_set_first(struct zpdesc *zpdesc)
> {
> SetPagePrivate(zpdesc_page(zpdesc));
> @@ -614,8 +624,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
> if (class->index != i)
> continue;
>
> - spin_lock(&class->lock);
> -
> + size_class_lock(class);
> seq_printf(s, " %5u %5u ", i, class->size);
> for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
> inuse_totals[fg] += class_stat_read(class, fg);
> @@ -625,7 +634,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
> obj_allocated = class_stat_read(class, ZS_OBJS_ALLOCATED);
> obj_used = class_stat_read(class, ZS_OBJS_INUSE);
> freeable = zs_can_compact(class);
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
>
> objs_per_zspage = class->objs_per_zspage;
> pages_used = obj_allocated / objs_per_zspage *
> @@ -1400,7 +1409,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> class = pool->size_class[get_size_class_index(size)];
>
> /* class->lock effectively protects the zpage migration */
> - spin_lock(&class->lock);
> + size_class_lock(class);
> zspage = find_get_zspage(class);
> if (likely(zspage)) {
> obj_malloc(pool, zspage, handle);
> @@ -1411,7 +1420,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> goto out;
> }
>
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
>
> zspage = alloc_zspage(pool, class, gfp);
> if (!zspage) {
> @@ -1419,7 +1428,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> return (unsigned long)ERR_PTR(-ENOMEM);
> }
>
> - spin_lock(&class->lock);
> + size_class_lock(class);
> obj_malloc(pool, zspage, handle);
> newfg = get_fullness_group(class, zspage);
> insert_zspage(class, zspage, newfg);
> @@ -1430,7 +1439,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> /* We completely set up zspage so mark them as movable */
> SetZsPageMovable(pool, zspage);
> out:
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
>
> return handle;
> }
> @@ -1484,7 +1493,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> obj_to_zpdesc(obj, &f_zpdesc);
> zspage = get_zspage(f_zpdesc);
> class = zspage_class(pool, zspage);
> - spin_lock(&class->lock);
> + size_class_lock(class);
> pool_read_unlock(pool);
>
> class_stat_sub(class, ZS_OBJS_INUSE, 1);
> @@ -1494,7 +1503,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> if (fullness == ZS_INUSE_RATIO_0)
> free_zspage(pool, class, zspage);
>
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
> cache_free_handle(pool, handle);
> }
> EXPORT_SYMBOL_GPL(zs_free);
> @@ -1828,7 +1837,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> /*
> * the class lock protects zpage alloc/free in the zspage.
> */
> - spin_lock(&class->lock);
> + size_class_lock(class);
> /* the migrate_write_lock protects zpage access via zs_map_object */
> migrate_write_lock(zspage);
>
> @@ -1860,7 +1869,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
> * it's okay to release migration_lock.
> */
> pool_write_unlock(pool);
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
> migrate_write_unlock(zspage);
>
> zpdesc_get(newzpdesc);
> @@ -1904,10 +1913,10 @@ static void async_free_zspage(struct work_struct *work)
> if (class->index != i)
> continue;
>
> - spin_lock(&class->lock);
> + size_class_lock(class);
> list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0],
> &free_pages);
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
> }
>
> list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
> @@ -1915,10 +1924,10 @@ static void async_free_zspage(struct work_struct *work)
> lock_zspage(zspage);
>
> class = zspage_class(pool, zspage);
> - spin_lock(&class->lock);
> + size_class_lock(class);
> class_stat_sub(class, ZS_INUSE_RATIO_0, 1);
> __free_zspage(pool, class, zspage);
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
> }
> };
>
> @@ -1983,7 +1992,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * as well as zpage allocation/free
> */
> pool_write_lock(pool);
> - spin_lock(&class->lock);
> + size_class_lock(class);
> while (zs_can_compact(class)) {
> int fg;
>
> @@ -2013,11 +2022,11 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> putback_zspage(class, dst_zspage);
> dst_zspage = NULL;
>
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
> pool_write_unlock(pool);
> cond_resched();
> pool_write_lock(pool);
> - spin_lock(&class->lock);
> + size_class_lock(class);
> }
> }
>
> @@ -2027,7 +2036,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> if (dst_zspage)
> putback_zspage(class, dst_zspage);
>
> - spin_unlock(&class->lock);
> + size_class_unlock(class);
> pool_write_unlock(pool);
>
> return pages_freed;
> --
> 2.48.1.262.g85cc9f2d1e-goog
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 4/6] zsmalloc: introduce new object mapping API
2025-01-29 6:43 ` [PATCHv1 4/6] zsmalloc: introduce new object mapping API Sergey Senozhatsky
@ 2025-01-29 17:31 ` Yosry Ahmed
2025-01-30 3:21 ` Sergey Senozhatsky
0 siblings, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-29 17:31 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Nhat Pham, linux-mm,
linux-kernel
On Wed, Jan 29, 2025 at 03:43:50PM +0900, Sergey Senozhatsky wrote:
> Current object mapping API is a little cumbersome. First, it's
> inconsistent, sometimes it returns with page-faults disabled and
> sometimes with page-faults enabled. Second, and most importantly,
> it enforces atomicity restrictions on its users. zs_map_object()
> has to return a linear object address, which is not always possible
> because some objects span multiple physical (non-contiguous)
> pages. For such objects zsmalloc uses a per-CPU buffer to which
> object's data is copied before a pointer to that per-CPU buffer
> is returned back to the caller. This leads to another, final,
> issue - extra memcpy(). Since the caller gets a pointer to
> per-CPU buffer it can memcpy() data only to that buffer, and
> during zs_unmap_object() zsmalloc will memcpy() from that per-CPU
> buffer to physical pages that object in question spans across.
>
> New API splits functions by access mode:
> - zs_obj_read_begin(handle, local_copy)
> Returns a pointer to handle memory. For objects that span two
> physical pages a local_copy buffer is used to store object's
> data before the address is returned to the caller. Otherwise
> the object's page is kmap_local mapped directly.
>
> - zs_obj_read_end(handle, buf)
> Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
>
> - zs_obj_write(handle, buf, len)
> Copies len-bytes from compression buffer to handle memory
> (takes care of objects that span two pages). This does not
> need any additional (e.g. per-CPU) buffers and writes the data
> directly to zsmalloc pool pages.
>
> The old API will stay around until the remaining users switch
> to the new one. After that we'll also remove zsmalloc per-CPU
> buffer and CPU hotplug handling.
I will propose removing zbud (in addition to z3fold) soon. If that gets
in then we'd only need to update zpool and zswap code to use the new
API. I can take care of that if you want.
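For reference, the caller-side pattern for the new calls, as described in
the quoted changelog, would look roughly like the sketch below; the
"scratch", "comp_buf", "comp_len" and decompress() names are illustrative,
not part of the patch, and "scratch" must be at least as large as the
object's size class:

	void *mem;

	/* read: may return a kmap'ed pointer or the scratch copy */
	mem = zs_obj_read_begin(pool, handle, scratch);
	decompress(dst_page, mem, comp_len);
	zs_obj_read_end(pool, handle, mem);

	/* write: copies comp_len bytes straight into the pool pages */
	zs_obj_write(pool, handle, comp_buf, comp_len);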
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
I have a couple of questions below, but generally LGTM:
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> ---
> include/linux/zsmalloc.h | 8 +++
> mm/zsmalloc.c | 129 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 137 insertions(+)
>
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index a48cd0ffe57d..625adae8e547 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -58,4 +58,12 @@ unsigned long zs_compact(struct zs_pool *pool);
> unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);
>
> void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
> +
> +void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> + void *handle_mem);
> +void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> + void *local_copy);
Nit: Any reason to put 'end' before 'begin'? Same for the function
definitions.
> +void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> + void *handle_mem, size_t mem_len);
> +
> #endif
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 8f4011713bc8..0e21bc57470b 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1371,6 +1371,135 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
> }
> EXPORT_SYMBOL_GPL(zs_unmap_object);
>
> +void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> + void *handle_mem, size_t mem_len)
> +{
> + struct zspage *zspage;
> + struct zpdesc *zpdesc;
> + unsigned long obj, off;
> + unsigned int obj_idx;
> + struct size_class *class;
> +
> + WARN_ON(in_interrupt());
> +
> + /* Guarantee we can get zspage from handle safely */
> + pool_read_lock(pool);
> + obj = handle_to_obj(handle);
> + obj_to_location(obj, &zpdesc, &obj_idx);
> + zspage = get_zspage(zpdesc);
> +
> + /* Make sure migration doesn't move any pages in this zspage */
> + zspage_read_lock(zspage);
> + pool_read_unlock(pool);
> +
> + class = zspage_class(pool, zspage);
> + off = offset_in_page(class->size * obj_idx);
> +
> + if (off + class->size <= PAGE_SIZE) {
> + /* this object is contained entirely within a page */
> + void *dst = kmap_local_zpdesc(zpdesc);
> +
> + if (!ZsHugePage(zspage))
> + off += ZS_HANDLE_SIZE;
> + memcpy(dst + off, handle_mem, mem_len);
> + kunmap_local(dst);
> + } else {
> + size_t sizes[2];
> +
> + /* this object spans two pages */
> + off += ZS_HANDLE_SIZE;
Are huge pages always stored in a single page? If yes, can we just do
this before the if block for both cases:
if (!ZsHugePage(zspage))
off += ZS_HANDLE_SIZE;
> + sizes[0] = PAGE_SIZE - off;
> + sizes[1] = mem_len - sizes[0];
> +
> + memcpy_to_page(zpdesc_page(zpdesc), off,
> + handle_mem, sizes[0]);
> + zpdesc = get_next_zpdesc(zpdesc);
> + memcpy_to_page(zpdesc_page(zpdesc), 0,
> + handle_mem + sizes[0], sizes[1]);
> + }
> +
> + zspage_read_unlock(zspage);
> +}
> +EXPORT_SYMBOL_GPL(zs_obj_write);
> +
> +void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> + void *handle_mem)
> +{
> + struct zspage *zspage;
> + struct zpdesc *zpdesc;
> + unsigned long obj, off;
> + unsigned int obj_idx;
> + struct size_class *class;
> +
> + obj = handle_to_obj(handle);
> + obj_to_location(obj, &zpdesc, &obj_idx);
> + zspage = get_zspage(zpdesc);
> + class = zspage_class(pool, zspage);
> + off = offset_in_page(class->size * obj_idx);
> +
> + if (off + class->size <= PAGE_SIZE) {
> + if (!ZsHugePage(zspage))
> + off += ZS_HANDLE_SIZE;
> + handle_mem -= off;
> + kunmap_local(handle_mem);
> + }
> +
> + zspage_read_unlock(zspage);
> +}
> +EXPORT_SYMBOL_GPL(zs_obj_read_end);
> +
> +void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> + void *local_copy)
> +{
> + struct zspage *zspage;
> + struct zpdesc *zpdesc;
> + unsigned long obj, off;
> + unsigned int obj_idx;
> + struct size_class *class;
> + void *addr;
> +
> + WARN_ON(in_interrupt());
> +
> + /* Guarantee we can get zspage from handle safely */
> + pool_read_lock(pool);
> + obj = handle_to_obj(handle);
> + obj_to_location(obj, &zpdesc, &obj_idx);
> + zspage = get_zspage(zpdesc);
> +
> + /* Make sure migration doesn't move any pages in this zspage */
> + zspage_read_lock(zspage);
> + pool_read_unlock(pool);
> +
> + class = zspage_class(pool, zspage);
> + off = offset_in_page(class->size * obj_idx);
> +
> + if (off + class->size <= PAGE_SIZE) {
> + /* this object is contained entirely within a page */
> + addr = kmap_local_zpdesc(zpdesc);
> + addr += off;
> + } else {
> + size_t sizes[2];
> +
> + /* this object spans two pages */
> + sizes[0] = PAGE_SIZE - off;
> + sizes[1] = class->size - sizes[0];
> + addr = local_copy;
> +
> + memcpy_from_page(addr, zpdesc_page(zpdesc),
> + off, sizes[0]);
> + zpdesc = get_next_zpdesc(zpdesc);
> + memcpy_from_page(addr + sizes[0],
> + zpdesc_page(zpdesc),
> + 0, sizes[1]);
> + }
> +
> + if (!ZsHugePage(zspage))
> + addr += ZS_HANDLE_SIZE;
> +
> + return addr;
> +}
> +EXPORT_SYMBOL_GPL(zs_obj_read_begin);
> +
> /**
> * zs_huge_class_size() - Returns the size (in bytes) of the first huge
> * zsmalloc &size_class.
> --
> 2.48.1.262.g85cc9f2d1e-goog
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 0/6] zsmalloc: preemptible object mapping
2025-01-29 15:53 ` [PATCHv1 0/6] zsmalloc: preemptible object mapping Yosry Ahmed
@ 2025-01-30 3:13 ` Sergey Senozhatsky
0 siblings, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-30 3:13 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, Johannes Weiner,
Nhat Pham, linux-mm, linux-kernel
On (25/01/29 15:53), Yosry Ahmed wrote:
> On Wed, Jan 29, 2025 at 03:43:46PM +0900, Sergey Senozhatsky wrote:
> > This is Part II of the series [1] that makes zram read() and write()
> > preemptible. This part focuses only zsmalloc because zsmalloc imposes
> > atomicity restrictions on its users. One notable example is object
> > mapping API, which returns with:
> > a) local CPU lock held
> > b) zspage rwlock held
> >
> > First, zsmalloc is converted to use sleepable RW-"lock" (it's atomic_t
> > in fact) for zspage migration protection. Second, a new handle mapping
> > is introduced which doesn't use per-CPU buffers (and hence no local CPU
> > lock), does fewer memcpy() calls, but requires users to provide a
> > pointer to temp buffer for object copy-in (when needed). Third, zram is
> > converted to the new zsmalloc mapping API and thus zram read() becomes
> > preemptible.
> >
> > [1] https://lore.kernel.org/linux-mm/20250127072932.1289973-1-senozhatsky@chromium.org
> >
> > RFC -> v1:
> > - Only zspage->lock (leaf-lock for zs_map_object()) is converted
> > to a preemptible lock. The rest of the zspool locks remain the
> > same (Yosry hated with passion the fact that in RFC series all
> > zspool looks would become preemptible).
>
> Hated is a big word here, I was merely concerned about how the locking
> changes would affect performance :P
Yeah I'm just messing around :)
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 4/6] zsmalloc: introduce new object mapping API
2025-01-29 17:31 ` Yosry Ahmed
@ 2025-01-30 3:21 ` Sergey Senozhatsky
2025-01-30 4:17 ` Sergey Senozhatsky
2025-01-30 16:27 ` Yosry Ahmed
0 siblings, 2 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-30 3:21 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, Johannes Weiner,
Nhat Pham, linux-mm, linux-kernel
On (25/01/29 17:31), Yosry Ahmed wrote:
> On Wed, Jan 29, 2025 at 03:43:50PM +0900, Sergey Senozhatsky wrote:
[..]
> > The old API will stay around until the remaining users switch
> > to the new one. After that we'll also remove zsmalloc per-CPU
> > buffer and CPU hotplug handling.
>
> I will propose removing zbud (in addition to z3fold) soon. If that gets
> in then we'd only need to update zpool and zswap code to use the new
> API. I can take care of that if you want.
Sounds like a plan. I think I saw the zbud deprecation patch (along with
the z3fold removal). I guess you still want to keep zpool, just because
it's there already?
> > Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
>
> I have a couple of questions below, but generally LGTM:
>
> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Thanks.
[..]
> > +void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
> > + void *handle_mem);
> > +void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
> > + void *local_copy);
>
> Nit: Any reason to put 'end' before 'begin'? Same for the function
> definitions.
An old habit: I just put release() before init() (or end() before
begin()) because oftentimes you call end() from begin()'s error path.
Not in this particular case, but I just do that semi-automatically.
[..]
> > + if (off + class->size <= PAGE_SIZE) {
> > + /* this object is contained entirely within a page */
> > + void *dst = kmap_local_zpdesc(zpdesc);
> > +
> > + if (!ZsHugePage(zspage))
> > + off += ZS_HANDLE_SIZE;
> > + memcpy(dst + off, handle_mem, mem_len);
> > + kunmap_local(dst);
> > + } else {
> > + size_t sizes[2];
> > +
> > + /* this object spans two pages */
> > + off += ZS_HANDLE_SIZE;
>
> Are huge pages always stored in a single page? If yes, can we just do
> this before the if block for both cases:
Yes.
> if (!ZsHugePage(zspage))
> off += ZS_HANDLE_SIZE;
Looks good.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 3/6] zsmalloc: make zspage lock preemptible
2025-01-29 15:22 ` Uros Bizjak
@ 2025-01-30 3:22 ` Sergey Senozhatsky
0 siblings, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-30 3:22 UTC (permalink / raw)
To: Uros Bizjak
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, Johannes Weiner,
Yosry Ahmed, Nhat Pham, linux-mm, linux-kernel
On (25/01/29 16:22), Uros Bizjak wrote:
> > +static void zspage_read_lock(struct zspage *zspage)
> > +{
> > + atomic_t *lock = &zspage->lock;
> > + int old;
> > +
> > + while (1) {
> > + old = atomic_read(lock);
> > + if (old == ZS_PAGE_WRLOCKED) {
> > + cpu_relax();
> > + continue;
> > + }
> > +
> > + if (atomic_try_cmpxchg(lock, &old, old + 1))
> > + return;
> > +
> > + cpu_relax();
> > + }
> > +}
[..]
> Based on the above, cpu_relax() should be removed from the loop, which
> becomes:
>
> {
> atomic_t *lock = &zspage->lock;
> int old = atomic_read(lock);
>
> do {
> if (old == ZS_PAGE_WRLOCKED) {
> cpu_relax();
> old = atomic_read(lock);
> continue;
> }
>
> } while (!atomic_try_cmpxchg(lock, &old, old + 1));
> }
Ack.
> > +static int zspage_try_write_lock(struct zspage *zspage)
>
> This function can be declared as bool, returning true/false.
Ack. Thank you.
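For completeness, with the same atomic_t scheme as the quoted
zspage_read_lock() (readers count up from zero, a single ZS_PAGE_WRLOCKED
writer state), a bool-returning variant could be as small as the sketch
below; treating 0 as the "unlocked" value is an assumption here, not
something taken from the patch:

	static bool zspage_try_write_lock(struct zspage *zspage)
	{
		atomic_t *lock = &zspage->lock;
		int old = 0;	/* assumed: no readers, no writer */

		/* succeeds only if nobody holds the lock in any mode */
		return atomic_try_cmpxchg(lock, &old, ZS_PAGE_WRLOCKED);
	}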
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 1/6] zsmalloc: factor out pool locking helpers
2025-01-29 16:59 ` Yosry Ahmed
@ 2025-01-30 4:01 ` Sergey Senozhatsky
2025-01-30 16:25 ` Yosry Ahmed
0 siblings, 1 reply; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-30 4:01 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, Johannes Weiner,
Nhat Pham, linux-mm, linux-kernel
On (25/01/29 16:59), Yosry Ahmed wrote:
> On Wed, Jan 29, 2025 at 03:43:47PM +0900, Sergey Senozhatsky wrote:
> > We currently have a mix of migrate_{read,write}_lock() helpers
> > that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> > access to which is opene-coded. Factor out pool migrate locking
> > into helpers, zspage migration locking API will be renamed to
> > reduce confusion.
>
> But why did we remove "migrate" from the helpers' names if this is the
> real migrate lock? It seems like the real problem is how the zspage lock
> helpers are named, which is addressed later.
So this is deliberate. I guess, just like with page-faults in
zs_map_object(), this naming comes from the time when we had fewer
locks and less functionality. These days we have 3 zsmalloc locks that
can "prevent migration" at different points. Even size class spin-lock.
But it's not only about migration anymore. We also now have compaction
(defragmentation) that these locks synchronize. So I want to have a
reader-writer naming scheme.
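Concretely, the three locks in question, with the helper names used in the
quoted hunks, are roughly:

	/*
	 * pool->migrate_lock - pool-level rwlock_t, taken via
	 *                      pool_read_lock()/pool_write_lock()
	 * class->lock        - per size-class spinlock, taken via
	 *                      size_class_lock(), protects zspage
	 *                      alloc/free in the zspage
	 * zspage->lock       - per-zspage atomic_t reader/writer lock,
	 *                      taken via zspage_read_lock() and
	 *                      zspage_try_write_lock()
	 */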
> > struct zs_pool {
> > - const char *name;
> > + /* protect page/zspage migration */
> > + rwlock_t migrate_lock;
> >
> > struct size_class *size_class[ZS_SIZE_CLASSES];
> > struct kmem_cache *handle_cachep;
> > @@ -213,6 +214,7 @@ struct zs_pool {
> > atomic_long_t pages_allocated;
> >
> > struct zs_pool_stats stats;
> > + atomic_t compaction_in_progress;
> >
> > /* Compact classes */
> > struct shrinker *shrinker;
> > @@ -223,11 +225,35 @@ struct zs_pool {
> > #ifdef CONFIG_COMPACTION
> > struct work_struct free_work;
> > #endif
> > - /* protect page/zspage migration */
> > - rwlock_t migrate_lock;
> > - atomic_t compaction_in_progress;
> > +
> > + const char *name;
>
> Are the struct moves relevant to the patch or just to improve
> readability?
Relevance is relative :) That moved from the patch which also
converted the lock to rwsem. I'll move this to a separate
patch.
> I am generally scared to move hot members around unnecessarily
> due to cache-line sharing problems -- although I don't know if
> these are really hot.
Size classes are basically static - once we init the array it
becomes RO. Having migrate_lock at the bottom forces us to
jump over a big RO zs_pool region, besides we never use ->name
(unlike migrate_lock) so having it at offset 0 is unnecessary.
zs_pool_stats and compaction_in_progress are modified only
during compaction, which we do not run in parallel (only one
CPU can do compaction at a time), so we can do something like
---
struct zs_pool {
- const char *name;
+ /* protect page/zspage migration */
+ rwlock_t migrate_lock;
struct size_class *size_class[ZS_SIZE_CLASSES];
- struct kmem_cache *handle_cachep;
- struct kmem_cache *zspage_cachep;
atomic_long_t pages_allocated;
- struct zs_pool_stats stats;
+ struct kmem_cache *handle_cachep;
+ struct kmem_cache *zspage_cachep;
/* Compact classes */
struct shrinker *shrinker;
@@ -223,9 +223,9 @@ struct zs_pool {
#ifdef CONFIG_COMPACTION
struct work_struct free_work;
#endif
- /* protect page/zspage migration */
- rwlock_t migrate_lock;
+ const char *name;
atomic_t compaction_in_progress;
+ struct zs_pool_stats stats;
};
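Applied to the quoted struct, the resulting order would be roughly the
following (members not shown in the diff are omitted here):

struct zs_pool {
	/* hot: taken on object alloc/free/map paths */
	rwlock_t migrate_lock;

	/* read-mostly after init */
	struct size_class *size_class[ZS_SIZE_CLASSES];
	atomic_long_t pages_allocated;
	struct kmem_cache *handle_cachep;
	struct kmem_cache *zspage_cachep;

	/* Compact classes */
	struct shrinker *shrinker;
#ifdef CONFIG_COMPACTION
	struct work_struct free_work;
#endif
	const char *name;
	atomic_t compaction_in_progress;
	struct zs_pool_stats stats;
};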
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 4/6] zsmalloc: introduce new object mapping API
2025-01-30 3:21 ` Sergey Senozhatsky
@ 2025-01-30 4:17 ` Sergey Senozhatsky
2025-01-30 16:27 ` Yosry Ahmed
1 sibling, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-30 4:17 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Nhat Pham, linux-mm,
linux-kernel, Sergey Senozhatsky
On (25/01/30 12:21), Sergey Senozhatsky wrote:
> [..]
> > > + if (off + class->size <= PAGE_SIZE) {
> > > + /* this object is contained entirely within a page */
> > > + void *dst = kmap_local_zpdesc(zpdesc);
> > > +
> > > + if (!ZsHugePage(zspage))
> > > + off += ZS_HANDLE_SIZE;
> > > + memcpy(dst + off, handle_mem, mem_len);
> > > + kunmap_local(dst);
> > > + } else {
> > > + size_t sizes[2];
> > > +
> > > + /* this object spans two pages */
> > > + off += ZS_HANDLE_SIZE;
> >
> > Are huge pages always stored in a single page? If yes, can we just do
> > this before the if block for both cases:
>
> Yes.
>
> > if (!ZsHugePage(zspage))
> > off += ZS_HANDLE_SIZE;
>
> Looks good.
Ah, now I see why I didn't do it: doing off += ZS_HANDLE_SIZE before the
if (off + class->size <= PAGE_SIZE) check messes it up.
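One way to still hoist the adjustment would be to make the split decision
on the raw offset first; below is a sketch of the affected part of
zs_obj_write(), reusing the names from the quoted hunk (the local
spans_two_pages is new, and this relies on huge objects never spanning
pages, as established above):

	bool spans_two_pages;

	off = offset_in_page(class->size * obj_idx);
	/* decide the layout before the handle adjustment */
	spans_two_pages = off + class->size > PAGE_SIZE;
	if (!ZsHugePage(zspage))
		off += ZS_HANDLE_SIZE;

	if (!spans_two_pages) {
		/* this object is contained entirely within a page */
		void *dst = kmap_local_zpdesc(zpdesc);

		memcpy(dst + off, handle_mem, mem_len);
		kunmap_local(dst);
	} else {
		size_t sizes[2];

		/* this object spans two pages */
		sizes[0] = PAGE_SIZE - off;
		sizes[1] = mem_len - sizes[0];
		memcpy_to_page(zpdesc_page(zpdesc), off,
			       handle_mem, sizes[0]);
		zpdesc = get_next_zpdesc(zpdesc);
		memcpy_to_page(zpdesc_page(zpdesc), 0,
			       handle_mem + sizes[0], sizes[1]);
	}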
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 1/6] zsmalloc: factor out pool locking helpers
2025-01-30 4:01 ` Sergey Senozhatsky
@ 2025-01-30 16:25 ` Yosry Ahmed
2025-01-31 3:34 ` Sergey Senozhatsky
0 siblings, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-30 16:25 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Nhat Pham, linux-mm,
linux-kernel
On Thu, Jan 30, 2025 at 01:01:15PM +0900, Sergey Senozhatsky wrote:
> On (25/01/29 16:59), Yosry Ahmed wrote:
> > On Wed, Jan 29, 2025 at 03:43:47PM +0900, Sergey Senozhatsky wrote:
> > > We currently have a mix of migrate_{read,write}_lock() helpers
> > > that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> > > access to which is opene-coded. Factor out pool migrate locking
> > > into helpers, zspage migration locking API will be renamed to
> > > reduce confusion.
> >
> > But why did we remove "migrate" from the helpers' names if this is the
> > real migrate lock? It seems like the real problem is how the zspage lock
> > helpers are named, which is addressed later.
>
> So this is deliberate. I guess, just like with page-faults in
> zs_map_object(), this naming comes from the time when we had fewer
> locks and less functionality. These days we have 3 zsmalloc locks that
> can "prevent migration" at different points. Even size class spin-lock.
> But it's not only about migration anymore. We also now have compaction
> (defragmentation) that these locks synchronize. So I want to have a
> reader-writer naming scheme.
In this case shouldn't we also rename the lock itself?
>
> > > struct zs_pool {
> > > - const char *name;
> > > + /* protect page/zspage migration */
> > > + rwlock_t migrate_lock;
> > >
> > > struct size_class *size_class[ZS_SIZE_CLASSES];
> > > struct kmem_cache *handle_cachep;
> > > @@ -213,6 +214,7 @@ struct zs_pool {
> > > atomic_long_t pages_allocated;
> > >
> > > struct zs_pool_stats stats;
> > > + atomic_t compaction_in_progress;
> > >
> > > /* Compact classes */
> > > struct shrinker *shrinker;
> > > @@ -223,11 +225,35 @@ struct zs_pool {
> > > #ifdef CONFIG_COMPACTION
> > > struct work_struct free_work;
> > > #endif
> > > - /* protect page/zspage migration */
> > > - rwlock_t migrate_lock;
> > > - atomic_t compaction_in_progress;
> > > +
> > > + const char *name;
> >
> > Are the struct moves relevant to the patch or just to improve
> > readability?
>
> Relevance is relative :) That moved from the patch which also
> converted the lock to rwsem. I'll move this to a separate
> patch.
>
> > I am generally scared to move hot members around unnecessarily
> > due to cache-line sharing problems -- although I don't know if
> > these are really hot.
>
> Size classes are basically static - once we init the array it
> becomes RO. Having migrate_lock at the bottom forces us to
> jump over a big RO zs_pool region, besides we never use ->name
> (unlike migrate_lock) so having it at offset 0 is unnecessary.
What's special about offset 0 though?
My understanding is that we want to put RW-mostly and RO-mostly members
apart to land in different cachelines to avoid unnecessary bouncing.
Also we want the elements that are usually accessed together next to one
another.
Placing the RW lock right next to the RO size classes seems like it will
result in unnecessary cacheline bouncing. For example, if a CPU holds
the lock and dirties the cacheline then all other CPUs have to drop it
and fetch it again when accessing the size classes.
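One way to make that separation explicit regardless of member order
(purely as an illustration, not something this series proposes) would be
an explicit alignment annotation on the hot member:

	struct zs_pool {
		/* written by many CPUs: keep on its own cacheline */
		rwlock_t migrate_lock ____cacheline_aligned_in_smp;

		/* read-mostly after init */
		struct size_class *size_class[ZS_SIZE_CLASSES] ____cacheline_aligned_in_smp;

		/* remaining members unchanged */
	};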
> zs_pool_stats and compaction_in_progress are modified only
> during compaction, which we do not run in parallel (only one
> CPU can do compaction at a time), so we can do something like
But other CPUs can read compaction_in_progress, even if they don't
modify it.
>
> ---
>
> struct zs_pool {
> - const char *name;
> + /* protect page/zspage migration */
> + rwlock_t migrate_lock;
>
> struct size_class *size_class[ZS_SIZE_CLASSES];
>
>
> - struct kmem_cache *handle_cachep;
> - struct kmem_cache *zspage_cachep;
>
> atomic_long_t pages_allocated;
>
> - struct zs_pool_stats stats;
> + struct kmem_cache *handle_cachep;
> + struct kmem_cache *zspage_cachep;
>
> /* Compact classes */
> struct shrinker *shrinker;
> @@ -223,9 +223,9 @@ struct zs_pool {
> #ifdef CONFIG_COMPACTION
> struct work_struct free_work;
> #endif
> - /* protect page/zspage migration */
> - rwlock_t migrate_lock;
> + const char *name;
> atomic_t compaction_in_progress;
> + struct zs_pool_stats stats;
> };
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 4/6] zsmalloc: introduce new object mapping API
2025-01-30 3:21 ` Sergey Senozhatsky
2025-01-30 4:17 ` Sergey Senozhatsky
@ 2025-01-30 16:27 ` Yosry Ahmed
1 sibling, 0 replies; 21+ messages in thread
From: Yosry Ahmed @ 2025-01-30 16:27 UTC (permalink / raw)
To: Sergey Senozhatsky
Cc: Andrew Morton, Minchan Kim, Johannes Weiner, Nhat Pham, linux-mm,
linux-kernel
On Thu, Jan 30, 2025 at 12:21:14PM +0900, Sergey Senozhatsky wrote:
> On (25/01/29 17:31), Yosry Ahmed wrote:
> > On Wed, Jan 29, 2025 at 03:43:50PM +0900, Sergey Senozhatsky wrote:
> [..]
> > > The old API will stay around until the remaining users switch
> > > to the new one. After that we'll also remove zsmalloc per-CPU
> > > buffer and CPU hotplug handling.
> >
> > I will propose removing zbud (in addition to z3fold) soon. If that gets
> > in then we'd only need to update zpool and zswap code to use the new
> > API. I can take care of that if you want.
>
> Sounds like a plan. I think I saw zbud deprecation patch (along with z3fold
> removal). I guess you still want to keep zpool, just because it's there
> already?
Now the proposal is to remove zbud right away (patch already sent). If
this lands then our lives become easier.
I am keeping zpool around for now because it is not doing any harm; we
can remove it later. For zbud/z3fold, their presence is a problem due
to bit rotting, and due to having to support new APIs (like this one) in
them if we want to use those APIs unconditionally in zswap.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCHv1 1/6] zsmalloc: factor out pool locking helpers
2025-01-30 16:25 ` Yosry Ahmed
@ 2025-01-31 3:34 ` Sergey Senozhatsky
0 siblings, 0 replies; 21+ messages in thread
From: Sergey Senozhatsky @ 2025-01-31 3:34 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sergey Senozhatsky, Andrew Morton, Minchan Kim, Johannes Weiner,
Nhat Pham, linux-mm, linux-kernel
On (25/01/30 16:25), Yosry Ahmed wrote:
> On Thu, Jan 30, 2025 at 01:01:15PM +0900, Sergey Senozhatsky wrote:
> > On (25/01/29 16:59), Yosry Ahmed wrote:
> > > On Wed, Jan 29, 2025 at 03:43:47PM +0900, Sergey Senozhatsky wrote:
> > > > We currently have a mix of migrate_{read,write}_lock() helpers
> > > > that lock zspages, but it's zs_pool that actually has a ->migrate_lock
> > > > access to which is opene-coded. Factor out pool migrate locking
> > > > into helpers, zspage migration locking API will be renamed to
> > > > reduce confusion.
> > >
> > > But why did we remove "migrate" from the helpers' names if this is the
> > > real migrate lock? It seems like the real problem is how the zspage lock
> > > helpers are named, which is addressed later.
> >
> > So this is deliberate. I guess, just like with page-faults in
> > zs_map_object(), this naming comes from the time when we had fewer
> > locks and less functionality. These days we have 3 zsmalloc locks that
> > can "prevent migration" at different points. Even size class spin-lock.
> > But it's not only about migration anymore. We also now have compaction
> > (defragmentation) that these locks synchronize. So I want to have a
> > reader-writer naming scheme.
>
> In this case shouldn't we also rename the lock itself?
Yeah, I guess. I thought about that for a second but figured that in
the grand scheme of things it wasn't all that important. I'm going to
resend the series (after merging it with the zram series), so I'll rename
it, to pool->lock maybe.
[..]
> > > I am generally scared to move hot members around unnecessarily
> > > due to cache-line sharing problems -- although I don't know if
> > > these are really hot.
> >
> > Size classes are basically static - once we init the array it
> > becomes RO. Having migrate_lock at the bottom forces us to
> > jump over a big RO zs_pool region, besides we never use ->name
> > (unlike migrate_lock) so having it at offset 0 is unnecessary.
>
> What's special about offset 0 though?
>
> My understanding is that we want to put RW-mostly and RO-mostly members
> apart to land in different cachelines to avoid unnecessary bouncing.
> Also we want the elements that are usually accessed together next to one
> another.
>
> Placing the RW lock right next to the RO size classes seems like it will
> result in unnecessary cacheline bouncing. For example, if a CPU holds
> the lock and dirties the cacheline then all other CPUs have to drop it
> and fetch it again when accessing the size classes.
So the idea was that offset 0 sits very close to the low size classes,
which are a) not accessed that often and b) in some cases not accessed
at all. I don't think I've ever seen size-class-32 (the first size class)
being used. For some algorithms, compressing PAGE_SIZE of data down below
32 bytes is practically impossible because of all the internal data that
the algorithm puts into its compression output (checksums, flags, version
numbers, etc.).
> > zs_pool_stats and compaction_in_progress are modified only
> > during compaction, which we do not run in parallel (only one
> > CPU can do compaction at a time), so we can do something like
>
> But other CPUs can read compaction_in_progress, even if they don't
> modify it.
Yes, well, that's the nature of that flag.
I'll drop the patch, I guess. It's not worth our time.
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2025-01-31 3:34 UTC | newest]
Thread overview: 21+ messages
2025-01-29 6:43 [PATCHv1 0/6] zsmalloc: preemptible object mapping Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 1/6] zsmalloc: factor out pool locking helpers Sergey Senozhatsky
2025-01-29 16:59 ` Yosry Ahmed
2025-01-30 4:01 ` Sergey Senozhatsky
2025-01-30 16:25 ` Yosry Ahmed
2025-01-31 3:34 ` Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 2/6] zsmalloc: factor out size-class " Sergey Senozhatsky
2025-01-29 17:01 ` Yosry Ahmed
2025-01-29 6:43 ` [PATCHv1 3/6] zsmalloc: make zspage lock preemptible Sergey Senozhatsky
2025-01-29 11:25 ` Sergey Senozhatsky
2025-01-29 15:22 ` Uros Bizjak
2025-01-30 3:22 ` Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 4/6] zsmalloc: introduce new object mapping API Sergey Senozhatsky
2025-01-29 17:31 ` Yosry Ahmed
2025-01-30 3:21 ` Sergey Senozhatsky
2025-01-30 4:17 ` Sergey Senozhatsky
2025-01-30 16:27 ` Yosry Ahmed
2025-01-29 6:43 ` [PATCHv1 5/6] zram: switch to new zsmalloc " Sergey Senozhatsky
2025-01-29 6:43 ` [PATCHv1 6/6] zram: add might_sleep to zcomp API Sergey Senozhatsky
2025-01-29 15:53 ` [PATCHv1 0/6] zsmalloc: preemptible object mapping Yosry Ahmed
2025-01-30 3:13 ` Sergey Senozhatsky