[PATCH 00/13] mm, swap: rework of swap allocator locks


* [PATCH 00/13] mm, swap: rework of swap allocator locks
@ 2024-10-22 19:24 Kairui Song
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

This series greatly improves swap allocator performance by reworking
the locking design and simplifying many code paths.

This is a follow-up to the previous swap cluster allocator series:
https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/

This series is based on a follow-up fix for the swap cluster
allocator:
https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/

This is part of the new swap allocator work item discussed in
Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
"mTHP and swap allocator" discussion at LPC 2024.

The previous series introduced a fully cluster-based allocation
algorithm. This series completely gets rid of the old allocation path
and makes the allocator avoid grabbing si->lock unless needed. This
brings a huge performance gain and gets rid of the slot cache on the
freeing path.

Currently, swap locking is mainly composed of two locks: the cluster
lock (ci->lock) and the device lock (si->lock). The device lock is
widely used to protect many things, making it the main bottleneck for
swap.

The cluster lock is much more fine-grained, so it is best to use
ci->lock instead of si->lock as much as possible.

`perf lock` shows this issue clearly. Doing a Linux kernel build
using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg and
4k pages), the result of "perf lock contention -ab sleep 3" is:

  contended   total wait     max wait     avg wait         type   caller

     34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
     16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
     11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
      4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
      4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
    406027      2.74 s       2.59 ms      6.74 us     spinlock   list_lru_add+0x39
  ...snip...

The top 5 callers are all users of si->lock; their total wait time adds
up to about 152 seconds, i.e. several minutes, within the 3-second
window.

Following the new allocator design, many operations don't need to touch
si->lock at all. We only need to take si->lock when doing operations
across multiple clusters (e.g. changing the cluster list); other
operations only need to take ci->lock. So ideally the allocator should
always take ci->lock first, then, if needed, take si->lock. But for
historical reasons, ci->lock is nested inside si->lock by design, so
simply trying to acquire si->lock after acquiring ci->lock would cause
a lock inversion.
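
The ordering described above can be sketched in userspace as follows.
This is only an illustration under stated assumptions, not code from
this series: pthread mutexes stand in for kernel spinlocks, and every
model_* name and field is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace model; these are not the kernel's structures. */
struct model_cluster {
	pthread_mutex_t lock;		/* plays the role of ci->lock */
	unsigned int count;		/* slots in use in this cluster */
	int on_free_list;		/* 1 while listed as a free cluster */
};

struct model_device {
	pthread_mutex_t lock;		/* plays the role of si->lock */
	unsigned int nr_free_clusters;	/* cross-cluster state */
};

/*
 * Per-cluster work takes only the cluster lock. The device lock is
 * taken, nested inside the cluster lock, only when state shared
 * across clusters has to change.
 */
static void model_alloc_slot(struct model_device *dev,
			     struct model_cluster *ci)
{
	pthread_mutex_lock(&ci->lock);
	ci->count++;
	if (ci->on_free_list) {
		/* The cluster leaves the free list: cross-cluster update. */
		pthread_mutex_lock(&dev->lock);
		assert(dev->nr_free_clusters > 0);
		dev->nr_free_clusters--;
		pthread_mutex_unlock(&dev->lock);
		ci->on_free_list = 0;
	}
	pthread_mutex_unlock(&ci->lock);
}
```

In this model only the first allocation from a free cluster touches the
device lock; every later allocation from the same cluster stays on the
cluster lock alone. The model is only deadlock-free if all other paths
use the same nesting order.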

This series audits all si->lock usage, simplifies legacy code, and
eliminates usage of si->lock as much as possible by introducing new
designs based on the new cluster allocator.

The old HDD allocation code is removed, and the cluster allocator is
adapted with small changes for HDD usage; tests look OK.

This series also removes the slot cache on the freeing path.
Performance is better without it, and this enables other cleanups and
optimizations as discussed before:
https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/

After this series, lock contention on si->lock is nearly unobservable
with `perf lock` in the same test as above:

  contended   total wait     max wait     avg wait         type   caller
  ... snip ...
         91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
  ... snip ...
         47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
  ... snip ...
         23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
  ... snip ...
         17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
  ... snip ...

cluster_move and cluster_isolate_lock are basically the only users
of si->lock now. The performance gain is huge, with reduced LOC.

Tests
===

Build kernel with defconfig on tmpfs with ZRAM as swap:
---

Running a test matrix that is scaled up progressively for an intuitive
result. The tests are run on top of tmpfs, using a memory cgroup to
limit memory, on a 48c96t system.

12 test runs for each case. It can be seen clearly that the performance
gain grows as the number of concurrent jobs goes higher, and performance
is better even with low concurrency.

   make -j<NR>     |   System Time (seconds)  |   Total Time (seconds)
 (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
 With 4k pages only:
  6 / 192M / 3G    |    5258 /  5235 / -0.3%  |    1420 /  1414 / -0.3%
 12 / 256M / 4G    |    5518 /  5337 / -3.3%  |     758 /   742 / -2.1%
 24 / 384M / 5G    |    7091 /  5766 / -18.7% |     476 /   422 / -11.3%
 48 / 768M / 7G    |   11139 /  5831 / -47.7% |     330 /   221 / -33.0%
 96 / 1.5G / 10G   |   21303 / 11353 / -46.7% |     283 /   180 / -36.4%
 With 64k mTHP:
 24 / 512M / 5G    |    5104 /  4641 / -9.1%  |     376 /   358 / -4.8%
 48 /   1G / 7G    |    8693 /  4662 / -46.4% |     257 /   176 / -31.5%
 96 /   2G / 10G   |   17056 / 10263 / -39.8% |     234 /   169 / -27.8%

With a more aggressive setup, it clearly shows that both performance
and fragmentation are better:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
(avg of 4 test runs)
Before:
Sys time: 73578.30, Real time: 864.05
time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
After: (-54.7% sys time, -49.3% real time)
Sys time: 33314.76, Real time: 437.67

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
(avg of 4 test runs)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333
After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
Sys time: 35958.87, Real time: 442.69
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330

There is an up to 54.7% improvement for the kernel build test, and a
lower fragmentation rate. The performance improvement should be even
larger for micro benchmarks.
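
For reference, the deltas quoted in these results are relative changes
against the "Before" value. A minimal sketch of that arithmetic, with a
hypothetical helper name, is:

```c
#include <assert.h>

/*
 * Relative change in percent. A negative result means the "after"
 * value is smaller than the "before" value.
 * delta_percent() is a hypothetical helper, not part of the series.
 */
static double delta_percent(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For example, delta_percent(73578.30, 33314.76) gives roughly -54.7 for
the sys time of the 4K pages run, and delta_percent(430333, 158330)
gives roughly -63.2 for the swpout_fallback count of the 64K mTHP run.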

Build kernel with tinyconfig on tmpfs with HDD as swap:
---

This test is similar to the one above, but the HDD test is very noisy
and slow and the deviation is huge, so tinyconfig is used instead and
the median result of 3 test runs is taken, which looks OK:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this series:
113.90user 23.81system 38:11.77elapsed 6%CPU
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Single thread SWAP:
---

Sequential swap should also be slightly faster as we removed a lot of
unnecessary parts. Tested using a micro benchmark that swaps out/in 4G
of zero memory using ZRAM, 10 test runs:

Swapout Before (avg. 3359304):
3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776

Swapin Before (avg. 1928698):
1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155

Swapout After (avg. 3347511, -0.4%):
3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359

Swapin After (avg. 1922290, -0.3%):
1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

It is worth noting that the patch "mm, swap: use a global swap cluster
for non-rotation device" introduces minor overhead for certain tests
(see the test results in its commit message), but the gain from later
commits covers that, and it can be further improved later.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>

Kairui Song (13):
  mm, swap: minor clean up for swap entry allocation
  mm, swap: fold swap_info_get_cont in the only caller
  mm, swap: remove old allocation path for HDD
  mm, swap: use cluster lock for HDD
  mm, swap: clean up device availability check
  mm, swap: clean up plist removal and adding
  mm, swap: hold a reference of si during scan and clean up flags
  mm, swap: use an enum to define all cluster flags and wrap flags
    changes
  mm, swap: reduce contention on device lock
  mm, swap: simplify percpu cluster updating
  mm, swap: introduce a helper for retrieving cluster from offset
  mm, swap: use a global swap cluster for non-rotation device
  mm, swap_slots: remove slot cache for freeing path

 fs/btrfs/inode.c           |    1 -
 fs/iomap/swapfile.c        |    1 -
 include/linux/swap.h       |   36 +-
 include/linux/swap_slots.h |    3 -
 mm/page_io.c               |    1 -
 mm/swap_slots.c            |   78 +--
 mm/swapfile.c              | 1198 ++++++++++++++++--------------------
 7 files changed, 558 insertions(+), 760 deletions(-)

-- 
2.47.0


* [PATCH 01/13] mm, swap: minor clean up for swap entry allocation
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Direct reclaim can skip the whole folio after reclaiming a set of
folio-based slots. Also simplify the allocation code and reduce
indentation.
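
The skipping behaviour can be illustrated with a small userspace model.
It is a sketch under assumptions, not the kernel code: the model_*
names are hypothetical, MODEL_HAS_CACHE is a stand-in for
SWAP_HAS_CACHE, and model_reclaim() is assumed to free up to one
folio's worth of cache-only slots and return how many it freed.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HAS_CACHE 0x40	/* stand-in for SWAP_HAS_CACHE */

/* Hypothetical reclaim: frees up to folio_nr cache-only slots. */
static int model_reclaim(unsigned char *map, size_t offset, size_t end,
			 int folio_nr)
{
	int n = 0;

	while (n < folio_nr && offset + n < end &&
	       map[offset + n] == MODEL_HAS_CACHE) {
		map[offset + n] = 0;
		n++;
	}
	return n;
}

/*
 * Returns 1 if [start, end) is entirely free afterwards, 0 otherwise.
 * *steps counts loop iterations, to show that a reclaimed folio is
 * skipped in one step instead of being visited slot by slot.
 */
static int model_reclaim_range(unsigned char *map, size_t start,
			       size_t end, int folio_nr, int *steps)
{
	size_t offset = start;
	int nr_reclaim;

	*steps = 0;
	if (start >= end)
		return 1;

	do {
		(*steps)++;
		if (map[offset] == 0) {
			offset++;
		} else if (map[offset] == MODEL_HAS_CACHE) {
			nr_reclaim = model_reclaim(map, offset, end, folio_nr);
			if (nr_reclaim <= 0)
				return 0;
			offset += nr_reclaim;	/* skip the whole folio */
		} else {
			return 0;		/* slot is still in use */
		}
	} while (offset < end);

	return 1;
}
```

With an 8-slot range holding one 4-slot cached folio, the model needs
5 iterations instead of 8.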

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 59 +++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..1128cea95c47 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -604,23 +604,28 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  unsigned long start, unsigned long end)
 {
 	unsigned char *map = si->swap_map;
-	unsigned long offset;
+	unsigned long offset = start;
+	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
 	spin_unlock(&si->lock);
 
-	for (offset = start; offset < end; offset++) {
+	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
-			continue;
+			offset++;
+			break;
 		case SWAP_HAS_CACHE:
-			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
-				continue;
-			goto out;
+			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
+			if (nr_reclaim > 0)
+				offset += nr_reclaim;
+			else
+				goto out;
+			break;
 		default:
 			goto out;
 		}
-	}
+	} while (offset < end);
 out:
 	spin_lock(&si->lock);
 	spin_lock(&ci->lock);
@@ -826,35 +831,30 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 							 &found, order, usage);
 			frags++;
 			if (found)
-				break;
+				goto done;
 		}
 
-		if (!found) {
+		/*
+		 * Nonfull clusters are moved to frag tail if we reached
+		 * here, count them too, don't over scan the frag list.
+		 */
+		while (frags < si->frag_cluster_nr[order]) {
+			ci = list_first_entry(&si->frag_clusters[order],
+					      struct swap_cluster_info, list);
 			/*
-			 * Nonfull clusters are moved to frag tail if we reached
-			 * here, count them too, don't over scan the frag list.
+			 * Rotate the frag list to iterate, they were all failing
+			 * high order allocation or moved here due to per-CPU usage,
+			 * this help keeping usable cluster ahead.
 			 */
-			while (frags < si->frag_cluster_nr[order]) {
-				ci = list_first_entry(&si->frag_clusters[order],
-						      struct swap_cluster_info, list);
-				/*
-				 * Rotate the frag list to iterate, they were all failing
-				 * high order allocation or moved here due to per-CPU usage,
-				 * this help keeping usable cluster ahead.
-				 */
-				list_move_tail(&ci->list, &si->frag_clusters[order]);
-				offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-								 &found, order, usage);
-				frags++;
-				if (found)
-					break;
-			}
+			list_move_tail(&ci->list, &si->frag_clusters[order]);
+			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							 &found, order, usage);
+			frags++;
+			if (found)
+				goto done;
 		}
 	}
 
-	if (found)
-		goto done;
-
 	if (!list_empty(&si->discard_clusters)) {
 		/*
 		 * we don't have free cluster but have some clusters in
@@ -892,7 +892,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 				goto done;
 		}
 	}
-
 done:
 	cluster->next[order] = offset;
 	return found;
-- 
2.47.0




* [PATCH 02/13] mm, swap: fold swap_info_get_cont in the only caller
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

The name of the function is confusing, and the code is much easier to
follow after folding it into its only caller. Also rename the confusing
"p" to the more meaningful "si".
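
The folded logic relies on the entries being sorted by device, so that
the lock is handed over only when the device changes. A userspace
sketch of that pattern follows; it is an illustration only, with
hypothetical model_* names and a plain flag instead of a real spinlock.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: a flag and counters instead of a real spinlock. */
struct model_dev {
	int locked;
	int lock_count;		/* how many times the lock was taken */
	int freed;		/* entries freed on this device */
};

struct model_entry {
	int dev;		/* device index, like the swap type */
	unsigned long offset;
};

static int model_entry_cmp(const void *a, const void *b)
{
	const struct model_entry *ea = a, *eb = b;

	return (ea->dev > eb->dev) - (ea->dev < eb->dev);
}

static void model_free_entries(struct model_dev *devs,
			       struct model_entry *entries, int n)
{
	struct model_dev *si = NULL, *prev = NULL;
	int i;

	if (n <= 0)
		return;

	/* Sort by device so each device lock is taken only once. */
	qsort(entries, n, sizeof(entries[0]), model_entry_cmp);

	for (i = 0; i < n; i++) {
		si = &devs[entries[i].dev];
		if (si != prev) {
			if (prev) {
				assert(prev->locked);
				prev->locked = 0;	/* unlock old device */
			}
			assert(!si->locked);
			si->locked = 1;			/* lock new device */
			si->lock_count++;
		}
		si->freed++;
		prev = si;
	}
	if (si)
		si->locked = 0;
}
```

Unlike the kernel function, this model assumes every entry maps to a
valid device, so it has no NULL handling for a failed lookup.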

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 39 +++++++++++++++------------------------
 1 file changed, 15 insertions(+), 24 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1128cea95c47..e1e4a1ba4fc5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1359,22 +1359,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
-static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
-					struct swap_info_struct *q)
-{
-	struct swap_info_struct *p;
-
-	p = _swap_info_get(entry);
-
-	if (p != q) {
-		if (q != NULL)
-			spin_unlock(&q->lock);
-		if (p != NULL)
-			spin_lock(&p->lock);
-	}
-	return p;
-}
-
 static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
 					      unsigned long offset,
 					      unsigned char usage)
@@ -1671,14 +1655,14 @@ static int swp_entry_cmp(const void *ent1, const void *ent2)
 
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
-	struct swap_info_struct *p, *prev;
+	struct swap_info_struct *si, *prev;
 	int i;
 
 	if (n <= 0)
 		return;
 
 	prev = NULL;
-	p = NULL;
+	si = NULL;
 
 	/*
 	 * Sort swap entries by swap device, so each lock is only taken once.
@@ -1688,13 +1672,20 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 	if (nr_swapfiles > 1)
 		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
-		p = swap_info_get_cont(entries[i], prev);
-		if (p)
-			swap_entry_range_free(p, entries[i], 1);
-		prev = p;
+		si = _swap_info_get(entries[i]);
+
+		if (si != prev) {
+			if (prev != NULL)
+				spin_unlock(&prev->lock);
+			if (si != NULL)
+				spin_lock(&si->lock);
+		}
+		if (si)
+			swap_entry_range_free(si, entries[i], 1);
+		prev = si;
 	}
-	if (p)
-		spin_unlock(&p->lock);
+	if (si)
+		spin_unlock(&si->lock);
 }
 
 int __swap_count(swp_entry_t entry)
-- 
2.47.0




* [PATCH 03/13] mm, swap: remove old allocation path for HDD
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

We are currently using different swap allocation algorithms for HDD and
non-HDD. This leads to the existence of different sets of locking, and
the code path is heavily bloated, causing trouble for further
optimization and maintenance.

This commit removes all HDD swap allocation and related dead code, and
uses the cluster allocation algorithm instead.

The performance may drop a little bit temporarily, but the drop should
be negligible: the main advantage of the legacy HDD allocation
algorithm is that it tends to use contiguous slots, but a swap device
gets fragmented quickly anyway, and the attempt to use contiguous slots
will fail easily.

This commit also enables mTHP swap on HDD, which should be beneficial,
and following commits will adapt and optimize the cluster allocator
for HDD.

Suggested-by: Chris Li <chrisl@kernel.org>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   3 -
 mm/swapfile.c        | 235 ++-----------------------------------------
 2 files changed, 9 insertions(+), 229 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f3e0ac20c2e8..3a71198a6957 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -309,9 +309,6 @@ struct swap_info_struct {
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
-	unsigned int cluster_next;	/* likely index for next allocation */
-	unsigned int cluster_nr;	/* countdown to next cluster search */
-	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e1e4a1ba4fc5..ffdf7eedecb5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -989,49 +989,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
 }
 
-static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
-{
-	unsigned long prev;
-
-	if (!(si->flags & SWP_SOLIDSTATE)) {
-		si->cluster_next = next;
-		return;
-	}
-
-	prev = this_cpu_read(*si->cluster_next_cpu);
-	/*
-	 * Cross the swap address space size aligned trunk, choose
-	 * another trunk randomly to avoid lock contention on swap
-	 * address space if possible.
-	 */
-	if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
-	    (next >> SWAP_ADDRESS_SPACE_SHIFT)) {
-		/* No free swap slots available */
-		if (si->highest_bit <= si->lowest_bit)
-			return;
-		next = get_random_u32_inclusive(si->lowest_bit, si->highest_bit);
-		next = ALIGN_DOWN(next, SWAP_ADDRESS_SPACE_PAGES);
-		next = max_t(unsigned int, next, si->lowest_bit);
-	}
-	this_cpu_write(*si->cluster_next_cpu, next);
-}
-
-static bool swap_offset_available_and_locked(struct swap_info_struct *si,
-					     unsigned long offset)
-{
-	if (data_race(!si->swap_map[offset])) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	if (vm_swap_full() && READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	return false;
-}
-
 static int cluster_alloc_swap(struct swap_info_struct *si,
 			     unsigned char usage, int nr,
 			     swp_entry_t slots[], int order)
@@ -1055,13 +1012,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[], int order)
 {
-	unsigned long offset;
-	unsigned long scan_base;
-	unsigned long last_in_cluster = 0;
-	int latency_ration = LATENCY_LIMIT;
 	unsigned int nr_pages = 1 << order;
-	int n_ret = 0;
-	bool scanned_many = false;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -1073,7 +1024,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * But we do now try to find an empty cluster.  -Andrea
 	 * And we let swap pages go all over an SSD partition.  Hugh
 	 */
-
 	if (order > 0) {
 		/*
 		 * Should not even be attempting large allocations when huge
@@ -1093,158 +1043,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			return 0;
 	}
 
-	if (si->cluster_info)
-		return cluster_alloc_swap(si, usage, nr, slots, order);
-
-	si->flags += SWP_SCANNING;
-
-	/* For HDD, sequential access is more important. */
-	scan_base = si->cluster_next;
-	offset = scan_base;
-
-	if (unlikely(!si->cluster_nr--)) {
-		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
-			si->cluster_nr = SWAPFILE_CLUSTER - 1;
-			goto checks;
-		}
-
-		spin_unlock(&si->lock);
-
-		/*
-		 * If seek is expensive, start searching for new cluster from
-		 * start of partition, to minimize the span of allocated swap.
-		 */
-		scan_base = offset = si->lowest_bit;
-		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
-
-		/* Locate the first empty (unaligned) cluster */
-		for (; last_in_cluster <= READ_ONCE(si->highest_bit); offset++) {
-			if (si->swap_map[offset])
-				last_in_cluster = offset + SWAPFILE_CLUSTER;
-			else if (offset == last_in_cluster) {
-				spin_lock(&si->lock);
-				offset -= SWAPFILE_CLUSTER - 1;
-				si->cluster_next = offset;
-				si->cluster_nr = SWAPFILE_CLUSTER - 1;
-				goto checks;
-			}
-			if (unlikely(--latency_ration < 0)) {
-				cond_resched();
-				latency_ration = LATENCY_LIMIT;
-			}
-		}
-
-		offset = scan_base;
-		spin_lock(&si->lock);
-		si->cluster_nr = SWAPFILE_CLUSTER - 1;
-	}
-
-checks:
-	if (!(si->flags & SWP_WRITEOK))
-		goto no_page;
-	if (!si->highest_bit)
-		goto no_page;
-	if (offset > si->highest_bit)
-		scan_base = offset = si->lowest_bit;
-
-	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
-		int swap_was_freed;
-		spin_unlock(&si->lock);
-		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
-		spin_lock(&si->lock);
-		/* entry was freed successfully, try to use this again */
-		if (swap_was_freed > 0)
-			goto checks;
-		goto scan; /* check next one */
-	}
-
-	if (si->swap_map[offset]) {
-		if (!n_ret)
-			goto scan;
-		else
-			goto done;
-	}
-	memset(si->swap_map + offset, usage, nr_pages);
-
-	swap_range_alloc(si, offset, nr_pages);
-	slots[n_ret++] = swp_entry(si->type, offset);
-
-	/* got enough slots or reach max slots? */
-	if ((n_ret == nr) || (offset >= si->highest_bit))
-		goto done;
-
-	/* search for next available slot */
-
-	/* time to take a break? */
-	if (unlikely(--latency_ration < 0)) {
-		if (n_ret)
-			goto done;
-		spin_unlock(&si->lock);
-		cond_resched();
-		spin_lock(&si->lock);
-		latency_ration = LATENCY_LIMIT;
-	}
-
-	if (si->cluster_nr && !si->swap_map[++offset]) {
-		/* non-ssd case, still more slots in cluster? */
-		--si->cluster_nr;
-		goto checks;
-	}
-
-	/*
-	 * Even if there's no free clusters available (fragmented),
-	 * try to scan a little more quickly with lock held unless we
-	 * have scanned too many slots already.
-	 */
-	if (!scanned_many) {
-		unsigned long scan_limit;
-
-		if (offset < scan_base)
-			scan_limit = scan_base;
-		else
-			scan_limit = si->highest_bit;
-		for (; offset <= scan_limit && --latency_ration > 0;
-		     offset++) {
-			if (!si->swap_map[offset])
-				goto checks;
-		}
-	}
-
-done:
-	if (order == 0)
-		set_cluster_next(si, offset + 1);
-	si->flags -= SWP_SCANNING;
-	return n_ret;
-
-scan:
-	VM_WARN_ON(order > 0);
-	spin_unlock(&si->lock);
-	while (++offset <= READ_ONCE(si->highest_bit)) {
-		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
-			latency_ration = LATENCY_LIMIT;
-			scanned_many = true;
-		}
-		if (swap_offset_available_and_locked(si, offset))
-			goto checks;
-	}
-	offset = si->lowest_bit;
-	while (offset < scan_base) {
-		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
-			latency_ration = LATENCY_LIMIT;
-			scanned_many = true;
-		}
-		if (swap_offset_available_and_locked(si, offset))
-			goto checks;
-		offset++;
-	}
-	spin_lock(&si->lock);
-
-no_page:
-	si->flags -= SWP_SCANNING;
-	return n_ret;
+	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
 
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
@@ -2855,8 +2654,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
-	free_percpu(p->cluster_next_cpu);
-	p->cluster_next_cpu = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
 	kvfree(cluster_info);
@@ -3168,8 +2965,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 	}
 
 	si->lowest_bit  = 1;
-	si->cluster_next = 1;
-	si->cluster_nr = 0;
 
 	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
@@ -3255,7 +3050,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 						unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
 	struct swap_cluster_info *cluster_info;
 	unsigned long i, j, k, idx;
 	int cpu, err = -ENOMEM;
@@ -3267,15 +3061,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->cluster_next_cpu = alloc_percpu(unsigned int);
-	if (!si->cluster_next_cpu)
-		goto err_free;
-
-	/* Random start position to help with wear leveling */
-	for_each_possible_cpu(cpu)
-		per_cpu(*si->cluster_next_cpu, cpu) =
-		get_random_u32_inclusive(1, si->highest_bit);
-
 	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
 	if (!si->percpu_cluster)
 		goto err_free;
@@ -3317,7 +3102,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * sharing same address space.
 	 */
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
-		j = (k + col) % SWAP_CLUSTER_COLS;
+		j = k % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
 			struct swap_cluster_info *ci;
 			idx = i * SWAP_CLUSTER_COLS + j;
@@ -3467,18 +3252,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	if (si->bdev && bdev_nonrot(si->bdev)) {
 		si->flags |= SWP_SOLIDSTATE;
-
-		cluster_info = setup_clusters(si, swap_header, maxpages);
-		if (IS_ERR(cluster_info)) {
-			error = PTR_ERR(cluster_info);
-			cluster_info = NULL;
-			goto bad_swap_unlock_inode;
-		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
 
+	cluster_info = setup_clusters(si, swap_header, maxpages);
+	if (IS_ERR(cluster_info)) {
+		error = PTR_ERR(cluster_info);
+		cluster_info = NULL;
+		goto bad_swap_unlock_inode;
+	}
+
 	if ((swap_flags & SWAP_FLAG_DISCARD) &&
 	    si->bdev && bdev_max_discard_sectors(si->bdev)) {
 		/*
@@ -3559,8 +3344,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
-	free_percpu(si->cluster_next_cpu);
-	si->cluster_next_cpu = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.0




* [PATCH 04/13] mm, swap: use cluster lock for HDD
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

The cluster lock (ci->lock) was introduced to reduce contention for
certain operations. Using the cluster lock for HDD is not helpful, as
HDDs have poor performance and locking isn't the bottleneck. But having
different sets of locks for HDD / non-HDD prevents further rework of
the device lock (si->lock).

This commit just changes all lock_cluster_or_swap_info to lock_cluster,
which is a safe and straightforward conversion since cluster info is
always allocated now, and also removes all cluster_info related checks.
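
Once cluster info is always allocated, finding and locking the cluster
for a slot is a plain array index by offset. A userspace sketch, with
hypothetical model_* names, a pthread mutex in place of the spinlock,
and an assumed cluster size of 512 slots chosen only for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Assumed cluster size in slots; the kernel value depends on config. */
#define MODEL_SWAPFILE_CLUSTER 512UL

struct model_cluster {
	pthread_mutex_t lock;	/* plays the role of ci->lock */
	unsigned int count;
};

struct model_swap_info {
	struct model_cluster *cluster_info;	/* always allocated */
	size_t nr_clusters;
};

/* Lock and return the cluster covering the given slot offset. */
static struct model_cluster *model_lock_cluster(struct model_swap_info *si,
						unsigned long offset)
{
	size_t idx = offset / MODEL_SWAPFILE_CLUSTER;
	struct model_cluster *ci;

	assert(idx < si->nr_clusters);
	ci = &si->cluster_info[idx];
	pthread_mutex_lock(&ci->lock);
	return ci;
}

static void model_unlock_cluster(struct model_cluster *ci)
{
	pthread_mutex_unlock(&ci->lock);
}
```

With no NULL case left, callers no longer need a fallback to the device
lock, which is what allows the *_or_swap_info helpers to go away.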

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 107 ++++++++++++++++----------------------------------
 1 file changed, 34 insertions(+), 73 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index ffdf7eedecb5..f8e70bb5f1d7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,10 +58,9 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset);
-static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					struct swap_cluster_info *ci);
+static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+					      unsigned long offset);
+static void unlock_cluster(struct swap_cluster_info *ci);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -222,9 +221,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * swap_map is HAS_CACHE only, which means the slots have no page table
 	 * reference or pending writeback, and can't be allocated to others.
 	 */
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	need_reclaim = swap_is_has_cache(si, offset, nr_pages);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!need_reclaim)
 		goto out_unlock;
 
@@ -404,45 +403,15 @@ static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si
 {
 	struct swap_cluster_info *ci;
 
-	ci = si->cluster_info;
-	if (ci) {
-		ci += offset / SWAPFILE_CLUSTER;
-		spin_lock(&ci->lock);
-	}
-	return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
-	if (ci)
-		spin_unlock(&ci->lock);
-}
-
-/*
- * Determine the locking method in use for this device.  Return
- * swap_cluster_info if SSD-style cluster-based locking is in place.
- */
-static inline struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset)
-{
-	struct swap_cluster_info *ci;
-
-	/* Try to use fine-grained SSD-style locking if available: */
-	ci = lock_cluster(si, offset);
-	/* Otherwise, fall back to traditional, coarse locking: */
-	if (!ci)
-		spin_lock(&si->lock);
+	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	spin_lock(&ci->lock);
 
 	return ci;
 }
 
-static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					       struct swap_cluster_info *ci)
+static inline void unlock_cluster(struct swap_cluster_info *ci)
 {
-	if (ci)
-		unlock_cluster(ci);
-	else
-		spin_unlock(&si->lock);
+	spin_unlock(&ci->lock);
 }
 
 /* Add a cluster to discard list and schedule it to do discard */
@@ -558,9 +527,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 	struct swap_cluster_info *ci;
 
-	if (!cluster_info)
-		return;
-
 	ci = cluster_info + idx;
 	ci->count++;
 
@@ -576,9 +542,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 static void dec_cluster_info_page(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci, int nr_pages)
 {
-	if (!si->cluster_info)
-		return;
-
 	VM_BUG_ON(ci->count < nr_pages);
 	VM_BUG_ON(cluster_is_free(ci));
 	lockdep_assert_held(&si->lock);
@@ -995,8 +958,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;
 
-	VM_BUG_ON(!si->cluster_info);
-
 	while (n_ret < nr) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
 
@@ -1036,10 +997,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		}
 
 		/*
-		 * Swapfile is not block device or not using clusters so unable
+		 * Swapfile is not block device so unable
 		 * to allocate large entries.
 		 */
-		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+		if (!(si->flags & SWP_BLKDEV))
 			return 0;
 	}
 
@@ -1279,9 +1240,9 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si,
 	unsigned long offset = swp_offset(entry);
 	unsigned char usage;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	usage = __swap_entry_free_locked(si, offset, 1);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!usage)
 		free_swap_slot(entry);
 
@@ -1304,14 +1265,14 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
 		goto fallback;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (!swap_is_last_map(si, offset, nr, &has_cache)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		goto fallback;
 	}
 	for (i = 0; i < nr; i++)
 		WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 
 	if (!has_cache) {
 		for (i = 0; i < nr; i++)
@@ -1367,7 +1328,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
 	int i, nr;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	while (nr_pages) {
 		nr = min(BITS_PER_LONG, nr_pages);
 		for (i = 0; i < nr; i++) {
@@ -1375,18 +1336,18 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 				bitmap_set(to_free, i, 1);
 		}
 		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			for_each_set_bit(i, to_free, BITS_PER_LONG)
 				free_swap_slot(swp_entry(si->type, offset + i));
 			if (nr == nr_pages)
 				return;
 			bitmap_clear(to_free, 0, BITS_PER_LONG);
-			ci = lock_cluster_or_swap_info(si, offset);
+			ci = lock_cluster(si, offset);
 		}
 		offset += nr;
 		nr_pages -= nr;
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 /*
@@ -1425,9 +1386,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	if (!si)
 		return;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (size > 1 && swap_is_has_cache(si, offset, size)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, size);
 		spin_unlock(&si->lock);
@@ -1435,14 +1396,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	}
 	for (int i = 0; i < size; i++, entry.val++) {
 		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			free_swap_slot(entry);
 			if (i == size - 1)
 				return;
-			lock_cluster_or_swap_info(si, offset);
+			lock_cluster(si, offset);
 		}
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 static int swp_entry_cmp(const void *ent1, const void *ent2)
@@ -1506,9 +1467,9 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 	struct swap_cluster_info *ci;
 	int count;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	count = swap_count(si->swap_map[offset]);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1531,7 +1492,7 @@ int swp_swapcount(swp_entry_t entry)
 
 	offset = swp_offset(entry);
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	count = swap_count(si->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
@@ -1554,7 +1515,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1569,8 +1530,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 	int i;
 	bool ret = false;
 
-	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || nr_pages == 1) {
+	ci = lock_cluster(si, offset);
+	if (nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
@@ -1582,7 +1543,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 		}
 	}
 unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return ret;
 }
 
@@ -3412,7 +3373,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
 	VM_WARN_ON(usage == 1 && nr > 1);
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	err = 0;
 	for (i = 0; i < nr; i++) {
@@ -3467,7 +3428,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	}
 
 unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return err;
 }
 
-- 
2.47.0




* [PATCH 05/13] mm, swap: clean up device availability check
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (3 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 04/13] mm, swap: use cluster lock " Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:24 ` [PATCH 06/13] mm, swap: clean up plist removal and adding Kairui Song
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Remove highest_bit and lowest_bit. After the HDD allocation path is removed,
the only purpose of these two fields is to judge whether the device is full,
which can be done by checking inuse_pages instead.
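
As a rough illustration of the check described above, here is a minimal
userspace sketch. The struct mini_si type and the mini_si_full() helper are
hypothetical stand-ins invented for this example; they are not part of the
patch or of the kernel, and only mirror the two fields the check relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for struct swap_info_struct. */
struct mini_si {
	unsigned int pages;       /* total usable pages of swap */
	unsigned int inuse_pages; /* number of those currently in use */
};

/* The device is full exactly when every usable page is in use. */
static bool mini_si_full(const struct mini_si *si)
{
	return si->inuse_pages == si->pages;
}
```

With this, no separate lowest/highest free index needs to be maintained just
to answer "is there any free slot left".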

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 fs/btrfs/inode.c     |  1 -
 fs/iomap/swapfile.c  |  1 -
 include/linux/swap.h |  2 --
 mm/page_io.c         |  1 -
 mm/swapfile.c        | 38 ++++++++------------------------------
 5 files changed, 8 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5618ca02934a..aba9c0d58998 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10023,7 +10023,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	*span = bsi.highest_ppage - bsi.lowest_ppage + 1;
 	sis->max = bsi.nr_pages;
 	sis->pages = bsi.nr_pages - 1;
-	sis->highest_bit = bsi.nr_pages - 1;
 	return bsi.nr_extents;
 }
 #else
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 5fc0ac36dee3..b90d0eda9e51 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -189,7 +189,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 	*pagespan = 1 + isi.highest_ppage - isi.lowest_ppage;
 	sis->max = isi.nr_pages;
 	sis->pages = isi.nr_pages - 1;
-	sis->highest_bit = isi.nr_pages - 1;
 	return isi.nr_extents;
 }
 EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3a71198a6957..c0d49dad7a4b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -305,8 +305,6 @@ struct swap_info_struct {
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
 	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
-	unsigned int lowest_bit;	/* index of first free in swap_map */
-	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
diff --git a/mm/page_io.c b/mm/page_io.c
index a28d28b6b3ce..c8a25203bcf4 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -163,7 +163,6 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 		page_no = 1;	/* force Empty message */
 	sis->max = page_no;
 	sis->pages = page_no - 1;
-	sis->highest_bit = page_no - 1;
 out:
 	return ret;
 bad_bmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f8e70bb5f1d7..e620b41c3120 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -55,7 +55,7 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 static void free_swap_count_continuations(struct swap_info_struct *);
 static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
 				  unsigned int nr_pages);
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
 static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -647,7 +647,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	}
 
 	memset(si->swap_map + start, usage, nr_pages);
-	swap_range_alloc(si, start, nr_pages);
+	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
 	if (ci->count == SWAPFILE_CLUSTER) {
@@ -876,19 +876,11 @@ static void del_from_avail_list(struct swap_info_struct *si)
 	spin_unlock(&swap_avail_lock);
 }
 
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries)
 {
-	unsigned int end = offset + nr_entries - 1;
-
-	if (offset == si->lowest_bit)
-		si->lowest_bit += nr_entries;
-	if (end == si->highest_bit)
-		WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries);
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
 	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
 		del_from_avail_list(si);
 
 		if (vm_swap_full())
@@ -921,15 +913,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	for (i = 0; i < nr_entries; i++)
 		clear_bit(offset + i, si->zeromap);
 
-	if (offset < si->lowest_bit)
-		si->lowest_bit = offset;
-	if (end > si->highest_bit) {
-		bool was_full = !si->highest_bit;
-
-		WRITE_ONCE(si->highest_bit, end);
-		if (was_full && (si->flags & SWP_WRITEOK))
-			add_to_avail_list(si);
-	}
+	if (si->inuse_pages == si->pages)
+		add_to_avail_list(si);
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
 			si->bdev->bd_disk->fops->swap_slot_free_notify;
@@ -1035,15 +1020,12 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
+		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_lists[node])) {
 				spin_unlock(&si->lock);
 				goto nextsi;
 			}
-			WARN(!si->highest_bit,
-			     "swap_info %d in list but !highest_bit\n",
-			     si->type);
 			WARN(!(si->flags & SWP_WRITEOK),
 			     "swap_info %d in list but !SWP_WRITEOK\n",
 			     si->type);
@@ -2425,8 +2407,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	 */
 	plist_add(&si->list, &swap_active_head);
 
-	/* add to available list iff swap device is not full */
-	if (si->highest_bit)
+	/* add to available list if swap device is not full */
+	if (si->inuse_pages < si->pages)
 		add_to_avail_list(si);
 }
 
@@ -2590,7 +2572,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	drain_mmlist();
 
 	/* wait for anyone still in scan_swap_map_slots */
-	p->highest_bit = 0;		/* cuts scans short */
 	while (p->flags >= SWP_SCANNING) {
 		spin_unlock(&p->lock);
 		spin_unlock(&swap_lock);
@@ -2925,8 +2906,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		return 0;
 	}
 
-	si->lowest_bit  = 1;
-
 	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
 	if (!last_page) {
@@ -2943,7 +2922,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		if ((unsigned int)maxpages == 0)
 			maxpages = UINT_MAX;
 	}
-	si->highest_bit = maxpages - 1;
 
 	if (!maxpages)
 		return 0;
-- 
2.47.0




* [PATCH 06/13] mm, swap: clean up plist removal and adding
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (4 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 05/13] mm, swap: clean up device availability check Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:24 ` [PATCH 07/13] mm, swap: hold a reference of si during scan and clean up flags Kairui Song
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

When a swap device is full (inuse_pages == pages), it should be
removed from the plist. When any slot is freed, the swap device should
be added back to the plist. On swapoff / swapon, the swap device is
also forcibly removed / added.

This is currently serialized by si->lock, and some historical sanity
check code is still here. This commit decouples it from the protection
of si->lock and cleans it up to prepare for the si->lock rework.

The inuse_pages counter is the only thing that decides whether a device
should be removed from or added to the plist (except swapon / swapoff
as a special case), and inuse_pages is a very hot counter. So to avoid
extra overhead on the counter update hot path, and to make it possible to
check and update the plist when the counter value changes, embed the
plist state into the inuse_pages counter, and turn the counter into
an atomic. This way we can check and update the counter with one CAS
and avoid any extra synchronization.

If the counter is full (inuse_pages == pages) with the off-list bit
unset, try to remove it from the plist. If the counter is not full
(inuse_pages != pages) with the off-list bit set, try to add it to
the plist. Removing and adding are serialized by swap_avail_lock, as is
the bit setting. Ordinary counter updates are lockless.
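
The idea of folding the list state into the counter can be sketched in
userspace C11 as follows. This is a simplified, single-threaded illustration
only: struct mini_dev, OFFLIST_BIT, the on_list flag (standing in for plist
membership) and all helper names are assumptions made for this example, and
the lock that serializes list changes in the patch is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; the bit position (30) mirrors the one in the patch. */
#define OFFLIST_BIT  (1L << 30)
#define COUNTER_MASK (~OFFLIST_BIT)

struct mini_dev {
	long pages;        /* capacity in pages */
	atomic_long usage; /* in-use count, plus OFFLIST_BIT when off the list */
	bool on_list;      /* stand-in for plist membership (lock omitted) */
};

/* Read the page count with the state bit masked out. */
static long usage_in_pages(struct mini_dev *d)
{
	return atomic_load(&d->usage) & COUNTER_MASK;
}

/* Take a full device off the list; the CAS fails if a slot was freed. */
static bool try_off_list(struct mini_dev *d)
{
	long expected = d->pages;

	if (!atomic_compare_exchange_strong(&d->usage, &expected,
					    d->pages | OFFLIST_BIT))
		return false;
	d->on_list = false;
	return true;
}

/* Allocation side: only the updater that fills the device sees val == pages. */
static void usage_add(struct mini_dev *d, long nr)
{
	long val = atomic_fetch_add(&d->usage, nr) + nr;

	if (val == d->pages)
		try_off_list(d);
}

/* Free side: if the off-list bit is visible, clear it and re-add the device. */
static void usage_sub(struct mini_dev *d, long nr)
{
	long val = atomic_fetch_sub(&d->usage, nr) - nr;

	if (val & OFFLIST_BIT) {
		atomic_fetch_and(&d->usage, ~OFFLIST_BIT);
		d->on_list = true;
	}
}
```

Because the state bit travels with the counter value, an ordinary update learns
from its own return value whether a list change is needed, without taking any
lock on the common path.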

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   2 +-
 mm/swapfile.c        | 182 +++++++++++++++++++++++++++++++------------
 2 files changed, 132 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c0d49dad7a4b..16dcf8bd1a4e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -306,7 +306,7 @@ struct swap_info_struct {
 					/* list of cluster that are fragmented or contented */
 	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
-	unsigned int inuse_pages;	/* number of those currently in use */
+	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e620b41c3120..4e629536a07c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -128,6 +128,25 @@ static inline unsigned char swap_count(unsigned char ent)
 	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
 }
 
+/*
+ * Use the second highest bit of inuse_pages as the indicator
+ * of if one swap device is on the allocation plist.
+ *
+ * inuse_pages is the only thing that decides if a device should be on
+ * the list or not (except swapoff as a special case). By embedding the
+ * on-list bit into it, updaters don't need any lock to check the
+ * device list status.
+ *
+ * This bit will be set to 1 if the device is not on the plist and not
+ * usable, will be cleared if the device is on the plist.
+ */
+#define SWAP_USAGE_OFFLIST_BIT (1UL << (BITS_PER_TYPE(atomic_t) - 2))
+#define SWAP_USAGE_COUNTER_MASK (~SWAP_USAGE_OFFLIST_BIT)
+static long swap_usage_in_pages(struct swap_info_struct *si)
+{
+	return atomic_long_read(&si->inuse_pages) & SWAP_USAGE_COUNTER_MASK;
+}
+
 /* Reclaim the swap entry anyway if possible */
 #define TTRS_ANYWAY		0x1
 /*
@@ -709,7 +728,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	int nr_reclaim;
 
 	if (force)
-		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
+		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
 	while (!list_empty(&si->full_clusters)) {
 		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
@@ -860,42 +879,121 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	return found;
 }
 
-static void __del_from_avail_list(struct swap_info_struct *si)
+/*
+ * SWAP_USAGE_OFFLIST_BIT can only be set by this helper and synced with
+ * counter updaters with atomic.
+ */
+static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 {
 	int nid;
 
-	assert_spin_locked(&si->lock);
+	spin_lock(&swap_avail_lock);
+
+	if (swapoff) {
+		/* Clear SWP_WRITEOK so add_to_avail_list won't add it back */
+		si->flags &= ~SWP_WRITEOK;
+
+		/* Force take it off. */
+		atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
+	} else {
+		/*
+		 * If not swapoff, take it off-list only when it's full and
+		 * SWAP_USAGE_OFFLIST_BIT is not set (inuse_pages == pages).
+		 * The cmpxchg below will fail and skip the removal if there
+		 * are slots freed or device is off-listed by someone else.
+		 */
+		if (atomic_long_cmpxchg(&si->inuse_pages, si->pages,
+					si->pages | SWAP_USAGE_OFFLIST_BIT) != si->pages)
+			goto skip;
+	}
+
 	for_each_node(nid)
 		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
+
+skip:
+	spin_unlock(&swap_avail_lock);
 }
 
-static void del_from_avail_list(struct swap_info_struct *si)
+/*
+ * SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper and synced with
+ * counter updaters with atomic.
+ */
+static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 {
+	int nid;
+	long val;
+	bool swapoff;
+
 	spin_lock(&swap_avail_lock);
-	__del_from_avail_list(si);
+
+	/* Special handling for swapon / swapoff */
+	if (swapon) {
+		si->flags |= SWP_WRITEOK;
+		swapoff = false;
+	} else {
+		swapoff = !(READ_ONCE(si->flags) & SWP_WRITEOK);
+	}
+
+	if (swapoff)
+		goto skip;
+
+	if (!(atomic_long_read(&si->inuse_pages) & SWAP_USAGE_OFFLIST_BIT))
+		goto skip;
+
+	val = atomic_long_fetch_and_relaxed(~SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
+
+	/*
+	 * When device is full and device is on the plist, only one updater will
+	 * see (inuse_pages == si->pages) and will call del_from_avail_list. If
+	 * that updater happen to be here, just skip adding.
+	 */
+	if (val == si->pages) {
+		/* Just like the cmpxchg in del_from_avail_list */
+		if (atomic_long_cmpxchg(&si->inuse_pages, si->pages,
+					si->pages | SWAP_USAGE_OFFLIST_BIT) == si->pages)
+			goto skip;
+	}
+
+	for_each_node(nid)
+		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
+
+skip:
 	spin_unlock(&swap_avail_lock);
 }
 
-static void swap_range_alloc(struct swap_info_struct *si,
-			     unsigned int nr_entries)
+/*
+ * swap_usage_add / swap_usage_sub are serialized by ci->lock in each cluster
+ * so the total contribution to the global counter should always be positive.
+ */
+static bool swap_usage_add(struct swap_info_struct *si, unsigned int nr_entries)
 {
-	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
-	if (si->inuse_pages == si->pages) {
-		del_from_avail_list(si);
+	long val = atomic_long_add_return_relaxed(nr_entries, &si->inuse_pages);
 
-		if (vm_swap_full())
-			schedule_work(&si->reclaim_work);
+	/* If device is full, SWAP_USAGE_OFFLIST_BIT not set, try off list it */
+	if (val == si->pages) {
+		del_from_avail_list(si, false);
+		return true;
 	}
+
+	return false;
 }
 
-static void add_to_avail_list(struct swap_info_struct *si)
+static void swap_usage_sub(struct swap_info_struct *si, unsigned int nr_entries)
 {
-	int nid;
+	long val = atomic_long_sub_return_relaxed(nr_entries, &si->inuse_pages);
 
-	spin_lock(&swap_avail_lock);
-	for_each_node(nid)
-		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
-	spin_unlock(&swap_avail_lock);
+	/* If device is off list, try add it back */
+	if (val & SWAP_USAGE_OFFLIST_BIT)
+		add_to_avail_list(si, false);
+}
+
+static void swap_range_alloc(struct swap_info_struct *si,
+			     unsigned int nr_entries)
+{
+	if (swap_usage_add(si, nr_entries)) {
+		if (vm_swap_full())
+			schedule_work(&si->reclaim_work);
+	}
 }
 
 static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
@@ -913,8 +1011,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	for (i = 0; i < nr_entries; i++)
 		clear_bit(offset + i, si->zeromap);
 
-	if (si->inuse_pages == si->pages)
-		add_to_avail_list(si);
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
 			si->bdev->bd_disk->fops->swap_slot_free_notify;
@@ -928,13 +1024,13 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	}
 	clear_shadow_from_swap_cache(si->type, begin, end);
 
+	atomic_long_add(nr_entries, &nr_swap_pages);
 	/*
 	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
 	 * only after the above cleanups are done.
 	 */
 	smp_wmb();
-	atomic_long_add(nr_entries, &nr_swap_pages);
-	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
+	swap_usage_sub(si, nr_entries);
 }
 
 static int cluster_alloc_swap(struct swap_info_struct *si,
@@ -1020,19 +1116,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
-			spin_lock(&swap_avail_lock);
-			if (plist_node_empty(&si->avail_lists[node])) {
-				spin_unlock(&si->lock);
-				goto nextsi;
-			}
-			WARN(!(si->flags & SWP_WRITEOK),
-			     "swap_info %d in list but !SWP_WRITEOK\n",
-			     si->type);
-			__del_from_avail_list(si);
-			spin_unlock(&si->lock);
-			goto nextsi;
-		}
 		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 					    n_goal, swp_entries, order);
 		spin_unlock(&si->lock);
@@ -1041,7 +1124,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		cond_resched();
 
 		spin_lock(&swap_avail_lock);
-nextsi:
 		/*
 		 * if we got here, it's likely that si was almost full before,
 		 * and since scan_swap_map_slots() can drop the si->lock,
@@ -1773,7 +1855,7 @@ unsigned int count_swap_pages(int type, int free)
 		if (sis->flags & SWP_WRITEOK) {
 			n = sis->pages;
 			if (free)
-				n -= sis->inuse_pages;
+				n -= swap_usage_in_pages(sis);
 		}
 		spin_unlock(&sis->lock);
 	}
@@ -2108,7 +2190,7 @@ static int try_to_unuse(unsigned int type)
 	swp_entry_t entry;
 	unsigned int i;
 
-	if (!READ_ONCE(si->inuse_pages))
+	if (!swap_usage_in_pages(si))
 		goto success;
 
 retry:
@@ -2121,7 +2203,7 @@ static int try_to_unuse(unsigned int type)
 
 	spin_lock(&mmlist_lock);
 	p = &init_mm.mmlist;
-	while (READ_ONCE(si->inuse_pages) &&
+	while (swap_usage_in_pages(si) &&
 	       !signal_pending(current) &&
 	       (p = p->next) != &init_mm.mmlist) {
 
@@ -2149,7 +2231,7 @@ static int try_to_unuse(unsigned int type)
 	mmput(prev_mm);
 
 	i = 0;
-	while (READ_ONCE(si->inuse_pages) &&
+	while (swap_usage_in_pages(si) &&
 	       !signal_pending(current) &&
 	       (i = find_next_to_unuse(si, i)) != 0) {
 
@@ -2184,7 +2266,7 @@ static int try_to_unuse(unsigned int type)
 	 * folio_alloc_swap(), temporarily hiding that swap.  It's easy
 	 * and robust (though cpu-intensive) just to keep retrying.
 	 */
-	if (READ_ONCE(si->inuse_pages)) {
+	if (swap_usage_in_pages(si)) {
 		if (!signal_pending(current))
 			goto retry;
 		return -EINTR;
@@ -2193,7 +2275,7 @@ static int try_to_unuse(unsigned int type)
 success:
 	/*
 	 * Make sure that further cleanups after try_to_unuse() returns happen
-	 * after swap_range_free() reduces si->inuse_pages to 0.
+	 * after swap_range_free() reduces inuse_pages to 0.
 	 */
 	smp_mb();
 	return 0;
@@ -2211,7 +2293,7 @@ static void drain_mmlist(void)
 	unsigned int type;
 
 	for (type = 0; type < nr_swapfiles; type++)
-		if (swap_info[type]->inuse_pages)
+		if (swap_usage_in_pages(swap_info[type]))
 			return;
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
@@ -2390,7 +2472,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
 
 static void _enable_swap_info(struct swap_info_struct *si)
 {
-	si->flags |= SWP_WRITEOK;
 	atomic_long_add(si->pages, &nr_swap_pages);
 	total_swap_pages += si->pages;
 
@@ -2407,9 +2488,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	 */
 	plist_add(&si->list, &swap_active_head);
 
-	/* add to available list if swap device is not full */
-	if (si->inuse_pages < si->pages)
-		add_to_avail_list(si);
+	/* Add back to available list */
+	add_to_avail_list(si, true);
 }
 
 static void enable_swap_info(struct swap_info_struct *si, int prio,
@@ -2507,7 +2587,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		goto out_dput;
 	}
 	spin_lock(&p->lock);
-	del_from_avail_list(p);
+	del_from_avail_list(p, true);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
 		int nid;
@@ -2525,7 +2605,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	plist_del(&p->list, &swap_active_head);
 	atomic_long_sub(p->pages, &nr_swap_pages);
 	total_swap_pages -= p->pages;
-	p->flags &= ~SWP_WRITEOK;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
@@ -2705,7 +2784,7 @@ static int swap_show(struct seq_file *swap, void *v)
 	}
 
 	bytes = K(si->pages);
-	inuse = K(READ_ONCE(si->inuse_pages));
+	inuse = K(swap_usage_in_pages(si));
 
 	file = si->swap_file;
 	len = seq_file_path(swap, file, " \t\n\\");
@@ -2822,6 +2901,7 @@ static struct swap_info_struct *alloc_swap_info(void)
 	}
 	spin_lock_init(&p->lock);
 	spin_lock_init(&p->cont_lock);
+	atomic_long_set(&p->inuse_pages, SWAP_USAGE_OFFLIST_BIT);
 	init_completion(&p->comp);
 
 	return p;
@@ -3319,7 +3399,7 @@ void si_swapinfo(struct sysinfo *val)
 		struct swap_info_struct *si = swap_info[type];
 
 		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
-			nr_to_be_unused += READ_ONCE(si->inuse_pages);
+			nr_to_be_unused += swap_usage_in_pages(si);
 	}
 	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
 	val->totalswap = total_swap_pages + nr_to_be_unused;
-- 
2.47.0




* [PATCH 07/13] mm, swap: hold a reference of si during scan and clean up flags
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (5 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 06/13] mm, swap: clean up plist removal and adding Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:24 ` [PATCH 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

The flag SWP_SCANNING was used as an indicator of whether a device
is being scanned, and prevented swapoff. But it is no longer
used. The only thing that protects the scanning now is the si lock.

However, the allocation path may drop the si lock, so in theory this
could lead to UAF.

So clean this up and just hold a reference for the whole allocation path,
so that per-CPU counter killing will wait for existing scans and other
usage. The flag SWP_SCANNING can also be dropped.
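
The pattern of holding a reference across the scan can be sketched in
userspace C11 as below. This is a heavily simplified stand-in for the
kernel's percpu_ref, not its implementation: struct mini_ref_dev and every
helper name here are hypothetical, and the scan itself is reduced to a
placeholder.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a reference-protected swap device. */
struct mini_ref_dev {
	atomic_bool live; /* cleared when teardown (swapoff) starts */
	atomic_int users; /* references currently held */
};

/* Take a reference only while the device is still live. */
static bool dev_tryget_live(struct mini_ref_dev *d)
{
	if (!atomic_load(&d->live))
		return false;
	atomic_fetch_add(&d->users, 1);
	/* Re-check: teardown may have started between the load and the add. */
	if (!atomic_load(&d->live)) {
		atomic_fetch_sub(&d->users, 1);
		return false;
	}
	return true;
}

static void dev_put(struct mini_ref_dev *d)
{
	atomic_fetch_sub(&d->users, 1);
}

/* Teardown side: stop new users, then wait until existing ones are gone. */
static void dev_kill(struct mini_ref_dev *d)
{
	atomic_store(&d->live, false);
}

static bool dev_quiesced(struct mini_ref_dev *d)
{
	return atomic_load(&d->users) == 0;
}

/* Allocation path: the reference is held for the whole scan. */
static int alloc_with_ref(struct mini_ref_dev *d)
{
	int n_ret;

	if (!dev_tryget_live(d))
		return 0;
	n_ret = 1; /* placeholder for the scan, which may drop and retake locks */
	dev_put(d);
	return n_ret;
}
```

Teardown would call dev_kill() and then wait for dev_quiesced() before freeing
anything, so a scan that temporarily drops its lock cannot have the device
disappear underneath it.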

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 62 +++++++++++++++++++++++---------------------
 2 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 16dcf8bd1a4e..1651174959c8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -219,7 +219,6 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4e629536a07c..d6b6e71ccc19 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1088,6 +1088,21 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
 
+static bool get_swap_device_info(struct swap_info_struct *si)
+{
+	if (!percpu_ref_tryget_live(&si->users))
+		return false;
+	/*
+	 * Guarantee the si->users are checked before accessing other
+	 * fields of swap_info_struct.
+	 *
+	 * Paired with the spin_unlock() after setup_swap_info() in
+	 * enable_swap_info().
+	 */
+	smp_rmb();
+	return true;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1115,13 +1130,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
-		spin_lock(&si->lock);
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries, order);
-		spin_unlock(&si->lock);
-		if (n_ret || size > 1)
-			goto check_out;
-		cond_resched();
+		if (get_swap_device_info(si)) {
+			spin_lock(&si->lock);
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					n_goal, swp_entries, order);
+			spin_unlock(&si->lock);
+			put_swap_device(si);
+			if (n_ret || size > 1)
+				goto check_out;
+			cond_resched();
+		}
 
 		spin_lock(&swap_avail_lock);
 		/*
@@ -1272,16 +1290,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	si = swp_swap_info(entry);
 	if (!si)
 		goto bad_nofile;
-	if (!percpu_ref_tryget_live(&si->users))
+	if (!get_swap_device_info(si))
 		goto out;
-	/*
-	 * Guarantee the si->users are checked before accessing other
-	 * fields of swap_info_struct.
-	 *
-	 * Paired with the spin_unlock() after setup_swap_info() in
-	 * enable_swap_info().
-	 */
-	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
 		goto put_out;
@@ -1761,10 +1771,13 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-		atomic_long_dec(&nr_swap_pages);
-	spin_unlock(&si->lock);
+	if (get_swap_device_info(si)) {
+		spin_lock(&si->lock);
+		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+			atomic_long_dec(&nr_swap_pages);
+		spin_unlock(&si->lock);
+		put_swap_device(si);
+	}
 fail:
 	return entry;
 }
@@ -2650,15 +2663,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	drain_mmlist();
 
-	/* wait for anyone still in scan_swap_map_slots */
-	while (p->flags >= SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
 	p->max = 0;
-- 
2.47.0




* [PATCH 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (6 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 07/13] mm, swap: hold a reference of si during scan and clean up flags Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:24 ` [PATCH 09/13] mm, swap: reduce contention on device lock Kairui Song
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, flags are only used to indicate which list a cluster is on.
Using one bit per list type is wasteful: as the number of list types
grows, it will consume too many bits. The current mixed usage of "&"
and "==" is also a bit confusing.

Clean this up by using an enum to define all possible cluster statuses;
only an off-list cluster will have the NONE (0) flag. Also use a wrapper
to annotate and sanitize all flag setting and list movement.
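
As a rough illustration of the idea only (a userspace sketch, not the
kernel code in this patch): the model below assumes a hypothetical
`struct model_device` that tracks list membership with a per-status
counter instead of real list heads, and a `model_cluster_move()` wrapper
that is the only place where the flag changes. All `model_` and `FLAG_`
names are made up for this example.

```c
#include <assert.h>

/* Hypothetical userspace model; only the shape follows the patch. */
enum model_flags {
	FLAG_NONE = 0,	/* temporarily off-list */
	FLAG_FREE,
	FLAG_NONFULL,
	FLAG_FRAG,
	FLAG_FULL,
	FLAG_MAX,
};

struct model_cluster {
	enum model_flags flags;
};

struct model_device {
	/* Stand-in for the per-status list heads: clusters per list. */
	unsigned int nr_on_list[FLAG_MAX];
};

/*
 * Single wrapper for every status change, so the flag and the list
 * membership cannot go out of sync.
 */
static void model_cluster_move(struct model_device *dev,
			       struct model_cluster *ci,
			       enum model_flags new_flags)
{
	assert(new_flags > FLAG_NONE && new_flags < FLAG_MAX);
	assert(ci->flags != new_flags);

	/* An off-list cluster is added, an on-list one is moved. */
	if (ci->flags != FLAG_NONE)
		dev->nr_on_list[ci->flags]--;
	dev->nr_on_list[new_flags]++;
	ci->flags = new_flags;
}

/* With enum values the status must be compared with "==", not "&". */
static int model_cluster_is_free(const struct model_cluster *ci)
{
	return ci->flags == FLAG_FREE;
}
```

With sequential enum values a bitwise test is no longer meaningful (in
this model FLAG_FRAG & FLAG_FREE is non-zero), which is why every check
has to become an equality comparison.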

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h | 17 +++++++---
 mm/swapfile.c        | 76 +++++++++++++++++++++++---------------------
 2 files changed, 53 insertions(+), 40 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1651174959c8..75fc2da1767d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -256,10 +256,19 @@ struct swap_cluster_info {
 	u8 order;
 	struct list_head list;
 };
-#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
-#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */
-#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */
+
+/*
+ * All on-list cluster must have a non-zero flag.
+ */
+enum swap_cluster_flags {
+	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
+	CLUSTER_FLAG_FREE,
+	CLUSTER_FLAG_NONFULL,
+	CLUSTER_FLAG_FRAG,
+	CLUSTER_FLAG_FULL,
+	CLUSTER_FLAG_DISCARD,
+	CLUSTER_FLAG_MAX,
+};
 
 /*
  * The first page in the swap file is the swap header, which is always marked
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d6b6e71ccc19..96d8012b003c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -402,7 +402,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_FREE;
+	return info->flags == CLUSTER_FLAG_FREE;
 }
 
 static inline unsigned int cluster_index(struct swap_info_struct *si,
@@ -433,6 +433,27 @@ static inline void unlock_cluster(struct swap_cluster_info *ci)
 	spin_unlock(&ci->lock);
 }
 
+static void cluster_move(struct swap_info_struct *si,
+			 struct swap_cluster_info *ci, struct list_head *list,
+			 enum swap_cluster_flags new_flags)
+{
+	VM_WARN_ON(ci->flags == new_flags);
+	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
+
+	if (ci->flags == CLUSTER_FLAG_NONE) {
+		list_add_tail(&ci->list, list);
+	} else {
+		if (ci->flags == CLUSTER_FLAG_FRAG) {
+			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
+			si->frag_cluster_nr[ci->order]--;
+		}
+		list_move_tail(&ci->list, list);
+	}
+	ci->flags = new_flags;
+	if (new_flags == CLUSTER_FLAG_FRAG)
+		si->frag_cluster_nr[ci->order]++;
+}
+
 /* Add a cluster to discard list and schedule it to do discard */
 static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 		struct swap_cluster_info *ci)
@@ -446,10 +467,8 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	 */
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
-
-	VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-	list_move_tail(&ci->list, &si->discard_clusters);
-	ci->flags = 0;
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
+	cluster_move(si, ci, &si->discard_clusters, CLUSTER_FLAG_DISCARD);
 	schedule_work(&si->discard_work);
 }
 
@@ -457,12 +476,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
 {
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
-
-	if (ci->flags)
-		list_move_tail(&ci->list, &si->free_clusters);
-	else
-		list_add_tail(&ci->list, &si->free_clusters);
-	ci->flags = CLUSTER_FLAG_FREE;
+	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
 
@@ -478,6 +492,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 	while (!list_empty(&si->discard_clusters)) {
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 		list_del(&ci->list);
+		/* Must clear flag when taking a cluster off-list */
+		ci->flags = CLUSTER_FLAG_NONE;
 		idx = cluster_index(si, ci);
 		spin_unlock(&si->lock);
 
@@ -518,9 +534,6 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
 
-	if (ci->flags & CLUSTER_FLAG_FRAG)
-		si->frag_cluster_nr[ci->order]--;
-
 	/*
 	 * If the swap is discardable, prepare discard the cluster
 	 * instead of free it immediately. The cluster will be freed
@@ -572,13 +585,9 @@ static void dec_cluster_info_page(struct swap_info_struct *si,
 		return;
 	}
 
-	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
-		VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-		if (ci->flags & CLUSTER_FLAG_FRAG)
-			si->frag_cluster_nr[ci->order]--;
-		list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]);
-		ci->flags = CLUSTER_FLAG_NONFULL;
-	}
+	if (ci->flags != CLUSTER_FLAG_NONFULL)
+		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
+			     CLUSTER_FLAG_NONFULL);
 }
 
 static bool cluster_reclaim_range(struct swap_info_struct *si,
@@ -657,11 +666,14 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 {
 	unsigned int nr_pages = 1 << order;
 
+	VM_BUG_ON(ci->flags != CLUSTER_FLAG_FREE &&
+		  ci->flags != CLUSTER_FLAG_NONFULL &&
+		  ci->flags != CLUSTER_FLAG_FRAG);
+
 	if (cluster_is_free(ci)) {
-		if (nr_pages < SWAPFILE_CLUSTER) {
-			list_move_tail(&ci->list, &si->nonfull_clusters[order]);
-			ci->flags = CLUSTER_FLAG_NONFULL;
-		}
+		if (nr_pages < SWAPFILE_CLUSTER)
+			cluster_move(si, ci, &si->nonfull_clusters[order],
+				     CLUSTER_FLAG_NONFULL);
 		ci->order = order;
 	}
 
@@ -669,14 +681,8 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
-	if (ci->count == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!(ci->flags &
-			  (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
-		if (ci->flags & CLUSTER_FLAG_FRAG)
-			si->frag_cluster_nr[ci->order]--;
-		list_move_tail(&ci->list, &si->full_clusters);
-		ci->flags = CLUSTER_FLAG_FULL;
-	}
+	if (ci->count == SWAPFILE_CLUSTER)
+		cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
 }
 
 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
@@ -806,9 +812,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		while (!list_empty(&si->nonfull_clusters[order])) {
 			ci = list_first_entry(&si->nonfull_clusters[order],
 					      struct swap_cluster_info, list);
-			list_move_tail(&ci->list, &si->frag_clusters[order]);
-			ci->flags = CLUSTER_FLAG_FRAG;
-			si->frag_cluster_nr[order]++;
+			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
 			frags++;
-- 
2.47.0




* [PATCH 09/13] mm, swap: reduce contention on device lock
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (7 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:24 ` [PATCH 10/13] mm, swap: simplify percpu cluster updating Kairui Song
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, swap locking is mainly composed of two locks: the cluster
lock (ci->lock) and the device lock (si->lock).

The cluster lock is much more fine-grained, so it is best to use
ci->lock instead of si->lock wherever possible.

Following the new cluster allocator design, many operations don't
need to touch si->lock at all. In practice, we only need to take
si->lock when moving clusters between lists.

To achieve this, this commit reworks the locking pattern of all si->lock
and ci->lock users, eliminates all usage of ci->lock inside si->lock,
and introduces a new design that avoids touching si->lock as much as
possible.

To minimize contention during allocation and to make the design easier
to understand, two ideas are introduced, `isolation` and `relocation`,
each with corresponding helpers:

- Clusters are `isolated` from their list when they are scanned for
  allocation, so scanning an on-list cluster no longer needs to hold
  si->lock except for the brief moment of isolation itself, which
  removes the ci->lock usage inside si->lock.

  In the new allocator design, a cluster always gets moved after scanning
  (free -> nonfull, nonfull -> frag, frag -> frag tail), so this
  introduces no extra overhead. It also greatly reduces contention on
  both si->lock and ci->lock, as other CPUs won't walk onto the same
  cluster by iterating the list.

  The off-list time window of a cluster is also minimal: one CPU can
  hold at most one cluster while scanning the 512 entries on it, a
  window during which we used to busy wait on a spin lock.

  This is done with `cluster_isolate_lock` when scanning a new cluster.

  Note: Scanning of the per-CPU cluster is a special case; it doesn't
  isolate the cluster. That's because it doesn't need to hold si->lock
  at all: it simply acquires the ci->lock of the previously used
  cluster and uses it.

- Clusters are `relocated` after allocation or freeing according to
  their count and status.

  Allocation no longer holds si->lock, and may drop ci->lock for
  reclaim, so the cluster could have been moved anywhere. Besides,
  `isolation` clears all flags when it takes the cluster off its list.
  (The flag must be in sync with list status, so cluster users don't
  need to touch si->lock to check the list status. This is important
  for reducing contention on si->lock.) So after allocation, the
  cluster has to be `relocated` to the right list according to its
  usage.

  This is done with `relocate_cluster` after allocation, or
  `[partial_]free_cluster` after freeing.
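
The two ideas above can be sketched in userspace as follows. This is a
simplified model under stated assumptions, not the code in this patch:
locks are modelled with C11 `atomic_flag`, list membership is modelled
by the status field alone, the nonfull list is omitted, and every
`model_` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS	512	/* entries per cluster, as in the text */
#define MODEL_CLUSTERS	4

enum model_list { OFF_LIST = 0, ON_FREE, ON_FRAG, ON_FULL };

struct model_cluster {
	atomic_flag lock;	/* stand-in for ci->lock */
	enum model_list flags;	/* written only with both locks held */
	unsigned int count;	/* slots in use */
};

struct model_device {
	atomic_flag lock;	/* stand-in for si->lock */
	struct model_cluster clusters[MODEL_CLUSTERS];
};

static void model_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin */
}

static bool model_trylock(atomic_flag *l)
{
	return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

static void model_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void model_device_init(struct model_device *dev)
{
	atomic_flag_clear(&dev->lock);
	for (size_t i = 0; i < MODEL_CLUSTERS; i++) {
		atomic_flag_clear(&dev->clusters[i].lock);
		dev->clusters[i].flags = ON_FREE;
		dev->clusters[i].count = 0;
	}
}

/*
 * Isolation: under the device lock, take the first uncontended cluster
 * with the wanted status off-list and return it locked, or NULL.
 * Only a trylock is used here because relocation nests the locks in
 * the opposite order (cluster lock first, then device lock).
 */
static struct model_cluster *model_isolate_lock(struct model_device *dev,
						enum model_list from)
{
	struct model_cluster *ret = NULL;

	model_lock(&dev->lock);
	for (size_t i = 0; i < MODEL_CLUSTERS; i++) {
		struct model_cluster *ci = &dev->clusters[i];

		if (ci->flags != from || !model_trylock(&ci->lock))
			continue;
		ci->flags = OFF_LIST;
		ret = ci;
		break;
	}
	model_unlock(&dev->lock);
	return ret;
}

/*
 * Relocation: with the cluster lock held, put the cluster on the list
 * matching its usage; the device lock is taken only for the move.
 */
static void model_relocate(struct model_device *dev,
			   struct model_cluster *ci)
{
	enum model_list to = ON_FRAG;

	if (ci->count == 0)
		to = ON_FREE;
	else if (ci->count == MODEL_SLOTS)
		to = ON_FULL;

	if (ci->flags == to)
		return;
	model_lock(&dev->lock);
	ci->flags = to;
	model_unlock(&dev->lock);
}

/* Allocate one slot: isolate, use the cluster, then relocate it. */
static bool model_alloc_one(struct model_device *dev)
{
	struct model_cluster *ci = model_isolate_lock(dev, ON_FREE);

	if (!ci)
		ci = model_isolate_lock(dev, ON_FRAG);
	if (!ci)
		return false;
	ci->count++;
	model_relocate(dev, ci);
	model_unlock(&ci->lock);
	return true;
}
```

In this model the device lock is held only for the short list walk and
the status update, never while a cluster's slots are being used, which
is the property the two helpers are meant to provide.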

Now, apart from swapon / swapoff and discard, `isolation` and
`relocation` are the only two places that need to take si->lock. And
since each CPU keeps using its per-CPU cluster as much as possible, and
a cluster has 512 entries to be consumed, si->lock is rarely touched.

Lock contention on si->lock is now barely observable. Building the
Linux kernel with defconfig shows a huge performance improvement:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before:
Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
Sys time: 36227.49, Real time: 476.66

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C:
(avg of 4 test runs)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333

After: (-40.4% sys time, -37.1% real time)
Sys time: 44160.56, Real time: 532.07
hugepages-64kB/stats/swpout: 1786288
hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before:
Sys time: 8098.21, Real time: 401.3
After: (-22.6% sys time, -12.8% real time)
Sys time: 6265.02, Real time: 349.83

The allocation success rate also improved slightly, as the usage of
clusters is now sanitized with newly defined helpers and locks, so
temporarily dropping si->lock or ci->lock won't shuffle cluster order.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   5 +-
 mm/swapfile.c        | 418 ++++++++++++++++++++++++-------------------
 2 files changed, 239 insertions(+), 184 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 75fc2da1767d..a3b5d74b095a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -265,6 +265,8 @@ enum swap_cluster_flags {
 	CLUSTER_FLAG_FREE,
 	CLUSTER_FLAG_NONFULL,
 	CLUSTER_FLAG_FRAG,
+	/* Clusters with flags above are allocatable */
+	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
 	CLUSTER_FLAG_FULL,
 	CLUSTER_FLAG_DISCARD,
 	CLUSTER_FLAG_MAX,
@@ -290,6 +292,7 @@ enum swap_cluster_flags {
  * throughput.
  */
 struct percpu_cluster {
+	local_lock_t lock; /* Protect the percpu_cluster above */
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -312,7 +315,7 @@ struct swap_info_struct {
 					/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
-	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
+	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 96d8012b003c..a19ee8d5ffd0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -260,12 +260,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	folio_ref_sub(folio, nr_pages);
 	folio_set_dirty(folio);
 
-	spin_lock(&si->lock);
 	/* Only sinple page folio can be backed by zswap */
 	if (nr_pages == 1)
 		zswap_invalidate(entry);
 	swap_entry_range_free(si, entry, nr_pages);
-	spin_unlock(&si->lock);
 	ret = nr_pages;
 out_unlock:
 	folio_unlock(folio);
@@ -402,7 +400,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags == CLUSTER_FLAG_FREE;
+	return info->count == 0;
+}
+
+static inline bool cluster_is_discard(struct swap_cluster_info *info)
+{
+	return info->flags == CLUSTER_FLAG_DISCARD;
+}
+
+static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
+{
+	if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
+		return false;
+	if (!order)
+		return true;
+	return cluster_is_free(ci) || order == ci->order;
 }
 
 static inline unsigned int cluster_index(struct swap_info_struct *si,
@@ -439,19 +451,20 @@ static void cluster_move(struct swap_info_struct *si,
 {
 	VM_WARN_ON(ci->flags == new_flags);
 	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
+	lockdep_assert_held(&ci->lock);
 
-	if (ci->flags == CLUSTER_FLAG_NONE) {
+	spin_lock(&si->lock);
+	if (ci->flags == CLUSTER_FLAG_NONE)
 		list_add_tail(&ci->list, list);
-	} else {
-		if (ci->flags == CLUSTER_FLAG_FRAG) {
-			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
-			si->frag_cluster_nr[ci->order]--;
-		}
+	else
 		list_move_tail(&ci->list, list);
-	}
+	spin_unlock(&si->lock);
+
+	if (ci->flags == CLUSTER_FLAG_FRAG)
+		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
+	else if (new_flags == CLUSTER_FLAG_FRAG)
+		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
-	if (new_flags == CLUSTER_FLAG_FRAG)
-		si->frag_cluster_nr[ci->order]++;
 }
 
 /* Add a cluster to discard list and schedule it to do discard */
@@ -474,39 +487,82 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
 	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
 
+/*
+ * Isolate and lock the first cluster that is not contended on a list,
+ * clearing its flag when it is taken off-list. Cluster flag must be in
+ * sync with list status, so cluster updaters can always know the cluster
+ * list status without touching si lock.
+ *
+ * Note it's possible that all clusters on a list are contended, so
+ * this may return NULL for a non-empty list.
+ */
+static struct swap_cluster_info *cluster_isolate_lock(
+		struct swap_info_struct *si, struct list_head *list)
+{
+	struct swap_cluster_info *ci, *ret = NULL;
+
+	spin_lock(&si->lock);
+	list_for_each_entry(ci, list, list) {
+		if (!spin_trylock(&ci->lock))
+			continue;
+
+		/* We may only isolate and clear flags of following lists */
+		VM_BUG_ON(!ci->flags);
+		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
+			  ci->flags != CLUSTER_FLAG_FULL);
+
+		list_del(&ci->list);
+		ci->flags = CLUSTER_FLAG_NONE;
+		ret = ci;
+		break;
+	}
+	spin_unlock(&si->lock);
+
+	return ret;
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
- * will be added to free cluster list. caller should hold si->lock.
-*/
-static void swap_do_scheduled_discard(struct swap_info_struct *si)
+ * will be added to free cluster list. Discard cluster is a bit special as
+ * they don't participate in allocation or reclaim, so clusters marked as
+ * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
+ */
+static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 {
 	struct swap_cluster_info *ci;
+	bool ret = false;
 	unsigned int idx;
 
+	spin_lock(&si->lock);
 	while (!list_empty(&si->discard_clusters)) {
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+		/*
+		 * Delete the cluster from list but don't clear the flag until
+		 * discard is done, so isolation and relocation will skip it.
+		 */
 		list_del(&ci->list);
-		/* Must clear flag when taking a cluster off-list */
-		ci->flags = CLUSTER_FLAG_NONE;
 		idx = cluster_index(si, ci);
 		spin_unlock(&si->lock);
-
 		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
 				SWAPFILE_CLUSTER);
 
-		spin_lock(&si->lock);
 		spin_lock(&ci->lock);
-		__free_cluster(si, ci);
+		/* Discard is done, return to list and clear the flag */
+		ci->flags = CLUSTER_FLAG_NONE;
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
+		__free_cluster(si, ci);
 		spin_unlock(&ci->lock);
+		ret = true;
+		spin_lock(&si->lock);
 	}
+	spin_unlock(&si->lock);
+	return ret;
 }
 
 static void swap_discard_work(struct work_struct *work)
@@ -515,9 +571,7 @@ static void swap_discard_work(struct work_struct *work)
 
 	si = container_of(work, struct swap_info_struct, discard_work);
 
-	spin_lock(&si->lock);
 	swap_do_scheduled_discard(si);
-	spin_unlock(&si->lock);
 }
 
 static void swap_users_ref_free(struct percpu_ref *ref)
@@ -528,10 +582,14 @@ static void swap_users_ref_free(struct percpu_ref *ref)
 	complete(&si->comp);
 }
 
+/*
+ * Must be called after freeing if ci->count == 0, puts the cluster to free
+ * or discard list.
+ */
 static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
 	VM_BUG_ON(ci->count != 0);
-	lockdep_assert_held(&si->lock);
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
 	lockdep_assert_held(&ci->lock);
 
 	/*
@@ -548,6 +606,48 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 	__free_cluster(si, ci);
 }
 
+/*
+ * Must be called after freeing if ci->count != 0, puts the cluster to free
+ * or nonfull list.
+ */
+static void partial_free_cluster(struct swap_info_struct *si,
+				 struct swap_cluster_info *ci)
+{
+	VM_BUG_ON(!ci->count || ci->count == SWAPFILE_CLUSTER);
+	lockdep_assert_held(&ci->lock);
+
+	if (ci->flags != CLUSTER_FLAG_NONFULL)
+		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
+			     CLUSTER_FLAG_NONFULL);
+}
+
+/*
+ * Must be called after allocation, puts the cluster to full or frag list.
+ * Note: allocation doesn't need si lock, and may drop the ci lock for
+ * reclaim, so the cluster could end up anywhere before re-acquiring ci lock.
+ */
+static void relocate_cluster(struct swap_info_struct *si,
+			     struct swap_cluster_info *ci)
+{
+	lockdep_assert_held(&ci->lock);
+
+	/* Discard cluster must remain off-list or on discard list */
+	if (cluster_is_discard(ci))
+		return;
+
+	if (!ci->count) {
+		free_cluster(si, ci);
+	} else if (ci->count != SWAPFILE_CLUSTER) {
+		if (ci->flags != CLUSTER_FLAG_FRAG)
+			cluster_move(si, ci, &si->frag_clusters[ci->order],
+				     CLUSTER_FLAG_FRAG);
+	} else {
+		if (ci->flags != CLUSTER_FLAG_FULL)
+			cluster_move(si, ci, &si->full_clusters,
+				     CLUSTER_FLAG_FULL);
+	}
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will not be
  * added to free cluster list and its usage counter will be increased by 1.
@@ -566,30 +666,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 	VM_BUG_ON(ci->flags);
 }
 
-/*
- * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
- * which means no page in the cluster is in use, we can optionally discard
- * the cluster and add it to free cluster list.
- */
-static void dec_cluster_info_page(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci, int nr_pages)
-{
-	VM_BUG_ON(ci->count < nr_pages);
-	VM_BUG_ON(cluster_is_free(ci));
-	lockdep_assert_held(&si->lock);
-	lockdep_assert_held(&ci->lock);
-	ci->count -= nr_pages;
-
-	if (!ci->count) {
-		free_cluster(si, ci);
-		return;
-	}
-
-	if (ci->flags != CLUSTER_FLAG_NONFULL)
-		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
-			     CLUSTER_FLAG_NONFULL);
-}
-
 static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
 				  unsigned long start, unsigned long end)
@@ -599,8 +675,6 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
-	spin_unlock(&si->lock);
-
 	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
@@ -618,9 +692,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 		}
 	} while (offset < end);
 out:
-	spin_lock(&si->lock);
 	spin_lock(&ci->lock);
-
 	/*
 	 * Recheck the range no matter reclaim succeeded or not, the slot
 	 * could have been be freed while we are not holding the lock.
@@ -634,11 +706,11 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 
 static bool cluster_scan_range(struct swap_info_struct *si,
 			       struct swap_cluster_info *ci,
-			       unsigned long start, unsigned int nr_pages)
+			       unsigned long start, unsigned int nr_pages,
+			       bool *need_reclaim)
 {
 	unsigned long offset, end = start + nr_pages;
 	unsigned char *map = si->swap_map;
-	bool need_reclaim = false;
 
 	for (offset = start; offset < end; offset++) {
 		switch (READ_ONCE(map[offset])) {
@@ -647,16 +719,13 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 		case SWAP_HAS_CACHE:
 			if (!vm_swap_full())
 				return false;
-			need_reclaim = true;
+			*need_reclaim = true;
 			continue;
 		default:
 			return false;
 		}
 	}
 
-	if (need_reclaim)
-		return cluster_reclaim_range(si, ci, start, end);
-
 	return true;
 }
 
@@ -666,23 +735,12 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 {
 	unsigned int nr_pages = 1 << order;
 
-	VM_BUG_ON(ci->flags != CLUSTER_FLAG_FREE &&
-		  ci->flags != CLUSTER_FLAG_NONFULL &&
-		  ci->flags != CLUSTER_FLAG_FRAG);
-
-	if (cluster_is_free(ci)) {
-		if (nr_pages < SWAPFILE_CLUSTER)
-			cluster_move(si, ci, &si->nonfull_clusters[order],
-				     CLUSTER_FLAG_NONFULL);
+	if (cluster_is_free(ci))
 		ci->order = order;
-	}
 
 	memset(si->swap_map + start, usage, nr_pages);
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
-
-	if (ci->count == SWAPFILE_CLUSTER)
-		cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
 }
 
 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
@@ -692,34 +750,52 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 	unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
+	bool need_reclaim, ret;
 	struct swap_cluster_info *ci;
 
-	if (end < nr_pages)
-		return SWAP_NEXT_INVALID;
-	end -= nr_pages;
+	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	lockdep_assert_held(&ci->lock);
 
-	ci = lock_cluster(si, offset);
-	if (ci->count + nr_pages > SWAPFILE_CLUSTER) {
+	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) {
 		offset = SWAP_NEXT_INVALID;
-		goto done;
+		goto out;
 	}
 
-	while (offset <= end) {
-		if (cluster_scan_range(si, ci, offset, nr_pages)) {
-			cluster_alloc_range(si, ci, offset, usage, order);
-			*foundp = offset;
-			if (ci->count == SWAPFILE_CLUSTER) {
+	for (end -= nr_pages; offset <= end; offset += nr_pages) {
+		need_reclaim = false;
+		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
+			continue;
+		if (need_reclaim) {
+			ret = cluster_reclaim_range(si, ci, start, end);
+			/*
+			 * Reclaim drops ci->lock and cluster could be used
+			 * by another order. Not checking flag as off-list
+			 * cluster has no flag set, and change of list
+			 * won't cause fragmentation.
+			 */
+			if (!cluster_is_usable(ci, order)) {
 				offset = SWAP_NEXT_INVALID;
-				goto done;
+				goto out;
 			}
-			offset += nr_pages;
-			break;
+			if (cluster_is_free(ci))
+				offset = start;
+			/* Reclaim failed but cluster is usable, try next */
+			if (!ret)
+				continue;
+		}
+		cluster_alloc_range(si, ci, offset, usage, order);
+		*foundp = offset;
+		if (ci->count == SWAPFILE_CLUSTER) {
+			offset = SWAP_NEXT_INVALID;
+			goto out;
 		}
 		offset += nr_pages;
+		break;
 	}
 	if (offset > end)
 		offset = SWAP_NEXT_INVALID;
-done:
+out:
+	relocate_cluster(si, ci);
 	unlock_cluster(ci);
 	return offset;
 }
@@ -736,18 +812,17 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	if (force)
 		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
-	while (!list_empty(&si->full_clusters)) {
-		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
-		list_move_tail(&ci->list, &si->full_clusters);
+	while ((ci = cluster_isolate_lock(si, &si->full_clusters))) {
 		offset = cluster_offset(si, ci);
 		end = min(si->max, offset + SWAPFILE_CLUSTER);
 		to_scan--;
 
-		spin_unlock(&si->lock);
 		while (offset < end) {
 			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY | TTRS_DIRECT);
+				spin_lock(&ci->lock);
 				if (nr_reclaim) {
 					offset += abs(nr_reclaim);
 					continue;
@@ -755,8 +830,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 			}
 			offset++;
 		}
-		spin_lock(&si->lock);
 
+		unlock_cluster(ci);
 		if (to_scan <= 0)
 			break;
 	}
@@ -768,9 +843,7 @@ static void swap_reclaim_work(struct work_struct *work)
 
 	si = container_of(work, struct swap_info_struct, reclaim_work);
 
-	spin_lock(&si->lock);
 	swap_reclaim_full_clusters(si, true);
-	spin_unlock(&si->lock);
 }
 
 /*
@@ -781,23 +854,36 @@ static void swap_reclaim_work(struct work_struct *work)
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
 {
-	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-new_cluster:
-	lockdep_assert_held(&si->lock);
-	cluster = this_cpu_ptr(si->percpu_cluster);
-	offset = cluster->next[order];
+	/* Fast path using per CPU cluster */
+	local_lock(&si->percpu_cluster->lock);
+	offset = __this_cpu_read(si->percpu_cluster->next[order]);
 	if (offset) {
-		offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
+		ci = lock_cluster(si, offset);
+		/* Cluster could have been used by another order */
+		if (cluster_is_usable(ci, order)) {
+			if (cluster_is_free(ci))
+				offset = cluster_offset(si, ci);
+			offset = alloc_swap_scan_cluster(si, offset, &found,
+							 order, usage);
+		} else {
+			unlock_cluster(ci);
+		}
 		if (found)
 			goto done;
 	}
 
-	if (!list_empty(&si->free_clusters)) {
-		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
-		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
+new_cluster:
+	ci = cluster_isolate_lock(si, &si->free_clusters);
+	if (ci) {
+		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+						 &found, order, usage);
+		/*
+		 * Allocation from free cluster must never fail and
+		 * cluster lock must remain untouched.
+		 */
 		VM_BUG_ON(!found);
 		goto done;
 	}
@@ -807,49 +893,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0;
+		unsigned int frags = 0, frags_existing;
 
-		while (!list_empty(&si->nonfull_clusters[order])) {
-			ci = list_first_entry(&si->nonfull_clusters[order],
-					      struct swap_cluster_info, list);
-			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
+		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
-			frags++;
+			/*
+			 * With `fragmenting` set to true, it will surely take
+			 * the cluster off nonfull list
+			 */
 			if (found)
 				goto done;
+			frags++;
 		}
 
-		/*
-		 * Nonfull clusters are moved to frag tail if we reached
-		 * here, count them too, don't over scan the frag list.
-		 */
-		while (frags < si->frag_cluster_nr[order]) {
-			ci = list_first_entry(&si->frag_clusters[order],
-					      struct swap_cluster_info, list);
+		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
+		while (frags < frags_existing &&
+		       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
+			atomic_long_dec(&si->frag_cluster_nr[order]);
 			/*
-			 * Rotate the frag list to iterate, they were all failing
-			 * high order allocation or moved here due to per-CPU usage,
-			 * this help keeping usable cluster ahead.
+			 * Rotate the frag list to iterate, they were all
+			 * failing high order allocation or moved here due to
+			 * per-CPU usage, but either way they could contain
+			 * usable (eg. lazy-freed swap cache) slots.
 			 */
-			list_move_tail(&ci->list, &si->frag_clusters[order]);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
-			frags++;
 			if (found)
 				goto done;
+			frags++;
 		}
 	}
 
-	if (!list_empty(&si->discard_clusters)) {
-		/*
-		 * we don't have free cluster but have some clusters in
-		 * discarding, do discard now and reclaim them, then
-		 * reread cluster_next_cpu since we dropped si->lock
-		 */
-		swap_do_scheduled_discard(si);
+	/*
+	 * We don't have free cluster but have some clusters in
+	 * discarding, do discard now and reclaim them, then
+	 * reread cluster_next_cpu since we dropped si->lock
+	 */
+	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
 		goto new_cluster;
-	}
 
 	if (order)
 		goto done;
@@ -860,26 +942,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		 * Clusters here have at least one usable slots and can't fail order 0
 		 * allocation, but reclaim may drop si->lock and race with another user.
 		 */
-		while (!list_empty(&si->frag_clusters[o])) {
-			ci = list_first_entry(&si->frag_clusters[o],
-					      struct swap_cluster_info, list);
+		while ((ci = cluster_isolate_lock(si, &si->frag_clusters[o]))) {
+			atomic_long_dec(&si->frag_cluster_nr[o]);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, 0, usage);
+							 &found, order, usage);
 			if (found)
 				goto done;
 		}
 
-		while (!list_empty(&si->nonfull_clusters[o])) {
-			ci = list_first_entry(&si->nonfull_clusters[o],
-					      struct swap_cluster_info, list);
+		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[o]))) {
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, 0, usage);
+							 &found, order, usage);
 			if (found)
 				goto done;
 		}
 	}
 done:
-	cluster->next[order] = offset;
+	__this_cpu_write(si->percpu_cluster->next[order], offset);
+	local_unlock(&si->percpu_cluster->lock);
+
 	return found;
 }
 
@@ -1135,14 +1216,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			spin_lock(&si->lock);
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 					n_goal, swp_entries, order);
-			spin_unlock(&si->lock);
 			put_swap_device(si);
 			if (n_ret || size > 1)
 				goto check_out;
-			cond_resched();
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1355,9 +1433,7 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	if (!has_cache) {
 		for (i = 0; i < nr; i++)
 			zswap_invalidate(swp_entry(si->type, offset + i));
-		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, nr);
-		spin_unlock(&si->lock);
 	}
 	return has_cache;
 
@@ -1386,16 +1462,27 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 	unsigned char *map_end = map + nr_pages;
 	struct swap_cluster_info *ci;
 
+	/* It should never free entries across different clusters */
+	VM_BUG_ON((offset / SWAPFILE_CLUSTER) != ((offset + nr_pages - 1) / SWAPFILE_CLUSTER));
+
 	ci = lock_cluster(si, offset);
+	VM_BUG_ON(cluster_is_free(ci));
+	VM_BUG_ON(ci->count < nr_pages);
+
+	ci->count -= nr_pages;
 	do {
 		VM_BUG_ON(*map != SWAP_HAS_CACHE);
 		*map = 0;
 	} while (++map < map_end);
-	dec_cluster_info_page(si, ci, nr_pages);
-	unlock_cluster(ci);
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
+
+	if (!ci->count)
+		free_cluster(si, ci);
+	else
+		partial_free_cluster(si, ci);
+	unlock_cluster(ci);
 }
 
 static void cluster_swap_free_nr(struct swap_info_struct *si,
@@ -1467,9 +1554,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	ci = lock_cluster(si, offset);
 	if (size > 1 && swap_is_has_cache(si, offset, size)) {
 		unlock_cluster(ci);
-		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, size);
-		spin_unlock(&si->lock);
 		return;
 	}
 	for (int i = 0; i < size; i++, entry.val++) {
@@ -1484,46 +1569,19 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster(ci);
 }
 
-static int swp_entry_cmp(const void *ent1, const void *ent2)
-{
-	const swp_entry_t *e1 = ent1, *e2 = ent2;
-
-	return (int)swp_type(*e1) - (int)swp_type(*e2);
-}
-
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
-	struct swap_info_struct *si, *prev;
 	int i;
+	struct swap_info_struct *si = NULL;
 
 	if (n <= 0)
 		return;
 
-	prev = NULL;
-	si = NULL;
-
-	/*
-	 * Sort swap entries by swap device, so each lock is only taken once.
-	 * nr_swapfiles isn't absolutely correct, but the overhead of sort() is
-	 * so low that it isn't necessary to optimize further.
-	 */
-	if (nr_swapfiles > 1)
-		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
 		si = _swap_info_get(entries[i]);
-
-		if (si != prev) {
-			if (prev != NULL)
-				spin_unlock(&prev->lock);
-			if (si != NULL)
-				spin_lock(&si->lock);
-		}
 		if (si)
 			swap_entry_range_free(si, entries[i], 1);
-		prev = si;
 	}
-	if (si)
-		spin_unlock(&si->lock);
 }
 
 int __swap_count(swp_entry_t entry)
@@ -1775,13 +1833,8 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	if (get_swap_device_info(si)) {
-		spin_lock(&si->lock);
-		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-			atomic_long_dec(&nr_swap_pages);
-		spin_unlock(&si->lock);
-		put_swap_device(si);
-	}
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+		atomic_long_dec(&nr_swap_pages);
 fail:
 	return entry;
 }
@@ -3098,6 +3151,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
 			cluster->next[i] = SWAP_NEXT_INVALID;
+		local_lock_init(&cluster->lock);
 	}
 
 	/*
@@ -3121,7 +3175,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		si->frag_cluster_nr[i] = 0;
+		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
@@ -3603,7 +3657,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 		 */
 		goto outer;
 	}
-	spin_lock(&si->lock);
 
 	offset = swp_offset(entry);
 
@@ -3668,7 +3721,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	spin_unlock(&si->cont_lock);
 out:
 	unlock_cluster(ci);
-	spin_unlock(&si->lock);
 	put_swap_device(si);
 outer:
 	if (page)
-- 
2.47.0




* [PATCH 10/13] mm, swap: simplify percpu cluster updating
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (8 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 09/13] mm, swap: reduce contention on device lock Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:24 ` [PATCH 11/13] mm, swap: introduce a helper for retrieving cluster from offset Kairui Song
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Instead of using a return argument, we can simply store the next
cluster offset in the fixed percpu location, which reduces stack
usage and simplifies the function:

Object size:
./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271)
Function                                     old     new   delta
get_swap_pages                              2847    2733    -114
alloc_swap_scan_cluster                      894     737    -157
Total: Before=30833, After=30562, chg -0.88%

Stack usage:
Before:
swapfile.c:1190:5:get_swap_pages       240    static

After:
swapfile.c:1185:5:get_swap_pages       216    static
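
The change in calling convention can be illustrated with a minimal
userspace sketch. It is not the kernel code: next_hint, slot_used,
CLUSTER_SIZE and scan_cluster() are hypothetical names, and a plain
static variable stands in for the percpu next[] slot. The scan returns
the allocated offset directly and records the hint for the next scan
in the fixed location, so no output argument is needed:

```c
#include <assert.h>

#define CLUSTER_SIZE  8u	/* hypothetical, stands in for SWAPFILE_CLUSTER */
#define ENTRY_INVALID 0u	/* mirrors SWAP_ENTRY_INVALID: offset 0 is never handed out */

/* Hypothetical stand-in for the percpu next[order] hint. */
static unsigned int next_hint;

/* Usage map of one toy cluster; slot 0 plays the role of the swap header. */
static unsigned char slot_used[CLUSTER_SIZE] = { 1 };

/*
 * Scan forward from 'offset' for a free slot and take it.
 * Returns the allocated offset, or ENTRY_INVALID if nothing was found.
 * The hint for the next scan is always written to next_hint, so callers
 * do not have to pass a pointer just to get it back.
 */
static unsigned int scan_cluster(unsigned int offset)
{
	unsigned int next = ENTRY_INVALID, found = ENTRY_INVALID;

	assert(offset != ENTRY_INVALID);

	for (; offset < CLUSTER_SIZE; offset++) {
		if (slot_used[offset])
			continue;
		slot_used[offset] = 1;
		found = offset;
		/* Only keep a hint if the cluster still has room after this slot. */
		if (offset + 1 < CLUSTER_SIZE)
			next = offset + 1;
		break;
	}
	next_hint = next;
	return found;
}
```

Every exit path leaves next_hint either valid or ENTRY_INVALID, which
is why the early "goto out" branches in the real function no longer
need to reset the offset by hand.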

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  4 ++--
 mm/swapfile.c        | 57 ++++++++++++++++++++------------------------
 2 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a3b5d74b095a..0e6c6bb385f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -276,9 +276,9 @@ enum swap_cluster_flags {
  * The first page in the swap file is the swap header, which is always marked
  * bad to prevent it from being allocated as an entry. This also prevents the
  * cluster to which it belongs being marked free. Therefore 0 is safe to use as
- * a sentinel to indicate next is not valid in percpu_cluster.
+ * a sentinel to indicate an entry is not valid.
  */
-#define SWAP_NEXT_INVALID	0
+#define SWAP_ENTRY_INVALID	0
 
 #ifdef CONFIG_THP_SWAP
 #define SWAP_NR_ORDERS		(PMD_ORDER + 1)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a19ee8d5ffd0..f529e2ce2019 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -743,11 +743,14 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	ci->count += nr_pages;
 }
 
-static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
-					    unsigned int *foundp, unsigned int order,
+/* Try use a new cluster for current CPU and allocate from it. */
+static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
+					    unsigned long offset,
+					    unsigned int order,
 					    unsigned char usage)
 {
-	unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
+	unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
+	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
 	bool need_reclaim, ret;
@@ -756,10 +759,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
 	lockdep_assert_held(&ci->lock);
 
-	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) {
-		offset = SWAP_NEXT_INVALID;
+	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
 		goto out;
-	}
 
 	for (end -= nr_pages; offset <= end; offset += nr_pages) {
 		need_reclaim = false;
@@ -773,10 +774,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 			 * cluster has no flag set, and change of list
 			 * won't cause fragmentation.
 			 */
-			if (!cluster_is_usable(ci, order)) {
-				offset = SWAP_NEXT_INVALID;
+			if (!cluster_is_usable(ci, order))
 				goto out;
-			}
 			if (cluster_is_free(ci))
 				offset = start;
 			/* Reclaim failed but cluster is usable, try next */
@@ -784,20 +783,17 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 				continue;
 		}
 		cluster_alloc_range(si, ci, offset, usage, order);
-		*foundp = offset;
-		if (ci->count == SWAPFILE_CLUSTER) {
-			offset = SWAP_NEXT_INVALID;
-			goto out;
-		}
+		found = offset;
 		offset += nr_pages;
+		if (ci->count < SWAPFILE_CLUSTER && offset <= end)
+			next = offset;
 		break;
 	}
-	if (offset > end)
-		offset = SWAP_NEXT_INVALID;
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	return offset;
+	__this_cpu_write(si->percpu_cluster->next[order], next);
+	return found;
 }
 
 /* Return true if reclaimed a whole cluster */
@@ -866,8 +862,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		if (cluster_is_usable(ci, order)) {
 			if (cluster_is_free(ci))
 				offset = cluster_offset(si, ci);
-			offset = alloc_swap_scan_cluster(si, offset, &found,
-							 order, usage);
+			found = alloc_swap_scan_cluster(si, offset,
+							order, usage);
 		} else {
 			unlock_cluster(ci);
 		}
@@ -878,8 +874,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 new_cluster:
 	ci = cluster_isolate_lock(si, &si->free_clusters);
 	if (ci) {
-		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-						 &found, order, usage);
+		found = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+						order, usage);
 		/*
 		 * Allocation from free cluster must never fail and
 		 * cluster lock must remain untouched.
@@ -896,8 +892,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		unsigned int frags = 0, frags_existing;
 
 		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							order, usage);
 			/*
 			 * With `fragmenting` set to true, it will surely take
 			 * the cluster off nonfull list
@@ -917,8 +913,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			 * per-CPU usage, but either way they could contain
 			 * usable (eg. lazy-freed swap cache) slots.
 			 */
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							order, usage);
 			if (found)
 				goto done;
 			frags++;
@@ -944,21 +940,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		 */
 		while ((ci = cluster_isolate_lock(si, &si->frag_clusters[o]))) {
 			atomic_long_dec(&si->frag_cluster_nr[o]);
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							0, usage);
 			if (found)
 				goto done;
 		}
 
 		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[o]))) {
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							0, usage);
 			if (found)
 				goto done;
 		}
 	}
 done:
-	__this_cpu_write(si->percpu_cluster->next[order], offset);
 	local_unlock(&si->percpu_cluster->lock);
 
 	return found;
@@ -3150,7 +3145,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 
 		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cluster->next[i] = SWAP_NEXT_INVALID;
+			cluster->next[i] = SWAP_ENTRY_INVALID;
 		local_lock_init(&cluster->lock);
 	}
 
-- 
2.47.0




* [PATCH 11/13] mm, swap: introduce a helper for retrieving cluster from offset
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (9 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 10/13] mm, swap: simplify percpu cluster updating Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:24 ` [PATCH 12/13] mm, swap: use a global swap cluster for non-rotation device Kairui Song
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Retrieving the cluster info from an offset is a common operation, so
introduce a helper for it.
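
As a rough userspace sketch of what the helper does and how it makes
the "range stays within one cluster" check easier to read: struct
swap_dev, struct cluster_info, NR_CLUSTERS and range_in_one_cluster()
below are hypothetical stand-ins, and the SWAPFILE_CLUSTER value is
only assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SWAPFILE_CLUSTER 512UL	/* assumed value, for illustration only */
#define NR_CLUSTERS      4	/* hypothetical toy device size */

struct cluster_info {
	unsigned int count;
};

struct swap_dev {
	struct cluster_info cluster_info[NR_CLUSTERS];
};

/* Map a swap offset to the cluster that contains it. */
static struct cluster_info *offset_to_cluster(struct swap_dev *si,
					      unsigned long offset)
{
	assert(offset / SWAPFILE_CLUSTER < NR_CLUSTERS);
	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}

/* True if [offset, offset + nr_pages) does not cross a cluster boundary. */
static bool range_in_one_cluster(struct swap_dev *si, unsigned long offset,
				 unsigned long nr_pages)
{
	return offset_to_cluster(si, offset) ==
	       offset_to_cluster(si, offset + nr_pages - 1);
}
```

Comparing the two cluster pointers is equivalent to comparing the two
integer divisions, but it keeps the index arithmetic in one place.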

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f529e2ce2019..f25d697f6736 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -423,6 +423,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
 	return ci - si->cluster_info;
 }
 
+static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
+							  unsigned long offset)
+{
+	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+}
+
 static inline unsigned int cluster_offset(struct swap_info_struct *si,
 					  struct swap_cluster_info *ci)
 {
@@ -434,7 +440,7 @@ static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si
 {
 	struct swap_cluster_info *ci;
 
-	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	ci = offset_to_cluster(si, offset);
 	spin_lock(&ci->lock);
 
 	return ci;
@@ -756,7 +762,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 	bool need_reclaim, ret;
 	struct swap_cluster_info *ci;
 
-	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	ci = offset_to_cluster(si, offset);
 	lockdep_assert_held(&ci->lock);
 
 	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
@@ -1457,10 +1463,10 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 	unsigned char *map_end = map + nr_pages;
 	struct swap_cluster_info *ci;
 
-	/* It should never free entries across different clusters */
-	VM_BUG_ON((offset / SWAPFILE_CLUSTER) != ((offset + nr_pages - 1) / SWAPFILE_CLUSTER));
-
 	ci = lock_cluster(si, offset);
+
+	/* It should never free entries across different clusters */
+	VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
 	VM_BUG_ON(cluster_is_free(ci));
 	VM_BUG_ON(ci->count < nr_pages);
 
-- 
2.47.0




* [PATCH 12/13] mm, swap: use a global swap cluster for non-rotation device
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (10 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 11/13] mm, swap: introduce a helper for retrieving cluster from offset Kairui Song
@ 2024-10-22 19:24 ` Kairui Song
  2024-10-22 19:37 ` [PATCH 13/13] mm, swap_slots: remove slot cache for freeing path Kairui Song
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Non-rotating (SSD / ZRAM) devices can tolerate fragmentation, so the goal
of the SWAP allocator there is to avoid contention on clusters. It
therefore uses a per-CPU cluster design, and each CPU uses a different
cluster as much as possible.

But HDDs are very sensitive to fragmentation, and contention is trivial
compared to this. So just use one global cluster instead. This ensures
each order will be writing to the same cluster as much as possible, which
helps to make the IO more continuous.

This ensures the performance of the cluster allocator is as good as the
old allocator. Test results after this commit, compared to before this
series:

make -j32 with tinyconfig, using 1G memcg limit and HDD swap:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxresident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxresident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults
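
The selection between the per-CPU and the global hint can be sketched
in userspace as follows. This is only an illustration: struct toy_dev,
struct cluster_hint, FLAG_SOLIDSTATE, NR_CPUS, NR_ORDERS and
next_slot() are hypothetical, and the locking (local_lock for the
per-CPU case, global_cluster_lock for the global case) is left out.

```c
#include <assert.h>

#define NR_CPUS         4	/* hypothetical */
#define NR_ORDERS       3	/* hypothetical, stands in for SWAP_NR_ORDERS */
#define FLAG_SOLIDSTATE 0x1u	/* stand-in for SWP_SOLIDSTATE */

struct cluster_hint {
	unsigned int next[NR_ORDERS];
};

struct toy_dev {
	unsigned int flags;
	struct cluster_hint percpu[NR_CPUS];	/* one hint set per CPU */
	struct cluster_hint global;		/* one hint set shared by all CPUs */
};

/*
 * Pick the location holding the next allocation offset for 'order'.
 * Solid-state devices spread CPUs over different clusters; rotating
 * devices share one hint so each order keeps writing to the same cluster.
 */
static unsigned int *next_slot(struct toy_dev *si, int cpu, int order)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(order >= 0 && order < NR_ORDERS);

	if (si->flags & FLAG_SOLIDSTATE)
		return &si->percpu[cpu].next[order];
	return &si->global.next[order];
}
```

With the flag clear, every CPU resolves to the same slot for a given
order, which is what keeps HDD writes for that order in one cluster.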

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  2 ++
 mm/swapfile.c        | 48 ++++++++++++++++++++++++++++++++------------
 2 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0e6c6bb385f0..9898b1881d4d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -319,6 +319,8 @@ struct swap_info_struct {
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+	struct percpu_cluster *global_cluster; /* Use one global cluster for rotating device */
+	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f25d697f6736..6eb298a222c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -798,7 +798,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	__this_cpu_write(si->percpu_cluster->next[order], next);
+	if (si->flags & SWP_SOLIDSTATE)
+		__this_cpu_write(si->percpu_cluster->next[order], next);
+	else
+		si->global_cluster->next[order] = next;
 	return found;
 }
 
@@ -860,8 +863,14 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	unsigned int offset, found = 0;
 
 	/* Fast path using per CPU cluster */
-	local_lock(&si->percpu_cluster->lock);
-	offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	if (si->flags & SWP_SOLIDSTATE) {
+		local_lock(&si->percpu_cluster->lock);
+		offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	} else {
+		spin_lock(&si->global_cluster_lock);
+		offset = si->global_cluster->next[order];
+	}
+
 	if (offset) {
 		ci = lock_cluster(si, offset);
 		/* Cluster could have been used by another order */
@@ -960,8 +969,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		}
 	}
 done:
-	local_unlock(&si->percpu_cluster->lock);
-
+	if (si->flags & SWP_SOLIDSTATE)
+		local_unlock(&si->percpu_cluster->lock);
+	else
+		spin_unlock(&si->global_cluster_lock);
 	return found;
 }
 
@@ -2737,6 +2748,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
+	kfree(p->global_cluster);
+	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
 	kvfree(cluster_info);
@@ -3142,17 +3155,24 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-	if (!si->percpu_cluster)
-		goto err_free;
+	if (si->flags & SWP_SOLIDSTATE) {
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err_free;
 
-	for_each_possible_cpu(cpu) {
-		struct percpu_cluster *cluster;
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
 
-		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+	} else {
+		si->global_cluster = kmalloc(sizeof(*si->global_cluster), GFP_KERNEL);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cluster->next[i] = SWAP_ENTRY_INVALID;
-		local_lock_init(&cluster->lock);
+			si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
+		spin_lock_init(&si->global_cluster_lock);
 	}
 
 	/*
@@ -3426,6 +3446,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
+	kfree(si->global_cluster);
+	si->global_cluster = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.0




* [PATCH 13/13] mm, swap_slots: remove slot cache for freeing path
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (11 preceding siblings ...)
  2024-10-22 19:24 ` [PATCH 12/13] mm, swap: use a global swap cluster for non-rotation device Kairui Song
@ 2024-10-22 19:37 ` Kairui Song
  2024-10-23  2:24 ` [PATCH 00/13] mm, swap: rework of swap allocator locks Huang, Ying
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-22 19:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

The slot cache for the freeing path mostly exists to reduce the overhead
of si->lock. As we have basically eliminated si->lock usage on the
freeing path, the cache can simply be removed.

This helps simplify the code, and avoids swap entries being held in the
cache upon freeing. The delayed freeing of entries has been causing
trouble for further optimizations for zswap [1], and in theory it also
causes more fragmentation and extra overhead.
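
The effect of the delayed freeing can be seen in a small userspace
sketch. It is a toy model, not the kernel code: swap_map, ret_cache,
NR_SLOTS, CACHE_SIZE and the three functions are hypothetical names,
and locking is omitted. An entry freed through the cache stays marked
as in use until the cache is flushed, while a direct free makes it
available immediately:

```c
#include <assert.h>

#define NR_SLOTS   16	/* hypothetical toy swap map size */
#define CACHE_SIZE 4	/* hypothetical, stands in for SWAP_SLOTS_CACHE_SIZE */

static unsigned char swap_map[NR_SLOTS];	/* non-zero: slot still counts as in use */
static unsigned int ret_cache[CACHE_SIZE];	/* entries waiting to be returned */
static int n_ret;

/* Direct free: the slot becomes allocatable right away. */
static void free_direct(unsigned int offset)
{
	assert(offset < NR_SLOTS);
	swap_map[offset] = 0;
}

/* Return everything held in the cache to the map. */
static void flush_cache(void)
{
	while (n_ret > 0)
		free_direct(ret_cache[--n_ret]);
}

/* Cached free: the slot is only queued, and stays in use until a flush. */
static void free_cached(unsigned int offset)
{
	if (n_ret >= CACHE_SIZE)
		flush_cache();
	ret_cache[n_ret++] = offset;
}
```

While entries sit in ret_cache they cannot be reused by the allocator,
which is the source of the extra fragmentation mentioned above.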

Tests building the Linux kernel showed that both performance and
fragmentation are better without the cache:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 36047.78, Real time: 472.43
After: (-7.6% sys time, -7.3% real time)
Sys time: 33314.76, Real time: 437.67

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 46859.04, Real time: 562.63
hugepages-64kB/stats/swpout: 1783392
hugepages-64kB/stats/swpout_fallback: 240875
After: (-23.3% sys time, -21.3% real time)
Sys time: 35958.87, Real time: 442.69
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330

Sequential SWAP should also be slightly faster. Tests didn't show a
measurable difference, but at least there is no regression:

Swapin 4G zero page on ZRAM (time in us):
Before (avg. 1923756):
1912391 1927023 1927957 1916527 1918263 1914284 1934753 1940813 1921791
After (avg. 1922290):
1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

Link: https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/ [1]
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap_slots.h |  3 --
 mm/swap_slots.c            | 78 +++++----------------------------
 mm/swapfile.c              | 89 +++++++++++++++-----------------------
 3 files changed, 44 insertions(+), 126 deletions(-)

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..840aec3523b2 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -16,15 +16,12 @@ struct swap_slots_cache {
 	swp_entry_t	*slots;
 	int		nr;
 	int		cur;
-	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
-	swp_entry_t	*slots_ret;
 	int		n_ret;
 };
 
 void disable_swap_slots_cache_lock(void);
 void reenable_swap_slots_cache_unlock(void);
 void enable_swap_slots_cache(void);
-void free_swap_slot(swp_entry_t entry);
 
 extern bool swap_slot_cache_enabled;
 
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 13ab3b771409..9c7c171df7ba 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -43,17 +43,15 @@ static DEFINE_MUTEX(swap_slots_cache_mutex);
 /* Serialize swap slots cache enable/disable operations */
 static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
 
-static void __drain_swap_slots_cache(unsigned int type);
+static void __drain_swap_slots_cache(void);
 
 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
-#define SLOTS_CACHE 0x1
-#define SLOTS_CACHE_RET 0x2
 
 static void deactivate_swap_slots_cache(void)
 {
 	mutex_lock(&swap_slots_cache_mutex);
 	swap_slot_cache_active = false;
-	__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+	__drain_swap_slots_cache();
 	mutex_unlock(&swap_slots_cache_mutex);
 }
 
@@ -72,7 +70,7 @@ void disable_swap_slots_cache_lock(void)
 	if (swap_slot_cache_initialized) {
 		/* serialize with cpu hotplug operations */
 		cpus_read_lock();
-		__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+		__drain_swap_slots_cache();
 		cpus_read_unlock();
 	}
 }
@@ -113,7 +111,7 @@ static bool check_cache_active(void)
 static int alloc_swap_slot_cache(unsigned int cpu)
 {
 	struct swap_slots_cache *cache;
-	swp_entry_t *slots, *slots_ret;
+	swp_entry_t *slots;
 
 	/*
 	 * Do allocation outside swap_slots_cache_mutex
@@ -125,28 +123,19 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 	if (!slots)
 		return -ENOMEM;
 
-	slots_ret = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t),
-			     GFP_KERNEL);
-	if (!slots_ret) {
-		kvfree(slots);
-		return -ENOMEM;
-	}
-
 	mutex_lock(&swap_slots_cache_mutex);
 	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots || cache->slots_ret) {
+	if (cache->slots) {
 		/* cache already allocated */
 		mutex_unlock(&swap_slots_cache_mutex);
 
 		kvfree(slots);
-		kvfree(slots_ret);
 
 		return 0;
 	}
 
 	if (!cache->lock_initialized) {
 		mutex_init(&cache->alloc_lock);
-		spin_lock_init(&cache->free_lock);
 		cache->lock_initialized = true;
 	}
 	cache->nr = 0;
@@ -160,19 +149,16 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 	 */
 	mb();
 	cache->slots = slots;
-	cache->slots_ret = slots_ret;
 	mutex_unlock(&swap_slots_cache_mutex);
 	return 0;
 }
 
-static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
-				  bool free_slots)
+static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots)
 {
 	struct swap_slots_cache *cache;
-	swp_entry_t *slots = NULL;
 
 	cache = &per_cpu(swp_slots, cpu);
-	if ((type & SLOTS_CACHE) && cache->slots) {
+	if (cache->slots) {
 		mutex_lock(&cache->alloc_lock);
 		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
 		cache->cur = 0;
@@ -183,20 +169,9 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
 		}
 		mutex_unlock(&cache->alloc_lock);
 	}
-	if ((type & SLOTS_CACHE_RET) && cache->slots_ret) {
-		spin_lock_irq(&cache->free_lock);
-		swapcache_free_entries(cache->slots_ret, cache->n_ret);
-		cache->n_ret = 0;
-		if (free_slots && cache->slots_ret) {
-			slots = cache->slots_ret;
-			cache->slots_ret = NULL;
-		}
-		spin_unlock_irq(&cache->free_lock);
-		kvfree(slots);
-	}
 }
 
-static void __drain_swap_slots_cache(unsigned int type)
+static void __drain_swap_slots_cache(void)
 {
 	unsigned int cpu;
 
@@ -224,13 +199,13 @@ static void __drain_swap_slots_cache(unsigned int type)
 	 * There are no slots on such cpu that need to be drained.
 	 */
 	for_each_online_cpu(cpu)
-		drain_slots_cache_cpu(cpu, type, false);
+		drain_slots_cache_cpu(cpu, false);
 }
 
 static int free_slot_cache(unsigned int cpu)
 {
 	mutex_lock(&swap_slots_cache_mutex);
-	drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true);
+	drain_slots_cache_cpu(cpu, true);
 	mutex_unlock(&swap_slots_cache_mutex);
 	return 0;
 }
@@ -269,39 +244,6 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	return cache->nr;
 }
 
-void free_swap_slot(swp_entry_t entry)
-{
-	struct swap_slots_cache *cache;
-
-	/* Large folio swap slot is not covered. */
-	zswap_invalidate(entry);
-
-	cache = raw_cpu_ptr(&swp_slots);
-	if (likely(use_swap_slot_cache && cache->slots_ret)) {
-		spin_lock_irq(&cache->free_lock);
-		/* Swap slots cache may be deactivated before acquiring lock */
-		if (!use_swap_slot_cache || !cache->slots_ret) {
-			spin_unlock_irq(&cache->free_lock);
-			goto direct_free;
-		}
-		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
-			/*
-			 * Return slots to global pool.
-			 * The current swap_map value is SWAP_HAS_CACHE.
-			 * Set it to 0 to indicate it is available for
-			 * allocation in global pool
-			 */
-			swapcache_free_entries(cache->slots_ret, cache->n_ret);
-			cache->n_ret = 0;
-		}
-		cache->slots_ret[cache->n_ret++] = entry;
-		spin_unlock_irq(&cache->free_lock);
-	} else {
-direct_free:
-		swapcache_free_entries(&entry, 1);
-	}
-}
-
 swp_entry_t folio_alloc_swap(struct folio *folio)
 {
 	swp_entry_t entry;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6eb298a222c0..c77b6ec3c83b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,14 +53,15 @@
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
 static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
-				  unsigned int nr_pages);
+static void swap_entry_range_free(struct swap_info_struct *si,
+				  struct swap_cluster_info *ci,
+				  swp_entry_t entry, unsigned int nr_pages);
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
 static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 					      unsigned long offset);
-static void unlock_cluster(struct swap_cluster_info *ci);
+static inline void unlock_cluster(struct swap_cluster_info *ci);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -260,10 +261,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	folio_ref_sub(folio, nr_pages);
 	folio_set_dirty(folio);
 
-	/* Only sinple page folio can be backed by zswap */
-	if (nr_pages == 1)
-		zswap_invalidate(entry);
-	swap_entry_range_free(si, entry, nr_pages);
+	ci = lock_cluster(si, offset);
+	swap_entry_range_free(si, ci, entry, nr_pages);
+	unlock_cluster(ci);
 	ret = nr_pages;
 out_unlock:
 	folio_unlock(folio);
@@ -1105,8 +1105,10 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	 * Use atomic clear_bit operations only on zeromap instead of non-atomic
 	 * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
 	 */
-	for (i = 0; i < nr_entries; i++)
+	for (i = 0; i < nr_entries; i++) {
 		clear_bit(offset + i, si->zeromap);
+		zswap_invalidate(swp_entry(si->type, offset + i));
+	}
 
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
@@ -1410,9 +1412,9 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si,
 
 	ci = lock_cluster(si, offset);
 	usage = __swap_entry_free_locked(si, offset, 1);
-	unlock_cluster(ci);
 	if (!usage)
-		free_swap_slot(entry);
+		swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1);
+	unlock_cluster(ci);
 
 	return usage;
 }
@@ -1440,13 +1442,10 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	}
 	for (i = 0; i < nr; i++)
 		WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
+	if (!has_cache)
+		swap_entry_range_free(si, ci, entry, nr);
 	unlock_cluster(ci);
 
-	if (!has_cache) {
-		for (i = 0; i < nr; i++)
-			zswap_invalidate(swp_entry(si->type, offset + i));
-		swap_entry_range_free(si, entry, nr);
-	}
 	return has_cache;
 
 fallback:
@@ -1466,15 +1465,13 @@ static bool __swap_entries_free(struct swap_info_struct *si,
  * Drop the last HAS_CACHE flag of swap entries, caller have to
  * ensure all entries belong to the same cgroup.
  */
-static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
-				  unsigned int nr_pages)
+static void swap_entry_range_free(struct swap_info_struct *si,
+				  struct swap_cluster_info *ci,
+				  swp_entry_t entry, unsigned int nr_pages)
 {
 	unsigned long offset = swp_offset(entry);
 	unsigned char *map = si->swap_map + offset;
 	unsigned char *map_end = map + nr_pages;
-	struct swap_cluster_info *ci;
-
-	ci = lock_cluster(si, offset);
 
 	/* It should never free entries across different clusters */
 	VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
@@ -1494,7 +1491,6 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 		free_cluster(si, ci);
 	else
 		partial_free_cluster(si, ci);
-	unlock_cluster(ci);
 }
 
 static void cluster_swap_free_nr(struct swap_info_struct *si,
@@ -1502,28 +1498,13 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 		unsigned char usage)
 {
 	struct swap_cluster_info *ci;
-	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
-	int i, nr;
+	unsigned long end = offset + nr_pages;
 
 	ci = lock_cluster(si, offset);
-	while (nr_pages) {
-		nr = min(BITS_PER_LONG, nr_pages);
-		for (i = 0; i < nr; i++) {
-			if (!__swap_entry_free_locked(si, offset + i, usage))
-				bitmap_set(to_free, i, 1);
-		}
-		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
-			unlock_cluster(ci);
-			for_each_set_bit(i, to_free, BITS_PER_LONG)
-				free_swap_slot(swp_entry(si->type, offset + i));
-			if (nr == nr_pages)
-				return;
-			bitmap_clear(to_free, 0, BITS_PER_LONG);
-			ci = lock_cluster(si, offset);
-		}
-		offset += nr;
-		nr_pages -= nr;
-	}
+	do {
+		if (!__swap_entry_free_locked(si, offset, usage))
+			swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1);
+	} while (++offset < end);
 	unlock_cluster(ci);
 }
 
@@ -1564,18 +1545,12 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 		return;
 
 	ci = lock_cluster(si, offset);
-	if (size > 1 && swap_is_has_cache(si, offset, size)) {
-		unlock_cluster(ci);
-		swap_entry_range_free(si, entry, size);
-		return;
-	}
-	for (int i = 0; i < size; i++, entry.val++) {
-		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
-			unlock_cluster(ci);
-			free_swap_slot(entry);
-			if (i == size - 1)
-				return;
-			lock_cluster(si, offset);
+	if (swap_is_has_cache(si, offset, size))
+		swap_entry_range_free(si, ci, entry, size);
+	else {
+		for (int i = 0; i < size; i++, entry.val++) {
+			if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE))
+				swap_entry_range_free(si, ci, entry, 1);
 		}
 	}
 	unlock_cluster(ci);
@@ -1584,6 +1559,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
 	int i;
+	struct swap_cluster_info *ci;
 	struct swap_info_struct *si = NULL;
 
 	if (n <= 0)
@@ -1591,8 +1567,11 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 
 	for (i = 0; i < n; ++i) {
 		si = _swap_info_get(entries[i]);
-		if (si)
-			swap_entry_range_free(si, entries[i], 1);
+		if (si) {
+			ci = lock_cluster(si, swp_offset(entries[i]));
+			swap_entry_range_free(si, ci, entries[i], 1);
+			unlock_cluster(ci);
+		}
 	}
 }
 
-- 
2.47.0




* Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (12 preceding siblings ...)
  2024-10-22 19:37 ` [PATCH 13/13] mm, swap_slots: remove slot cache for freeing path Kairui Song
@ 2024-10-23  2:24 ` Huang, Ying
  2024-10-23 18:01   ` Kairui Song
  2024-10-23 10:27 ` Andrew Morton
  2024-10-23 17:59 ` Yosry Ahmed
  15 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2024-10-23  2:24 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Barry Song,
	Ryan Roberts, Hugh Dickins, Yosry Ahmed, Tim Chen, Nhat Pham,
	linux-kernel

Hi, Kairui,

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> This series improved the swap allocator performance greatly by reworking
> the locking design and simplify a lot of code path.
>
> This is follow up of previous swap cluster allocator series:
> https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> And this series is based on an follow up fix of the swap cluster
> allocator:
> https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
>
> This is part of the new swap allocator work item discussed in
> Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> "mTHP and swap allocator" discussion at LPC 2024.
>
> Previous series introduced a fully cluster based allocation algorithm,
> this series completely get rid of the old allocation path and makes the
> allocator avoid grabbing the si->lock unless needed. This bring huge
> performance gain and get rid of slot cache on freeing path.

Great!

> Currently, swap locking is mainly composed of two locks, cluster lock
> (ci->lock) and device lock (si->lock). The device lock is widely used
> to protect many things, causing it to be the main bottleneck for SWAP.

The name "device lock" can be confused with the device lock of struct device.
Would it be better to call it the swap device lock?

> Cluster lock is much more fine-grained, so it will be best to use
> ci->lock instead of si->lock as much as possible.
>
> `perf lock` indicates this issue clearly. Doing linux kernel build
> using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k
> pages), result of "perf lock contention -ab sleep 3":
>
>   contended   total wait     max wait     avg wait         type   caller
>
>      34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
>      16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
>      11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
>       4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
>       4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
>     406027      2.74 s       2.59 ms      6.74 us     spinlock   list_lru_add+0x39
>   ...snip...
>
> The top 5 caller are all users of si->lock, total wait time up sums to
> several minutes in the 3 seconds time window.

Can you show the results of `perf record -g` and `perf report -g` too?  I am
interested in checking how the hot spots shift.

> Following the new allocator design, many operation doesn't need to touch
> si->lock at all. We only need to take si->lock when doing operations
> across multiple clusters (eg. changing the cluster list), other
> operations only need to take ci->lock. So ideally allocator should
> always take ci->lock first, then, if needed, take si->lock. But due
> to historical reasons, ci->lock is used inside si->lock by design,
> causing lock inversion if we simply try to acquire si->lock after
> acquiring ci->lock.
>
> This series audited all si->lock usage, simplify legacy codes, eliminate
> usage of si->lock as much as possible by introducing new designs based
> on the new cluster allocator.
>
> Old HDD allocation codes are removed, cluster allocator is adapted
> with small changes for HDD usage, test is looking OK.

I think that it's a good idea to remove the HDD-specific allocation code.
Can you check the performance of swapping to HDD?  However, I understand
that many people have no HDD at hand.

> And this also removed slot cache for freeing path. The performance is
> better without it, and this enables other clean up and optimizations
> as discussed before:
> https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
>
> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` with the same test above :
>
>   contended   total wait     max wait     avg wait         type   caller
>   ... snip ...
>          91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
>   ... snip ...
>
> cluster_move and cluster_isolate_lock are basically the only users
> of si->lock now, performance gain is huge with reduced LOC.
>
> Tests
> ===
>
> Build kernel with defconfig on tmpfs with ZRAM as swap:
> ---
>
> Running a test matrix which is scaled up progressive for a intuitive result.
> The test are ran on top of tmpfs, using memory cgroup for memory limitation,
> on a 48c96t system.
>
> 12 test run for each case, it can be seen clearly that as concurrent job
> number goes higher the performance gain is higher, the performance is
> higher even with low concurrency.
>
>    make -j<NR>     |   System Time (seconds)  |   Total Time (seconds)
>  (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
>  With 4k pages only:
>   6 / 192M / 3G    |    5258 /  5235 / -0.3%  |    1420 /  1414 / -0.3%
>  12 / 256M / 4G    |    5518 /  5337 / -3.3%  |     758 /   742 / -2.1%
>  24 / 384M / 5G    |    7091 /  5766 / -18.7% |     476 /   422 / -11.3%
>  48 / 768M / 7G    |   11139 /  5831 / -47.7% |     330 /   221 / -33.0%
>  96 / 1.5G / 10G   |   21303 / 11353 / -46.7% |     283 /   180 / -36.4%
>  With 64k mTHP:
>  24 / 512M / 5G    |    5104 /  4641 / -18.7% |     376 /   358 / -4.8%
>  48 /   1G / 7G    |    8693 /  4662 / -18.7% |     257 /   176 / -31.5%
>  96 /   2G / 10G   |   17056 / 10263 / -39.8% |     234 /   169 / -27.8%

What is the swap in/out throughput before and after the change?

When I worked on swap in/out performance before, the hot spot shifted from
swap related code to the LRU lock and zone lock.  Things may have changed a
lot by now.

If zram is used as the swap device, the hot spot may become
compression/decompression once the swap lock contention is solved.  To
stress the swap subsystem further, we may use a RAM disk as swap.
Previously, we used a simulated pmem device (backed by DRAM).  That
can be set up as described in:

https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/

After creating the raw block device /dev/pmem0, we can do

$ mkswap /dev/pmem0
$ swapon /dev/pmem0

Can you use something similar if necessary?

> With more aggressive setup, it shows clearly both the performance and
> fragmentation are better:
>
> time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test run)
> Before:
> Sys time: 73578.30, Real time: 864.05
> time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> After: (-54.7% sys time, -49.3% real time)
> Sys time: 33314.76, Real time: 437.67
>
> time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test run)
> Before:
> Sys time: 74044.85, Real time: 846.51
> hugepages-64kB/stats/swpout: 1735216
> hugepages-64kB/stats/swpout_fallback: 430333
> After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> Sys time: 35958.87, Real time: 442.69
> hugepages-64kB/stats/swpout: 1866267
> hugepages-64kB/stats/swpout_fallback: 158330
>
> There is a up to 54.7% improvement for build kernel test, and lower
> fragmentation rate. Performance improvement should be even larger for
> micro benchmarks

Very good result!

> Build kernel with tinyconfig on tmpfs with HDD as swap:
> ---
>
> This test is similar to above, but HDD test is very noisy and slow, the
> deviation is huge, so just use tinyconfig instead and take the median test
> result of 3 test run, which looks OK:
>
> Before this series:
> 114.44user 29.11system 39:42.90elapsed 6%CPU
> 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>
> After this commit:
> 113.90user 23.81system 38:11.77elapsed 6%CPU
> 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>
> Single thread SWAP:
> ---
>
> Sequential SWAP should also be slightly faster as we removed a lot of
> unnecessary parts. Test using micro benchmark for swapout/in 4G
> zero memory using ZRAM, 10 test runs:
>
> Swapout Before (avg. 3359304):
> 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>
> Swapin Before (avg. 1928698):
> 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>
> Swapout After (avg. 3347511, -0.4%):
> 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>
> Swapin After (avg. 1922290, -0.3%):
> 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913
>
> Worth noticing the patch "mm, swap: use a global swap cluster for
> non-rotation device" introduced minor overhead for certain tests (see
> the test results in commit message), but the gain from later commit
> covered that, it can be further improved later.
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Kairui Song (13):
>   mm, swap: minor clean up for swap entry allocation
>   mm, swap: fold swap_info_get_cont in the only caller
>   mm, swap: remove old allocation path for HDD
>   mm, swap: use cluster lock for HDD
>   mm, swap: clean up device availability check
>   mm, swap: clean up plist removal and adding
>   mm, swap: hold a reference of si during scan and clean up flags
>   mm, swap: use an enum to define all cluster flags and wrap flags
>     changes
>   mm, swap: reduce contention on device lock
>   mm, swap: simplify percpu cluster updating
>   mm, swap: introduce a helper for retrieving cluster from offset
>   mm, swap: use a global swap cluster for non-rotation device
>   mm, swap_slots: remove slot cache for freeing path
>
>  fs/btrfs/inode.c           |    1 -
>  fs/iomap/swapfile.c        |    1 -
>  include/linux/swap.h       |   36 +-
>  include/linux/swap_slots.h |    3 -
>  mm/page_io.c               |    1 -
>  mm/swap_slots.c            |   78 +--
>  mm/swapfile.c              | 1198 ++++++++++++++++--------------------
>  7 files changed, 558 insertions(+), 760 deletions(-)

--
Best Regards,
Huang, Ying



* Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (13 preceding siblings ...)
  2024-10-23  2:24 ` [PATCH 00/13] mm, swap: rework of swap allocator locks Huang, Ying
@ 2024-10-23 10:27 ` Andrew Morton
  2024-10-23 17:56   ` Kairui Song
  2024-10-23 17:59 ` Yosry Ahmed
  15 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2024-10-23 10:27 UTC (permalink / raw)
  To: Kairui Song
  Cc: Kairui Song, linux-mm, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham,
	linux-kernel

On Wed, 23 Oct 2024 03:24:38 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` with the same test above :
> 
>   contended   total wait     max wait     avg wait         type   caller
>   ... snip ...
>          91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
>   ... snip ...

Were any overall runtime benefits observed?



* Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
  2024-10-23 10:27 ` Andrew Morton
@ 2024-10-23 17:56   ` Kairui Song
  0 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-23 17:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel

On Wed, Oct 23, 2024 at 6:27 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 23 Oct 2024 03:24:38 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > After this series, lock contention on si->lock is nearly unobservable
> > with `perf lock` with the same test above :
> >
> >   contended   total wait     max wait     avg wait         type   caller
> >   ... snip ...
> >          91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
> >   ... snip ...
> >          47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
> >   ... snip ...
> >          23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
> >   ... snip ...
> >          17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
> >   ... snip ...
>
> Were any overall runtime benefits observed?

Yes, see the "Tests" section in the cover letter. In summary, the kernel
build test takes up to 50% less time under memory pressure, with either
mTHP or 4K pages:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2 in VM:
(avg of 4 test run)
Before:
Sys time: 73578.30, Real time: 864.05
After: (-54.7% sys time, -49.3% real time)
Sys time: 33314.76, Real time: 437.67

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2 in VM:
(avg of 4 test run)
Before:
Sys time: 74044.85, Real time: 846.51
After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
Sys time: 35958.87, Real time: 442.69

Tests on the bare-metal host showed similar results.

There are some other test results that I haven't included in the V1 cover
letter, and I'm still testing more scenarios, e.g. a MySQL test in a 1G
memcg with 96 workers and ZRAM swap:
before:
    transactions:                        755630 (6292.11 per sec.)
    queries:                             12090080 (100673.69 per sec.)
after:
    transactions:                        1077156 (8972.73 per sec.)
    queries:                             17234496 (143563.65 per sec.)

About 42% higher throughput (roughly 30% less time per transaction).
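
The relative change can be recomputed from the transaction rates quoted
above (6292.11 vs. 8972.73 per second). The helpers below are only an
illustrative sketch; the function names are made up for this note and are
not part of any benchmark tool:

```c
/* Relative throughput gain, in percent, going from `before` to `after`. */
static double throughput_gain_pct(double before, double after)
{
	return (after / before - 1.0) * 100.0;
}

/* Relative reduction in time per operation, in percent, for the same rates. */
static double time_saved_pct(double before, double after)
{
	return (1.0 - before / after) * 100.0;
}
```

The two numbers differ because one is measured against the old rate and the
other against the new one, so it helps to state which of them a "faster"
claim refers to.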

The mTHP swap allocation success rate is also higher. I can highlight
these results in V2.



* Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
  2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (14 preceding siblings ...)
  2024-10-23 10:27 ` Andrew Morton
@ 2024-10-23 17:59 ` Yosry Ahmed
  15 siblings, 0 replies; 21+ messages in thread
From: Yosry Ahmed @ 2024-10-23 17:59 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Huang, Ying, Tim Chen, Nhat Pham, linux-kernel

On Tue, Oct 22, 2024 at 12:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series improved the swap allocator performance greatly by reworking
> the locking design and simplify a lot of code path.
>
> This is follow up of previous swap cluster allocator series:
> https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> And this series is based on an follow up fix of the swap cluster
> allocator:
> https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
>
> This is part of the new swap allocator work item discussed in
> Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> "mTHP and swap allocator" discussion at LPC 2024.
>
> Previous series introduced a fully cluster based allocation algorithm,
> this series completely get rid of the old allocation path and makes the
> allocator avoid grabbing the si->lock unless needed. This bring huge
> performance gain and get rid of slot cache on freeing path.
>
> Currently, swap locking is mainly composed of two locks, cluster lock
> (ci->lock) and device lock (si->lock). The device lock is widely used
> to protect many things, causing it to be the main bottleneck for SWAP.
>
> Cluster lock is much more fine-grained, so it will be best to use
> ci->lock instead of si->lock as much as possible.
>
> `perf lock` indicates this issue clearly. Doing linux kernel build
> using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k
> pages), result of "perf lock contention -ab sleep 3":
>
>   contended   total wait     max wait     avg wait         type   caller
>
>      34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
>      16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
>      11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
>       4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
>       4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
>     406027      2.74 s       2.59 ms      6.74 us     spinlock   list_lru_add+0x39
>   ...snip...
>
> The top 5 caller are all users of si->lock, total wait time up sums to
> several minutes in the 3 seconds time window.
>
> Following the new allocator design, many operation doesn't need to touch
> si->lock at all. We only need to take si->lock when doing operations
> across multiple clusters (eg. changing the cluster list), other
> operations only need to take ci->lock. So ideally allocator should
> always take ci->lock first, then, if needed, take si->lock. But due
> to historical reasons, ci->lock is used inside si->lock by design,
> causing lock inversion if we simply try to acquire si->lock after
> acquiring ci->lock.
>
> This series audited all si->lock usage, simplify legacy codes, eliminate
> usage of si->lock as much as possible by introducing new designs based
> on the new cluster allocator.
>
> Old HDD allocation codes are removed, cluster allocator is adapted
> with small changes for HDD usage, test is looking OK.
>
> And this also removed slot cache for freeing path. The performance is
> better without it, and this enables other clean up and optimizations
> as discussed before:
> https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
>
> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` with the same test above :
>
>   contended   total wait     max wait     avg wait         type   caller
>   ... snip ...
>          91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
>   ... snip ...
>          17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
>   ... snip ...
>
> cluster_move and cluster_isolate_lock are basically the only users
> of si->lock now, performance gain is huge with reduced LOC.
>
> Tests
> ===
>
> Build kernel with defconfig on tmpfs with ZRAM as swap:
> ---
>
> Running a test matrix which is scaled up progressive for a intuitive result.
> The test are ran on top of tmpfs, using memory cgroup for memory limitation,
> on a 48c96t system.
>
> 12 test run for each case, it can be seen clearly that as concurrent job
> number goes higher the performance gain is higher, the performance is
> higher even with low concurrency.
>
>    make -j<NR>     |   System Time (seconds)  |   Total Time (seconds)
>  (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
>  With 4k pages only:
>   6 / 192M / 3G    |    5258 /  5235 / -0.3%  |    1420 /  1414 / -0.3%
>  12 / 256M / 4G    |    5518 /  5337 / -3.3%  |     758 /   742 / -2.1%
>  24 / 384M / 5G    |    7091 /  5766 / -18.7% |     476 /   422 / -11.3%
>  48 / 768M / 7G    |   11139 /  5831 / -47.7% |     330 /   221 / -33.0%
>  96 / 1.5G / 10G   |   21303 / 11353 / -46.7% |     283 /   180 / -36.4%
>  With 64k mTHP:
>  24 / 512M / 5G    |    5104 /  4641 / -18.7% |     376 /   358 / -4.8%
>  48 /   1G / 7G    |    8693 /  4662 / -18.7% |     257 /   176 / -31.5%
>  96 /   2G / 10G   |   17056 / 10263 / -39.8% |     234 /   169 / -27.8%
>
> With more aggressive setup, it shows clearly both the performance and
> fragmentation are better:
>
> time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test run)
> Before:
> Sys time: 73578.30, Real time: 864.05
> time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> After: (-54.7% sys time, -49.3% real time)
> Sys time: 33314.76, Real time: 437.67
>
> time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test run)
> Before:
> Sys time: 74044.85, Real time: 846.51
> hugepages-64kB/stats/swpout: 1735216
> hugepages-64kB/stats/swpout_fallback: 430333
> After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> Sys time: 35958.87, Real time: 442.69
> hugepages-64kB/stats/swpout: 1866267
> hugepages-64kB/stats/swpout_fallback: 158330
>
> There is a up to 54.7% improvement for build kernel test, and lower
> fragmentation rate. Performance improvement should be even larger for
> micro benchmarks
>
> Build kernel with tinyconfig on tmpfs with HDD as swap:
> ---
>
> This test is similar to above, but HDD test is very noisy and slow, the
> deviation is huge, so just use tinyconfig instead and take the median test
> result of 3 test run, which looks OK:
>
> Before this series:
> 114.44user 29.11system 39:42.90elapsed 6%CPU
> 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>
> After this commit:
> 113.90user 23.81system 38:11.77elapsed 6%CPU
> 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>
> Single thread SWAP:
> ---
>
> Sequential SWAP should also be slightly faster as we removed a lot of
> unnecessary parts. Test using micro benchmark for swapout/in 4G
> zero memory using ZRAM, 10 test runs:
>
> Swapout Before (avg. 3359304):
> 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>
> Swapin Before (avg. 1928698):
> 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>
> Swapout After (avg. 3347511, -0.4%):
> 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>
> Swapin After (avg. 1922290, -0.3%):
> 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

Unfortunately I don't have the time to go through this series, but I
just wanted to say that this is awesome work, Kairui.

Selfishly, I especially like cleaning up the swap slot freeing path,
and having a centralized freeing path with a single call to
zswap_invalidate().

Thanks for doing this :)
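
For readers following along, the shape of the centralized freeing path can
be modeled in user space roughly as below. This is only a sketch under
stated assumptions, not the kernel code: a C11 atomic_flag spinlock stands
in for ci->lock, a counter stands in for zswap_invalidate(), and every
model_* name and MODEL_CLUSTER_SLOTS is hypothetical. It shows the caller
taking the cluster lock once and the single free helper running with that
lock already held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_CLUSTER_SLOTS 8	/* hypothetical size, only for this sketch */

/* Hypothetical stand-in for struct swap_cluster_info. */
struct model_cluster {
	atomic_flag lock;	/* plays the role of ci->lock */
	bool held;		/* lets callees assert the lock is held */
	unsigned char map[MODEL_CLUSTER_SLOTS];	/* per-slot use count */
	unsigned int count;	/* slots currently in use */
	unsigned int invalidated;	/* stands in for zswap_invalidate() calls */
};

static void model_lock_cluster(struct model_cluster *ci)
{
	while (atomic_flag_test_and_set_explicit(&ci->lock, memory_order_acquire))
		;	/* spin, like a spinlock */
	ci->held = true;
}

static void model_unlock_cluster(struct model_cluster *ci)
{
	ci->held = false;
	atomic_flag_clear_explicit(&ci->lock, memory_order_release);
}

/* Single freeing path; the caller must already hold the cluster lock. */
static void model_entry_range_free(struct model_cluster *ci,
				   unsigned int off, unsigned int nr)
{
	assert(ci->held);
	assert(off + nr <= MODEL_CLUSTER_SLOTS);
	for (unsigned int i = 0; i < nr; i++) {
		ci->map[off + i] = 0;
		ci->invalidated++;	/* the one place invalidation happens */
	}
	ci->count -= nr;
}

/* Drop one use from each slot, freeing emptied slots under one lock hold. */
static unsigned int model_free_nr(struct model_cluster *ci,
				  unsigned int off, unsigned int nr)
{
	unsigned int freed = 0;

	assert(off + nr <= MODEL_CLUSTER_SLOTS);
	model_lock_cluster(ci);
	for (unsigned int i = off; i < off + nr; i++) {
		if (ci->map[i] == 0)
			continue;	/* slot already free in this model */
		if (--ci->map[i] == 0) {
			model_entry_range_free(ci, i, 1);
			freed++;
		}
	}
	model_unlock_cluster(ci);
	return freed;
}
```

With a cluster whose map starts as {1, 2, 1}, model_free_nr(&ci, 0, 3)
frees slots 0 and 2, leaves slot 1 in use, and never drops and retakes the
lock in the middle of the loop.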



* Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
  2024-10-23  2:24 ` [PATCH 00/13] mm, swap: rework of swap allocator locks Huang, Ying
@ 2024-10-23 18:01   ` Kairui Song
  2024-10-24  3:04     ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Kairui Song @ 2024-10-23 18:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Tim Chen, Nhat Pham, linux-kernel

On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Kairui,

Hi Ying,

>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > This series improved the swap allocator performance greatly by reworking
> > the locking design and simplify a lot of code path.
> >
> > This is follow up of previous swap cluster allocator series:
> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> >
> > And this series is based on an follow up fix of the swap cluster
> > allocator:
> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
> >
> > This is part of the new swap allocator work item discussed in
> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> > "mTHP and swap allocator" discussion at LPC 2024.
> >
> > Previous series introduced a fully cluster based allocation algorithm,
> > this series completely get rid of the old allocation path and makes the
> > allocator avoid grabbing the si->lock unless needed. This bring huge
> > performance gain and get rid of slot cache on freeing path.
>
> Great!
>
> > Currently, swap locking is mainly composed of two locks, cluster lock
> > (ci->lock) and device lock (si->lock). The device lock is widely used
> > to protect many things, causing it to be the main bottleneck for SWAP.
>
> Device lock can be confusing with another device lock for struct device.
> Better to call it swap device lock?

Good idea, I'll use the term swap device lock then.

>
> > Cluster lock is much more fine-grained, so it will be best to use
> > ci->lock instead of si->lock as much as possible.
> >
> > `perf lock` indicates this issue clearly. Doing linux kernel build
> > using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k
> > pages), result of "perf lock contention -ab sleep 3":
> >
> >   contended   total wait     max wait     avg wait         type   caller
> >
> >      34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
> >      16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
> >      11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
> >       4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
> >       4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
> >     406027      2.74 s       2.59 ms      6.74 us     spinlock   list_lru_add+0x39
> >   ...snip...
> >
> > The top 5 caller are all users of si->lock, total wait time up sums to
> > several minutes in the 3 seconds time window.
>
> Can you show results of `perf record -g`, `perf report -g` too?  I have
> interest to check hot spot shifting too.

Sure. I think the `perf lock` result is already good enough and cleaner.
My test environments are mostly VM based, so the spinlock slow path may get
offloaded to the host and can't be seen by perf record. I collected the
following data after disabling paravirt spinlocks:

The time consumption and stack trace of a page fault before:
-   78.45%     0.17%  cc1              [kernel.kallsyms]  [k] asm_exc_page_fault
   - 78.28% asm_exc_page_fault
      - 78.18% exc_page_fault
         - 78.17% do_user_addr_fault
            - 78.09% handle_mm_fault
               - 78.06% __handle_mm_fault
                  - 69.69% do_swap_page
                     - 55.87% alloc_swap_folio
                        - 55.60% mem_cgroup_swapin_charge_folio
                           - 55.48% charge_memcg
                              - 55.45% try_charge_memcg
                                 - 55.36% try_to_free_mem_cgroup_pages
                                    - do_try_to_free_pages
                                       - 55.35% shrink_node
                                          - 55.27% shrink_lruvec
                                             - 55.13% try_to_shrink_lruvec
                                                - 54.79% evict_folios
                                                   - 54.35% shrink_folio_list
                                                      - 30.01% add_to_swap
                                                         - 29.77% folio_alloc_swap
                                                            - 29.50% get_swap_pages
                                                                 25.03% queued_spin_lock_slowpath
                                                               - 2.71% alloc_swap_scan_cluster
                                                                    1.80% queued_spin_lock_slowpath
                                                                  + 0.89% __try_to_reclaim_swap
                                                               - 1.74% swap_reclaim_full_clusters
                                                                    1.74% queued_spin_lock_slowpath
                                                      - 10.88% try_to_unmap_flush_dirty
                                                         - 10.87% arch_tlbbatch_flush
                                                            - 10.85% on_each_cpu_cond_mask
                                                                 smp_call_function_many_cond
                                                      + 7.45% pageout
                                                      + 2.71% try_to_unmap_flush
                                                      + 1.90% try_to_unmap
                                                      + 0.78% folio_referenced
                     - 9.41% cluster_swap_free_nr
                        - 9.39% free_swap_slot
                           - 9.35% swapcache_free_entries
                                8.40% queued_spin_lock_slowpath
                                0.93% swap_entry_range_free
                     - 3.61% swap_read_folio_bdev_sync
                        - 3.55% submit_bio_wait
                           - 3.51% submit_bio_noacct_nocheck
                              + 3.46% __submit_bio
                  + 7.71% do_pte_missing
                  + 0.61% wp_page_copy

The queued_spin_lock_slowpath above is the si->lock, and there are
multiple users of it so the total overhead is higher than shown.
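
As a rough cross-check, the queued_spin_lock_slowpath samples visible in the
"before" call graph above can simply be added up. This is only a sketch: the
percentages are copied by hand from that profile, the names below are made up
for illustration, and it only counts the part that shows up under the page
fault path.

```python
# Percent of total cycles spent in queued_spin_lock_slowpath in the
# "before" profile, keyed by the caller shown in the call graph.
# Values are copied by hand from the perf report output above.
slowpath_before = {
    "get_swap_pages": 25.03,
    "alloc_swap_scan_cluster": 1.80,
    "swap_reclaim_full_clusters": 1.74,
    "swapcache_free_entries": 8.40,
}

# Each entry is a separate leaf in the call graph, so summing them
# does not double count.
total_before = sum(slowpath_before.values())
print(f"visible spinning under the page fault path: {total_before:.2f}%")
```

So more than a third of all sampled cycles are spent spinning in this path
alone before the series.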

After:
-   75.05%     0.43%  cc1              [kernel.kallsyms]  [k] asm_exc_page_fault
   - 74.62% asm_exc_page_fault
      - 74.36% exc_page_fault
         - 74.34% do_user_addr_fault
            - 74.10% handle_mm_fault
               - 73.96% __handle_mm_fault
                  - 67.55% do_swap_page
                     - 45.92% alloc_swap_folio
                        - 45.03% mem_cgroup_swapin_charge_folio
                           - 44.58% charge_memcg
                              - 44.44% try_charge_memcg
                                 - 44.12% try_to_free_mem_cgroup_pages
                                    - do_try_to_free_pages
                                       - 44.10% shrink_node
                                          - 43.86% shrink_lruvec
                                             - 41.92% try_to_shrink_lruvec
                                                - 40.67% evict_folios
                                                   - 37.12% shrink_folio_list
                                                      - 20.88% pageout
                                                         + 20.02% swap_writepage
                                                         + 0.72% shmem_writepage
                                                      - 4.08% add_to_swap
                                                         - 2.48% folio_alloc_swap
                                                            - 2.12% __mem_cgroup_try_charge_swap
                                                               - 1.47% swap_cgroup_record
                                                                  + 1.32% _raw_spin_lock_irqsave
                                                         - 1.56% add_to_swap_cache
                                                            - 1.04% xas_store
                                                               + 1.01% workingset_update_node
                                                      + 3.97% try_to_unmap_flush_dirty
                                                      + 3.51% folio_referenced
                                                      + 2.24% __remove_mapping
                                                      + 1.16% try_to_unmap
                                                      + 0.52% try_to_unmap_flush
                                                     2.50% queued_spin_lock_slowpath
                                                     0.79% scan_folios
                                                + 1.20% try_to_inc_max_seq
                                             + 1.92% lru_add_drain
                        + 0.73% vma_alloc_folio_noprof
                     - 9.81% swap_read_folio_bdev_sync
                        - 9.61% submit_bio_wait
                           + 9.49% submit_bio_noacct_nocheck
                     - 8.06% cluster_swap_free_nr
                        - 8.02% swap_entry_range_free
                           + 3.92% __mem_cgroup_uncharge_swap
                           + 2.90% zram_slot_free_notify
                             0.58% clear_shadow_from_swap_cache
                     - 1.32% __folio_batch_add_and_move
                        - 1.30% folio_batch_move_lru
                           + 1.10% folio_lruvec_lock_irqsave

spin_lock usage is much lower.

I prefer the perf lock output as it shows the exact time and user of locks.

>
> > Following the new allocator design, many operation doesn't need to touch
> > si->lock at all. We only need to take si->lock when doing operations
> > across multiple clusters (eg. changing the cluster list), other
> > operations only need to take ci->lock. So ideally allocator should
> > always take ci->lock first, then, if needed, take si->lock. But due
> > to historical reasons, ci->lock is used inside si->lock by design,
> > causing lock inversion if we simply try to acquire si->lock after
> > acquiring ci->lock.
> >
> > This series audited all si->lock usage, simplify legacy codes, eliminate
> > usage of si->lock as much as possible by introducing new designs based
> > on the new cluster allocator.
> >
> > Old HDD allocation codes are removed, cluster allocator is adapted
> > with small changes for HDD usage, test is looking OK.
>
> I think that it's a good idea to remove HDD allocation specific code.
> Can you check the performance of swapping to HDD?  However, I understand
> that many people have no HDD in hand.

In theory it's not hard to make the cluster allocator work well with HDD,
see the commit "mm, swap: use a global swap cluster for non-rotation
device".
The testing is not very reliable though: I found HDD swap performance
is very unstable because of the IO pattern of HDD, so it's just a
best-effort attempt.

> > And this also removed slot cache for freeing path. The performance is
> > better without it, and this enables other clean up and optimizations
> > as discussed before:
> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> >
> > After this series, lock contention on si->lock is nearly unobservable
> > with `perf lock` with the same test above :
> >
> >   contended   total wait     max wait     avg wait         type   caller
> >   ... snip ...
> >          91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
> >   ... snip ...
> >          47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
> >   ... snip ...
> >          23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
> >   ... snip ...
> >          17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
> >   ... snip ...
> >
> > cluster_move and cluster_isolate_lock are basically the only users
> > of si->lock now, performance gain is huge with reduced LOC.
> >
> > Tests
> > ===
> >
> > Build kernel with defconfig on tmpfs with ZRAM as swap:
> > ---
> >
> > Running a test matrix which is scaled up progressive for a intuitive result.
> > The test are ran on top of tmpfs, using memory cgroup for memory limitation,
> > on a 48c96t system.
> >
> > 12 test run for each case, it can be seen clearly that as concurrent job
> > number goes higher the performance gain is higher, the performance is
> > higher even with low concurrency.
> >
> >    make -j<NR>     |   System Time (seconds)  |   Total Time (seconds)
> >  (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> >  With 4k pages only:
> >   6 / 192M / 3G    |    5258 /  5235 / -0.3%  |    1420 /  1414 / -0.3%
> >  12 / 256M / 4G    |    5518 /  5337 / -3.3%  |     758 /   742 / -2.1%
> >  24 / 384M / 5G    |    7091 /  5766 / -18.7% |     476 /   422 / -11.3%
> >  48 / 768M / 7G    |   11139 /  5831 / -47.7% |     330 /   221 / -33.0%
> >  96 / 1.5G / 10G   |   21303 / 11353 / -46.7% |     283 /   180 / -36.4%
> >  With 64k mTHP:
> >  24 / 512M / 5G    |    5104 /  4641 / -9.1%  |     376 /   358 / -4.8%
> >  48 /   1G / 7G    |    8693 /  4662 / -46.4% |     257 /   176 / -31.5%
> >  96 /   2G / 10G   |   17056 / 10263 / -39.8% |     234 /   169 / -27.8%
>
> How much is the swap in/out throughput before/after the change?

This may not be too beneficial for typical throughput measurement:
- For example, doing the same test with brd only shows a ~20%
performance improvement, which is still a big gain though. I think the
CPU cycles wasted spinning on si->lock may affect CPU-sensitive things
like ZRAM even more.
- And simple benchmarks which just do sequential swap in/out in multiple
threads hardly stress the allocator. I haven't found a good benchmark to
simulate random parallel IOs on SWAP yet, I can write one later.

More realistic benchmarks, like the kernel build test or
mysql/sysbench, all showed great improvement.

>
> When I worked on swap in/out performance before, the hot spot shifts from
> swap related code to LRU lock and zone lock.  Things may change a lot
> now.
>
> If zram is used as swap device, the hot spot may become
> compression/decompression after solving the swap lock contention.  To
> stress swap subsystem further, we may use a ram disk as swap.
> Previously, we have used a simulated pmem device (backed by DRAM).  That
> can be setup as in,
>
> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
>
> After creating the raw block device: /dev/pmem0, we can do
>
> $ mkswap /dev/pmem0
> $ swapon /dev/pmem0
>
> Can you use something similar if necessary?

I used to test with brd, as described above. I think using ZRAM with
tests simulating real workloads is more useful.
And I did include a sequential SWAP test, the result is looking OK (no
regression, minor to no improvement).

I can give the pmem setup a try later, I guess the result will
be similar to the brd test.


>
> > With more aggressive setup, it shows clearly both the performance and
> > fragmentation are better:
> >
> > time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> > (avg of 4 test run)
> > Before:
> > Sys time: 73578.30, Real time: 864.05
> > time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> > After: (-54.7% sys time, -49.3% real time)
> > Sys time: 33314.76, Real time: 437.67
> >
> > time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> > (avg of 4 test run)
> > Before:
> > Sys time: 74044.85, Real time: 846.51
> > hugepages-64kB/stats/swpout: 1735216
> > hugepages-64kB/stats/swpout_fallback: 430333
> > After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> > Sys time: 35958.87, Real time: 442.69
> > hugepages-64kB/stats/swpout: 1866267
> > hugepages-64kB/stats/swpout_fallback: 158330
> >
> > There is a up to 54.7% improvement for build kernel test, and lower
> > fragmentation rate. Performance improvement should be even larger for
> > micro benchmarks
>
> Very good result!
>
> > Build kernel with tinyconfig on tmpfs with HDD as swap:
> > ---
> >
> > This test is similar to above, but HDD test is very noisy and slow, the
> > deviation is huge, so just use tinyconfig instead and take the median test
> > result of 3 test run, which looks OK:
> >
> > Before this series:
> > 114.44user 29.11system 39:42.90elapsed 6%CPU
> > 2901232inputs+0outputs (238877major+4227640minor)pagefaults
> >
> > After this commit:
> > 113.90user 23.81system 38:11.77elapsed 6%CPU
> > 2548728inputs+0outputs (235471major+4238110minor)pagefaults
> >
> > Single thread SWAP:
> > ---
> >
> > Sequential SWAP should also be slightly faster as we removed a lot of
> > unnecessary parts. Test using micro benchmark for swapout/in 4G
> > zero memory using ZRAM, 10 test runs:
> >
> > Swapout Before (avg. 3359304):
> > 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
> >
> > Swapin Before (avg. 1928698):
> > 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
> >
> > Swapout After (avg. 3347511, -0.4%):
> > 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
> >
> > Swapin After (avg. 1922290, -0.3%):
> > 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913
> >
> > Worth noticing the patch "mm, swap: use a global swap cluster for
> > non-rotation device" introduced minor overhead for certain tests (see
> > the test results in commit message), but the gain from later commit
> > covered that, it can be further improved later.
> >
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> >
> > Kairui Song (13):
> >   mm, swap: minor clean up for swap entry allocation
> >   mm, swap: fold swap_info_get_cont in the only caller
> >   mm, swap: remove old allocation path for HDD
> >   mm, swap: use cluster lock for HDD
> >   mm, swap: clean up device availability check
> >   mm, swap: clean up plist removal and adding
> >   mm, swap: hold a reference of si during scan and clean up flags
> >   mm, swap: use an enum to define all cluster flags and wrap flags
> >     changes
> >   mm, swap: reduce contention on device lock
> >   mm, swap: simplify percpu cluster updating
> >   mm, swap: introduce a helper for retrieving cluster from offset
> >   mm, swap: use a global swap cluster for non-rotation device
> >   mm, swap_slots: remove slot cache for freeing path
> >
> >  fs/btrfs/inode.c           |    1 -
> >  fs/iomap/swapfile.c        |    1 -
> >  include/linux/swap.h       |   36 +-
> >  include/linux/swap_slots.h |    3 -
> >  mm/page_io.c               |    1 -
> >  mm/swap_slots.c            |   78 +--
> >  mm/swapfile.c              | 1198 ++++++++++++++++--------------------
> >  7 files changed, 558 insertions(+), 760 deletions(-)
>
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
  2024-10-23 18:01   ` Kairui Song
@ 2024-10-24  3:04     ` Huang, Ying
  2024-10-24  3:51       ` Kairui Song
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2024-10-24  3:04 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Tim Chen, Nhat Pham, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Kairui,
>
> Hi Ying,
>
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > This series improved the swap allocator performance greatly by reworking
>> > the locking design and simplify a lot of code path.
>> >
>> > This is follow up of previous swap cluster allocator series:
>> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>> >
>> > And this series is based on an follow up fix of the swap cluster
>> > allocator:
>> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
>> >
>> > This is part of the new swap allocator work item discussed in
>> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
>> > "mTHP and swap allocator" discussion at LPC 2024.
>> >
>> > Previous series introduced a fully cluster based allocation algorithm,
>> > this series completely get rid of the old allocation path and makes the
>> > allocator avoid grabbing the si->lock unless needed. This bring huge
>> > performance gain and get rid of slot cache on freeing path.
>>
>> Great!
>>
>> > Currently, swap locking is mainly composed of two locks, cluster lock
>> > (ci->lock) and device lock (si->lock). The device lock is widely used
>> > to protect many things, causing it to be the main bottleneck for SWAP.
>>
>> Device lock can be confusing with another device lock for struct device.
>> Better to call it swap device lock?
>
> Good idea, I'll use the term swap device lock then.
>
>>
>> > Cluster lock is much more fine-grained, so it will be best to use
>> > ci->lock instead of si->lock as much as possible.
>> >
>> > `perf lock` indicates this issue clearly. Doing linux kernel build
>> > using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k
>> > pages), result of "perf lock contention -ab sleep 3":
>> >
>> >   contended   total wait     max wait     avg wait         type   caller
>> >
>> >      34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
>> >      16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
>> >      11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
>> >       4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
>> >       4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
>> >     406027      2.74 s       2.59 ms      6.74 us     spinlock   list_lru_add+0x39
>> >   ...snip...
>> >
>> > The top 5 caller are all users of si->lock, total wait time up sums to
>> > several minutes in the 3 seconds time window.
>>
>> Can you show results of `perf record -g`, `perf report -g` too?  I have
>> interest to check hot spot shifting too.
>
> Sure. I think the `perf lock` result is already good enough and cleaner.
> My test environments are mostly VM based, so the spinlock slow path may get
> offloaded to the host and can't be seen by perf record. I collected the
> following data after disabling paravirt spinlocks:
>
> The time consumption and stack trace of a page fault before:
> -   78.45%     0.17%  cc1              [kernel.kallsyms]  [k] asm_exc_page_fault
>    - 78.28% asm_exc_page_fault
>       - 78.18% exc_page_fault
>          - 78.17% do_user_addr_fault
>             - 78.09% handle_mm_fault
>                - 78.06% __handle_mm_fault
>                   - 69.69% do_swap_page
>                      - 55.87% alloc_swap_folio
>                         - 55.60% mem_cgroup_swapin_charge_folio
>                            - 55.48% charge_memcg
>                               - 55.45% try_charge_memcg
>                                  - 55.36% try_to_free_mem_cgroup_pages
>                                     - do_try_to_free_pages
>                                        - 55.35% shrink_node
>                                           - 55.27% shrink_lruvec
>                                              - 55.13% try_to_shrink_lruvec
>                                                 - 54.79% evict_folios
>                                                    - 54.35% shrink_folio_list
>                                                       - 30.01% add_to_swap
>                                                          - 29.77% folio_alloc_swap
>                                                             - 29.50% get_swap_pages
>                                                                  25.03% queued_spin_lock_slowpath
>                                                                - 2.71% alloc_swap_scan_cluster
>                                                                     1.80% queued_spin_lock_slowpath
>                                                                   + 0.89% __try_to_reclaim_swap
>                                                                - 1.74% swap_reclaim_full_clusters
>                                                                     1.74% queued_spin_lock_slowpath
>                                                       - 10.88% try_to_unmap_flush_dirty
>                                                          - 10.87% arch_tlbbatch_flush
>                                                             - 10.85% on_each_cpu_cond_mask
>                                                                  smp_call_function_many_cond
>                                                       + 7.45% pageout
>                                                       + 2.71% try_to_unmap_flush
>                                                       + 1.90% try_to_unmap
>                                                       + 0.78% folio_referenced
>                      - 9.41% cluster_swap_free_nr
>                         - 9.39% free_swap_slot
>                            - 9.35% swapcache_free_entries
>                                 8.40% queued_spin_lock_slowpath
>                                 0.93% swap_entry_range_free
>                      - 3.61% swap_read_folio_bdev_sync
>                         - 3.55% submit_bio_wait
>                            - 3.51% submit_bio_noacct_nocheck
>                               + 3.46% __submit_bio
>                   + 7.71% do_pte_missing
>                   + 0.61% wp_page_copy
>
> The queued_spin_lock_slowpath above is the si->lock, and there are
> multiple users of it so the total overhead is higher than shown.
>
> After:
> -   75.05%     0.43%  cc1              [kernel.kallsyms]  [k] asm_exc_page_fault
>    - 74.62% asm_exc_page_fault
>       - 74.36% exc_page_fault
>          - 74.34% do_user_addr_fault
>             - 74.10% handle_mm_fault
>                - 73.96% __handle_mm_fault
>                   - 67.55% do_swap_page
>                      - 45.92% alloc_swap_folio
>                         - 45.03% mem_cgroup_swapin_charge_folio
>                            - 44.58% charge_memcg
>                               - 44.44% try_charge_memcg
>                                  - 44.12% try_to_free_mem_cgroup_pages
>                                     - do_try_to_free_pages
>                                        - 44.10% shrink_node
>                                           - 43.86% shrink_lruvec
>                                              - 41.92% try_to_shrink_lruvec
>                                                 - 40.67% evict_folios
>                                                    - 37.12% shrink_folio_list
>                                                       - 20.88% pageout
>                                                          + 20.02% swap_writepage
>                                                          + 0.72% shmem_writepage
>                                                       - 4.08% add_to_swap
>                                                          - 2.48% folio_alloc_swap
>                                                             - 2.12% __mem_cgroup_try_charge_swap
>                                                                - 1.47% swap_cgroup_record
>                                                                   + 1.32% _raw_spin_lock_irqsave
>                                                          - 1.56% add_to_swap_cache
>                                                             - 1.04% xas_store
>                                                                + 1.01% workingset_update_node
>                                                       + 3.97% try_to_unmap_flush_dirty
>                                                       + 3.51% folio_referenced
>                                                       + 2.24% __remove_mapping
>                                                       + 1.16% try_to_unmap
>                                                       + 0.52% try_to_unmap_flush
>                                                      2.50% queued_spin_lock_slowpath
>                                                      0.79% scan_folios
>                                                 + 1.20% try_to_inc_max_seq
>                                              + 1.92% lru_add_drain
>                         + 0.73% vma_alloc_folio_noprof
>                      - 9.81% swap_read_folio_bdev_sync
>                         - 9.61% submit_bio_wait
>                            + 9.49% submit_bio_noacct_nocheck
>                      - 8.06% cluster_swap_free_nr
>                         - 8.02% swap_entry_range_free
>                            + 3.92% __mem_cgroup_uncharge_swap
>                            + 2.90% zram_slot_free_notify
>                              0.58% clear_shadow_from_swap_cache
>                      - 1.32% __folio_batch_add_and_move
>                         - 1.30% folio_batch_move_lru
>                            + 1.10% folio_lruvec_lock_irqsave

Thanks for the data.

It seems that the cycles shift from spinning to memory compression.
That is expected.

> spin_lock usage is much lower.
>
> I prefer the perf lock output as it shows the exact time and user of locks.

perf cycles data is more complete.  You can find which part becomes the new
hot spot.

>>
>> > Following the new allocator design, many operation doesn't need to touch
>> > si->lock at all. We only need to take si->lock when doing operations
>> > across multiple clusters (eg. changing the cluster list), other
>> > operations only need to take ci->lock. So ideally allocator should
>> > always take ci->lock first, then, if needed, take si->lock. But due
>> > to historical reasons, ci->lock is used inside si->lock by design,
>> > causing lock inversion if we simply try to acquire si->lock after
>> > acquiring ci->lock.
>> >
>> > This series audited all si->lock usage, simplify legacy codes, eliminate
>> > usage of si->lock as much as possible by introducing new designs based
>> > on the new cluster allocator.
>> >
>> > Old HDD allocation codes are removed, cluster allocator is adapted
>> > with small changes for HDD usage, test is looking OK.
>>
>> I think that it's a good idea to remove HDD allocation specific code.
>> Can you check the performance of swapping to HDD?  However, I understand
>> that many people have no HDD in hand.
>
> It's not hard to make cluster allocator work well with HDD in theory,
> see the commit "mm, swap: use a global swap cluster for non-rotation
> device".
> The testing is not very reliable though, I found HDD swap performance
> is very unstable because of the IO pattern of HDD, so it's just a best
> effort try.

This is just to check whether the code change causes something too bad for
HDD.  No measurable difference is good news.

>> > And this also removed slot cache for freeing path. The performance is
>> > better without it, and this enables other clean up and optimizations
>> > as discussed before:
>> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
>> >
>> > After this series, lock contention on si->lock is nearly unobservable
>> > with `perf lock` with the same test above :
>> >
>> >   contended   total wait     max wait     avg wait         type   caller
>> >   ... snip ...
>> >          91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
>> >   ... snip ...
>> >          47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
>> >   ... snip ...
>> >          23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
>> >   ... snip ...
>> >          17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
>> >   ... snip ...
>> >
>> > cluster_move and cluster_isolate_lock are basically the only users
>> > of si->lock now, performance gain is huge with reduced LOC.
>> >
>> > Tests
>> > ===
>> >
>> > Build kernel with defconfig on tmpfs with ZRAM as swap:
>> > ---
>> >
>> > Running a test matrix which is scaled up progressive for a intuitive result.
>> > The test are ran on top of tmpfs, using memory cgroup for memory limitation,
>> > on a 48c96t system.
>> >
>> > 12 test run for each case, it can be seen clearly that as concurrent job
>> > number goes higher the performance gain is higher, the performance is
>> > higher even with low concurrency.
>> >
>> >    make -j<NR>     |   System Time (seconds)  |   Total Time (seconds)
>> >  (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
>> >  With 4k pages only:
>> >   6 / 192M / 3G    |    5258 /  5235 / -0.3%  |    1420 /  1414 / -0.3%
>> >  12 / 256M / 4G    |    5518 /  5337 / -3.3%  |     758 /   742 / -2.1%
>> >  24 / 384M / 5G    |    7091 /  5766 / -18.7% |     476 /   422 / -11.3%
>> >  48 / 768M / 7G    |   11139 /  5831 / -47.7% |     330 /   221 / -33.0%
>> >  96 / 1.5G / 10G   |   21303 / 11353 / -46.7% |     283 /   180 / -36.4%
>> >  With 64k mTHP:
>> >  24 / 512M / 5G    |    5104 /  4641 / -9.1%  |     376 /   358 / -4.8%
>> >  48 /   1G / 7G    |    8693 /  4662 / -46.4% |     257 /   176 / -31.5%
>> >  96 /   2G / 10G   |   17056 / 10263 / -39.8% |     234 /   169 / -27.8%
>>
>> How much is the swap in/out throughput before/after the change?
>
> This may not be too beneficial for typical throughput measurement:
> - For example, doing the same test with brd only shows a ~20%
> performance improvement, which is still a big gain though. I think the
> CPU cycles wasted spinning on si->lock may affect CPU-sensitive things
> like ZRAM even more.

20% is a good result.  You don't need to guess; perf cycles profiling can
show the hot spot.

> - And simple benchmarks which just do multiple sequential swaps in/out
> in multiple thread hardly stress the allocator.
>
> I haven't found a good
> benchmark to simulate random parallel IOs on SWAP yet, I can write one
> later.

I have used the anon-w-rand test case of vm-scalability to simulate random
parallel swap out.

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-rand
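
The access pattern such a test exercises can be sketched roughly as below.
This is not the vm-scalability code, only a minimal illustration: the function
names and default sizes are made up, and Python threads are serialized by the
interpreter lock, so a real benchmark would use separate processes or C. To
actually generate swap out, it has to run in a memcg whose limit is smaller
than the mapping.

```python
import mmap
import random
import threading

PAGE_SIZE = mmap.PAGESIZE


def dirty_random_pages(buf, nr_pages, nr_writes, seed):
    """Write one byte into randomly chosen pages of an anonymous mapping."""
    rng = random.Random(seed)
    for _ in range(nr_writes):
        page = rng.randrange(nr_pages)
        # Touching the first byte is enough to make the page dirty anon memory.
        buf[page * PAGE_SIZE] = 1


def run(nr_threads=4, nr_pages=1024, nr_writes=10000):
    """Dirty random pages from several threads, return how many pages got touched."""
    # One shared anonymous mapping (fd -1), written by all threads.
    buf = mmap.mmap(-1, nr_pages * PAGE_SIZE)
    threads = [
        threading.Thread(target=dirty_random_pages,
                         args=(buf, nr_pages, nr_writes, seed))
        for seed in range(nr_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    touched = sum(1 for p in range(nr_pages) if buf[p * PAGE_SIZE] == 1)
    buf.close()
    return touched


if __name__ == "__main__":
    print("pages touched:", run())
```

With the mapping sized well above the memcg limit, the random writes force
reclaim to keep allocating swap entries from many CPUs at once, which is the
part sequential tests hardly stress.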

> Benchmarks closer to the real world, like the kernel build test or
> mysql/sysbench, all showed great improvement.

Yes.  Real workloads are good.  We can use micro-benchmarks to find
performance limits, for example, the maximum possible throughput.

>>
>> When I worked on swap in/out performance before, the hot spot shifts from
>> swap related code to LRU lock and zone lock.  Things may change a lot
>> now.
>>
>> If zram is used as swap device, the hot spot may become
>> compression/decompression after solving the swap lock contention.  To
>> stress swap subsystem further, we may use a ram disk as swap.
>> Previously, we have used a simulated pmem device (backed by DRAM).  That
>> can be setup as in,
>>
>> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
>>
>> After creating the raw block device: /dev/pmem0, we can do
>>
>> $ mkswap /dev/pmem0
>> $ swapon /dev/pmem0
>>
>> Can you use something similar if necessary?
>
> I used to test with brd, as described above,

brd allocates memory while running; pmem avoids that.  A perf profile is
your friend for root-causing possible issues.

> I think using ZRAM with
> tests simulating real workloads is more useful.

Yes.  And, as I said before, micro-benchmarks have their own value.

> And I did include a Sequential SWAP test, the result is looking OK (no
> regression, minor to none improvement).

Good.  At least we have no regression here.

> I can try the pmem setup later; I guess the result will
> be similar to the brd test.
>
>
>>
>> > With a more aggressive setup, it shows clearly that both the performance and
>> > fragmentation are better:
>> >
>> > time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
>> > (avg of 4 test run)
>> > Before:
>> > Sys time: 73578.30, Real time: 864.05
>> > time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
>> > After: (-54.7% sys time, -49.3% real time)
>> > Sys time: 33314.76, Real time: 437.67
>> >
>> > time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
>> > (avg of 4 test run)
>> > Before:
>> > Sys time: 74044.85, Real time: 846.51
>> > hugepages-64kB/stats/swpout: 1735216
>> > hugepages-64kB/stats/swpout_fallback: 430333
>> > After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
>> > Sys time: 35958.87, Real time: 442.69
>> > hugepages-64kB/stats/swpout: 1866267
>> > hugepages-64kB/stats/swpout_fallback: 158330
>> >
>> > There is an up to 54.7% improvement for the kernel build test, and a lower
>> > fragmentation rate. The performance improvement should be even larger for
>> > micro-benchmarks.
>>
>> Very good result!
>>
>> > Build kernel with tinyconfig on tmpfs with HDD as swap:
>> > ---
>> >
>> > This test is similar to the above, but the HDD test is very noisy and slow
>> > and the deviation is huge, so tinyconfig is used instead and the median
>> > result of 3 test runs is taken, which looks OK:
>> >
>> > Before this series:
>> > 114.44user 29.11system 39:42.90elapsed 6%CPU
>> > 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>> >
>> > After this commit:
>> > 113.90user 23.81system 38:11.77elapsed 6%CPU
>> > 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>> >
>> > Single thread SWAP:
>> > ---
>> >
>> > Sequential SWAP should also be slightly faster as we removed a lot of
>> > unnecessary parts. Tested using a micro-benchmark that swaps out/in 4G of
>> > zeroed memory using ZRAM, 10 test runs:
>> >
>> > Swapout Before (avg. 3359304):
>> > 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>> >
>> > Swapin Before (avg. 1928698):
>> > 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>> >
>> > Swapout After (avg. 3347511, -0.4%):
>> > 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>> >
>> > Swapin After (avg. 1922290, -0.3%):
>> > 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913
>> >
>> > It is worth noting that the patch "mm, swap: use a global swap cluster for
>> > non-rotation device" introduced minor overhead for certain tests (see
>> > the test results in the commit message), but the gain from later commits
>> > covers that, and it can be further improved later.
>> >
>> > Suggested-by: Chris Li <chrisl@kernel.org>
>> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> >
>> > Kairui Song (13):
>> >   mm, swap: minor clean up for swap entry allocation
>> >   mm, swap: fold swap_info_get_cont in the only caller
>> >   mm, swap: remove old allocation path for HDD
>> >   mm, swap: use cluster lock for HDD
>> >   mm, swap: clean up device availability check
>> >   mm, swap: clean up plist removal and adding
>> >   mm, swap: hold a reference of si during scan and clean up flags
>> >   mm, swap: use an enum to define all cluster flags and wrap flags
>> >     changes
>> >   mm, swap: reduce contention on device lock
>> >   mm, swap: simplify percpu cluster updating
>> >   mm, swap: introduce a helper for retrieving cluster from offset
>> >   mm, swap: use a global swap cluster for non-rotation device
>> >   mm, swap_slots: remove slot cache for freeing path
>> >
>> >  fs/btrfs/inode.c           |    1 -
>> >  fs/iomap/swapfile.c        |    1 -
>> >  include/linux/swap.h       |   36 +-
>> >  include/linux/swap_slots.h |    3 -
>> >  mm/page_io.c               |    1 -
>> >  mm/swap_slots.c            |   78 +--
>> >  mm/swapfile.c              | 1198 ++++++++++++++++--------------------
>> >  7 files changed, 558 insertions(+), 760 deletions(-)

--
Best Regards,
Huang, Ying



* Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
  2024-10-24  3:04     ` Huang, Ying
@ 2024-10-24  3:51       ` Kairui Song
  0 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2024-10-24  3:51 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Tim Chen, Nhat Pham, linux-kernel

On Thu, Oct 24, 2024 at 11:08 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Wed, Oct 23, 2024 at 10:27 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Kairui,
> >
> > Hi Ying,
> >
> >>
> >> Kairui Song <ryncsn@gmail.com> writes:
> >>
> >> > From: Kairui Song <kasong@tencent.com>
> >> >
> >> > This series improved the swap allocator performance greatly by reworking
> >> > the locking design and simplify a lot of code path.
> >> >
> >> > This is follow up of previous swap cluster allocator series:
> >> > https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> >> >
> >> > And this series is based on an follow up fix of the swap cluster
> >> > allocator:
> >> > https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
> >> >
> >> > This is part of the new swap allocator work item discussed in
> >> > Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> >> > "mTHP and swap allocator" discussion at LPC 2024.
> >> >
> >> > Previous series introduced a fully cluster based allocation algorithm,
> >> > this series completely get rid of the old allocation path and makes the
> >> > allocator avoid grabbing the si->lock unless needed. This bring huge
> >> > performance gain and get rid of slot cache on freeing path.
> >>
> >> Great!
> >>
> >> > Currently, swap locking is mainly composed of two locks, cluster lock
> >> > (ci->lock) and device lock (si->lock). The device lock is widely used
> >> > to protect many things, causing it to be the main bottleneck for SWAP.
> >>
> >> "Device lock" can be confused with the device lock of struct device.
> >> Better to call it the swap device lock?
> >
> > Good idea, I'll use the term swap device lock then.
> >
> >>
> >> > Cluster lock is much more fine-grained, so it will be best to use
> >> > ci->lock instead of si->lock as much as possible.
> >> >
> >> > `perf lock` indicates this issue clearly. Doing linux kernel build
> >> > using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k
> >> > pages), result of "perf lock contention -ab sleep 3":
> >> >
> >> >   contended   total wait     max wait     avg wait         type   caller
> >> >
> >> >      34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
> >> >      16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
> >> >      11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
> >> >       4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
> >> >       4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
> >> >     406027      2.74 s       2.59 ms      6.74 us     spinlock   list_lru_add+0x39
> >> >   ...snip...
> >> >
> >> > The top 5 callers are all users of si->lock; total wait time sums up to
> >> > several minutes within the 3 second time window.
> >>
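
The "several minutes" figure can be reproduced by summing the "total wait"
column of the rows quoted above. A small Python sketch under stated
assumptions: the parser and the names UNIT_SECONDS and total_wait_seconds
are hypothetical helpers for this mail, not part of perf, and the rows are
copied verbatim from the quoted output:

```python
# Conversion factors for the time units that appear in the quoted output.
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}

# Top 5 rows copied from the quoted `perf lock contention -ab sleep 3` output.
PERF_LOCK_ROWS = """\
34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
 4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
 4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
"""


def total_wait_seconds(text: str) -> float:
    """Sum the 'total wait' column (value + unit) over all rows, in seconds."""
    total = 0.0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # fields[0] = contended count, fields[1] = total wait value, fields[2] = unit
        total += float(fields[1]) * UNIT_SECONDS[fields[2]]
    return total


total = total_wait_seconds(PERF_LOCK_ROWS)
print(f"top 5 callers: {total:.2f} s total wait ({total / 60:.1f} min)")
```

For these five rows the sum is about 152 seconds, i.e. roughly 2.5 minutes
of accumulated wait inside the 3 second sampling window.
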
> >> Can you show the results of `perf record -g`, `perf report -g` too?  I am
> >> interested in checking how the hot spots shift, too.
> >
> > Sure. I think the `perf lock` result is already good enough and cleaner.
> > My test environments are mostly VM based, so the spinlock slow path may get
> > offloaded to the host and can't be seen by perf record. I collected the
> > following data after disabling paravirt spinlocks:
> >
> > The time consumption and stack trace of a page fault before:
> > -   78.45%     0.17%  cc1              [kernel.kallsyms]  [k] asm_exc_page_fault
> >    - 78.28% asm_exc_page_fault
> >       - 78.18% exc_page_fault
> >          - 78.17% do_user_addr_fault
> >             - 78.09% handle_mm_fault
> >                - 78.06% __handle_mm_fault
> >                   - 69.69% do_swap_page
> >                      - 55.87% alloc_swap_folio
> >                         - 55.60% mem_cgroup_swapin_charge_folio
> >                            - 55.48% charge_memcg
> >                               - 55.45% try_charge_memcg
> >                                  - 55.36% try_to_free_mem_cgroup_pages
> >                                     - do_try_to_free_pages
> >                                        - 55.35% shrink_node
> >                                           - 55.27% shrink_lruvec
> >                                              - 55.13% try_to_shrink_lruvec
> >                                                 - 54.79% evict_folios
> >                                                    - 54.35% shrink_folio_list
> >                                                       - 30.01% add_to_swap
> >                                                          - 29.77% folio_alloc_swap
> >                                                             - 29.50% get_swap_pages
> >                                                                  25.03% queued_spin_lock_slowpath
> >                                                                - 2.71% alloc_swap_scan_cluster
> >                                                                     1.80% queued_spin_lock_slowpath
> >                                                                   + 0.89% __try_to_reclaim_swap
> >                                                                - 1.74% swap_reclaim_full_clusters
> >                                                                     1.74% queued_spin_lock_slowpath
> >                                                       - 10.88% try_to_unmap_flush_dirty
> >                                                          - 10.87% arch_tlbbatch_flush
> >                                                             - 10.85% on_each_cpu_cond_mask
> >                                                                  smp_call_function_many_cond
> >                                                       + 7.45% pageout
> >                                                       + 2.71% try_to_unmap_flush
> >                                                       + 1.90% try_to_unmap
> >                                                       + 0.78% folio_referenced
> >                      - 9.41% cluster_swap_free_nr
> >                         - 9.39% free_swap_slot
> >                            - 9.35% swapcache_free_entries
> >                                 8.40% queued_spin_lock_slowpath
> >                                 0.93% swap_entry_range_free
> >                      - 3.61% swap_read_folio_bdev_sync
> >                         - 3.55% submit_bio_wait
> >                            - 3.51% submit_bio_noacct_nocheck
> >                               + 3.46% __submit_bio
> >                   + 7.71% do_pte_missing
> >                   + 0.61% wp_page_copy
> >
> > The queued_spin_lock_slowpath above is the si->lock, and there are
> > multiple users of it so the total overhead is higher than shown.
> >
> > After:
> > -   75.05%     0.43%  cc1              [kernel.kallsyms]  [k] asm_exc_page_fault
> >    - 74.62% asm_exc_page_fault
> >       - 74.36% exc_page_fault
> >          - 74.34% do_user_addr_fault
> >             - 74.10% handle_mm_fault
> >                - 73.96% __handle_mm_fault
> >                   - 67.55% do_swap_page
> >                      - 45.92% alloc_swap_folio
> >                         - 45.03% mem_cgroup_swapin_charge_folio
> >                            - 44.58% charge_memcg
> >                               - 44.44% try_charge_memcg
> >                                  - 44.12% try_to_free_mem_cgroup_pages
> >                                     - do_try_to_free_pages
> >                                        - 44.10% shrink_node
> >                                           - 43.86% shrink_lruvec
> >                                              - 41.92% try_to_shrink_lruvec
> >                                                 - 40.67% evict_folios
> >                                                    - 37.12% shrink_folio_list
> >                                                       - 20.88% pageout
> >                                                          + 20.02% swap_writepage
> >                                                          + 0.72% shmem_writepage
> >                                                       - 4.08% add_to_swap
> >                                                          - 2.48% folio_alloc_swap
> >                                                             - 2.12% __mem_cgroup_try_charge_swap
> >                                                                - 1.47% swap_cgroup_record
> >                                                                   + 1.32% _raw_spin_lock_irqsave
> >                                                          - 1.56% add_to_swap_cache
> >                                                             - 1.04% xas_store
> >                                                                + 1.01% workingset_update_node
> >                                                       + 3.97% try_to_unmap_flush_dirty
> >                                                       + 3.51% folio_referenced
> >                                                       + 2.24% __remove_mapping
> >                                                       + 1.16% try_to_unmap
> >                                                       + 0.52% try_to_unmap_flush
> >                                                      2.50% queued_spin_lock_slowpath
> >                                                      0.79% scan_folios
> >                                                 + 1.20% try_to_inc_max_seq
> >                                              + 1.92% lru_add_drain
> >                         + 0.73% vma_alloc_folio_noprof
> >                      - 9.81% swap_read_folio_bdev_sync
> >                         - 9.61% submit_bio_wait
> >                            + 9.49% submit_bio_noacct_nocheck
> >                      - 8.06% cluster_swap_free_nr
> >                         - 8.02% swap_entry_range_free
> >                            + 3.92% __mem_cgroup_uncharge_swap
> >                            + 2.90% zram_slot_free_notify
> >                              0.58% clear_shadow_from_swap_cache
> >                      - 1.32% __folio_batch_add_and_move
> >                         - 1.30% folio_batch_move_lru
> >                            + 1.10% folio_lruvec_lock_irqsave
>
> Thanks for the data.
>
> It seems that the cycles shift from spinning to memory compression.
> That is expected.
>
> > spin_lock usage is much lower.
> >
> > I prefer the perf lock output as it shows the exact time and user of locks.
>
> perf cycles data is more complete.  You can find which part becomes the
> new hot spot.
>
> >>
> >> > Following the new allocator design, many operations don't need to touch
> >> > si->lock at all. We only need to take si->lock when doing operations
> >> > across multiple clusters (e.g. changing the cluster list); other
> >> > operations only need to take ci->lock. So ideally the allocator should
> >> > always take ci->lock first, then, if needed, take si->lock. But for
> >> > historical reasons, ci->lock is used inside si->lock by design,
> >> > causing lock inversion if we simply try to acquire si->lock after
> >> > acquiring ci->lock.
> >> >
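
The intended lock ordering described above can be illustrated with a toy
user-space model. This is only a sketch under stated assumptions, not the
kernel implementation: the classes Cluster and SwapDevice, their fields and
alloc_slot() are hypothetical names, with Cluster.lock standing in for
ci->lock and SwapDevice.lock standing in for si->lock:

```python
import threading


class Cluster:
    """Toy stand-in for a swap cluster; `lock` plays the role of ci->lock."""

    def __init__(self, index: int, nr_slots: int):
        self.index = index
        self.lock = threading.Lock()
        self.free_slots = list(range(nr_slots))


class SwapDevice:
    """Toy stand-in for a swap device; `lock` plays the role of si->lock."""

    def __init__(self, nr_clusters: int, slots_per_cluster: int):
        self.lock = threading.Lock()
        self.clusters = [Cluster(i, slots_per_cluster) for i in range(nr_clusters)]
        # Cross-cluster state: which clusters still have free slots.
        self.nonfull = set(range(nr_clusters))
        self.full = set()

    def alloc_slot(self, cluster_index: int):
        """Allocate one slot: cluster lock first, device lock only if needed."""
        ci = self.clusters[cluster_index]
        with ci.lock:  # fine-grained lock, always taken first
            if not ci.free_slots:
                return None
            slot = ci.free_slots.pop(0)
            if not ci.free_slots:
                # The cluster just became full, so the cluster lists change.
                # Only this cross-cluster update takes the device-wide lock,
                # nested inside the cluster lock.
                with self.lock:
                    self.nonfull.discard(ci.index)
                    self.full.add(ci.index)
            return (ci.index, slot)


dev = SwapDevice(nr_clusters=2, slots_per_cluster=2)
print(dev.alloc_slot(0), dev.alloc_slot(0), dev.alloc_slot(0))
print(sorted(dev.full), sorted(dev.nonfull))
```

In this model the common allocation path touches only the per-cluster lock,
and the device-wide lock is taken only when a cluster moves between lists,
always in the same order (cluster lock, then device lock).
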
> >> > This series audits all si->lock usage, simplifies legacy code, and
> >> > eliminates usage of si->lock as much as possible by introducing new
> >> > designs based on the new cluster allocator.
> >> >
> >> > The old HDD allocation code is removed, and the cluster allocator is
> >> > adapted with small changes for HDD usage. Testing looks OK.
> >>
> >> I think that it's a good idea to remove HDD allocation specific code.
> >> Can you check the performance of swapping to HDD?  However, I understand
> >> that many people have no HDD in hand.
> >
> > It's not hard to make the cluster allocator work well with HDD in theory,
> > see the commit "mm, swap: use a global swap cluster for non-rotation
> > device".
> > The testing is not very reliable though. I found HDD swap performance
> > is very unstable because of the IO pattern of HDD, so it's just a
> > best-effort attempt.
>
> Just to check whether the code change causes something too bad for HDD.  No
> measurable difference is good news.
>
> >> > This also removes the slot cache from the freeing path. The performance is
> >> > better without it, and this enables other cleanups and optimizations
> >> > as discussed before:
> >> > https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> >> >
> >> > After this series, lock contention on si->lock is nearly unobservable
> >> > with `perf lock` in the same test as above:
> >> >
> >> >   contended   total wait     max wait     avg wait         type   caller
> >> >   ... snip ...
> >> >          91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
> >> >   ... snip ...
> >> >          47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
> >> >   ... snip ...
> >> >          23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
> >> >   ... snip ...
> >> >          17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
> >> >   ... snip ...
> >> >
> >> > cluster_move and cluster_isolate_lock are basically the only users
> >> > of si->lock now. The performance gain is huge, with reduced LOC.
> >> >
> >> > Tests
> >> > ===
> >> >
> >> > Build kernel with defconfig on tmpfs with ZRAM as swap:
> >> > ---
> >> >
> >> > Running a test matrix that is scaled up progressively for an intuitive
> >> > result. The tests were run on top of tmpfs, using a memory cgroup to limit
> >> > memory, on a 48c96t system.
> >> >
> >> > 12 test runs for each case. It can be seen clearly that the performance
> >> > gain grows as the number of concurrent jobs goes up, and performance is
> >> > better even with low concurrency.
> >> >
> >> >    make -j<NR>     |   System Time (seconds)  |   Total Time (seconds)
> >> >  (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> >> >  With 4k pages only:
> >> >   6 / 192M / 3G    |    5258 /  5235 / -0.3%  |    1420 /  1414 / -0.3%
> >> >  12 / 256M / 4G    |    5518 /  5337 / -3.3%  |     758 /   742 / -2.1%
> >> >  24 / 384M / 5G    |    7091 /  5766 / -18.7% |     476 /   422 / -11.3%
> >> >  48 / 768M / 7G    |   11139 /  5831 / -47.7% |     330 /   221 / -33.0%
> >> >  96 / 1.5G / 10G   |   21303 / 11353 / -46.7% |     283 /   180 / -36.4%
> >> >  With 64k mTHP:
> >> >  24 / 512M / 5G    |    5104 /  4641 / -9.1%  |     376 /   358 / -4.8%
> >> >  48 /   1G / 7G    |    8693 /  4662 / -46.4% |     257 /   176 / -31.5%
> >> >  96 /   2G / 10G   |   17056 / 10263 / -39.8% |     234 /   169 / -27.8%
> >>
> >> How much is the swap in/out throughput before/after the change?
> >
> > This may not be too beneficial for typical throughput measurement:
> > - For example, doing the same test with brd only shows a ~20%
> > performance improvement, which is still a big gain though. I think the
> > si->lock spinlock wasting CPU cycles may affect CPU-sensitive things
> > like ZRAM even more.
>
> 20% is a good data point.  You don't need to guess: perf cycles profiling
> can show the hot spots.
>
> > - And simple benchmarks that just do sequential swap in/out from
> > multiple threads hardly stress the allocator.
> >
> > I haven't found a good benchmark to simulate random parallel IOs on
> > SWAP yet; I can write one later.
>
> I have used the anon-w-rand test case of vm-scalability to simulate random
> parallel swap-out.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-rand
>
> > Benchmarks closer to the real world, like the kernel build test or
> > mysql/sysbench, all showed great improvement.
>
> Yes.  Real workloads are good.  We can use micro-benchmarks to find
> performance limits, for example, the maximum possible throughput.
>
> >>
> >> When I worked on swap in/out performance before, the hot spot shifts from
> >> swap related code to LRU lock and zone lock.  Things may change a lot
> >> now.
> >>
> >> If zram is used as swap device, the hot spot may become
> >> compression/decompression after solving the swap lock contention.  To
> >> stress swap subsystem further, we may use a ram disk as swap.
> >> Previously, we have used a simulated pmem device (backed by DRAM).  That
> >> can be setup as in,
> >>
> >> https://pmem.io/blog/2016/02/how-to-emulate-persistent-memory/
> >>
> >> After creating the raw block device: /dev/pmem0, we can do
> >>
> >> $ mkswap /dev/pmem0
> >> $ swapon /dev/pmem0
> >>
> >> Can you use something similar if necessary?
> >
> > I used to test with brd, as described above,
>
> brd allocates memory while running; pmem avoids that.  A perf profile is
> your friend for root-causing possible issues.
>
> > I think using ZRAM with
> > tests simulating real workloads is more useful.
>
> Yes.  And, as I said before, micro-benchmarks have their own value.

Hi Ying,

Thank you very much for the suggestion. I didn't mean that I'm against
micro-benchmarks in any way; a lot of effort went into the other
tests, so I skipped that part for V1.

Since you mentioned vm-scalability, I think it is definitely a good
idea to include that test with the pmem simulation.

There are still some bottlenecks in SWAP besides compression and page
fault / TLB handling, mostly the cgroup lock and the list_lru locks. I
have some ideas for optimizing these too, which could be the next steps.

> > And I did include a Sequential SWAP test, the result is looking OK (no
> > regression, minor to none improvement).
>
> Good.  At least we have no regression here.
>
> --
> Best Regards,
> Huang, Ying



end of thread, other threads:[~2024-10-24  3:52 UTC | newest]

Thread overview: 21+ messages
2024-10-22 19:24 [PATCH 00/13] mm, swap: rework of swap allocator locks Kairui Song
2024-10-22 19:24 ` [PATCH 01/13] mm, swap: minor clean up for swap entry allocation Kairui Song
2024-10-22 19:24 ` [PATCH 02/13] mm, swap: fold swap_info_get_cont in the only caller Kairui Song
2024-10-22 19:24 ` [PATCH 03/13] mm, swap: remove old allocation path for HDD Kairui Song
2024-10-22 19:24 ` [PATCH 04/13] mm, swap: use cluster lock " Kairui Song
2024-10-22 19:24 ` [PATCH 05/13] mm, swap: clean up device availability check Kairui Song
2024-10-22 19:24 ` [PATCH 06/13] mm, swap: clean up plist removal and adding Kairui Song
2024-10-22 19:24 ` [PATCH 07/13] mm, swap: hold a reference of si during scan and clean up flags Kairui Song
2024-10-22 19:24 ` [PATCH 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
2024-10-22 19:24 ` [PATCH 09/13] mm, swap: reduce contention on device lock Kairui Song
2024-10-22 19:24 ` [PATCH 10/13] mm, swap: simplify percpu cluster updating Kairui Song
2024-10-22 19:24 ` [PATCH 11/13] mm, swap: introduce a helper for retrieving cluster from offset Kairui Song
2024-10-22 19:24 ` [PATCH 12/13] mm, swap: use a global swap cluster for non-rotation device Kairui Song
2024-10-22 19:37 ` [PATCH 13/13] mm, swap_slots: remove slot cache for freeing path Kairui Song
2024-10-23  2:24 ` [PATCH 00/13] mm, swap: rework of swap allocator locks Huang, Ying
2024-10-23 18:01   ` Kairui Song
2024-10-24  3:04     ` Huang, Ying
2024-10-24  3:51       ` Kairui Song
2024-10-23 10:27 ` Andrew Morton
2024-10-23 17:56   ` Kairui Song
2024-10-23 17:59 ` Yosry Ahmed
