* [PATCH 0/7] mm, swap: remove swap slot cache
@ 2025-02-14 17:57 Kairui Song
  2025-02-14 17:57 ` [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache Kairui Song
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Slot cache was initially introduced by commit 67afa38e012e ("mm/swap:
add cache for swap slots allocation") to reduce the lock contention
of si->lock.

Previous series "mm, swap: rework of swap allocator locks" [1] removed
the swap slot cache from the freeing path, as the freeing path no longer
touches si->lock in most cases. The allocation path also has little to
no contention on si->lock since that series, but the slot cache still
helps to reduce other overheads, like the counters and the plist.

This series removes the slot cache from the allocation path too, by
using the cluster as the allocation fast path, and also reduces other
overheads.

Now the slot cache is completely gone and the code is much simplified,
with no obvious feature or performance change; related workarounds are
cleaned up as well. This should also avoid other potential issues, e.g.
the long pinning of swap slots: the slot cache pins swap slots with
HAS_CACHE, causing reclaim or allocation to fail to use these slots
during scanning.

The only behavior change is the swap device allocation rotation
mechanism, as explained in the patch "mm, swap: use percpu cluster
as allocation fast path".

Test results are looking good after deleting the swap slot cache:

- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`,
12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs),
16 test runs for each case, measuring the total throughput:

                      Before (KB/s) (stdev)  After (KB/s) (stdev)
Random (4K):          424907.60 (24410.78)   414745.92  (34554.78)
Random (64K):         163308.82 (11635.72)   167314.50  (18434.99)
Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16)

- Build the Linux kernel with make -j96, using 4K folios with a 1.5G memory
cgroup limit and 64K folios with a 2G memory cgroup limit, on top of tmpfs,
12 test runs, measuring the system time:

                  Before (s) (stdev)  After (s) (stdev)
make -j96 (4K):   6445.69 (61.95)     6408.80 (69.46)
make -j96 (64K):  6841.71 (409.04)    6437.99 (435.55)

The performance is unchanged, slightly better in some cases.

[1] https://lore.kernel.org/linux-mm/20250113175732.48099-1-ryncsn@gmail.com/

Kairui Song (7):
  mm, swap: avoid reclaiming irrelevant swap cache
  mm, swap: drop the flag TTRS_DIRECT
  mm, swap: avoid redundant swap device pinning
  mm, swap: don't update the counter up-front
  mm, swap: use percpu cluster as allocation fast path
  mm, swap: remove swap slot cache
  mm, swap: simplify folio swap allocation

 include/linux/swap.h       |  21 +--
 include/linux/swap_slots.h |  28 ----
 mm/Makefile                |   2 +-
 mm/shmem.c                 |  21 +--
 mm/swap.h                  |   6 -
 mm/swap_slots.c            | 295 ----------------------------------
 mm/swap_state.c            |  79 ++--------
 mm/swapfile.c              | 315 ++++++++++++++++++-------------------
 mm/vmscan.c                |  16 +-
 mm/zswap.c                 |   6 +
 10 files changed, 196 insertions(+), 593 deletions(-)
 delete mode 100644 include/linux/swap_slots.h
 delete mode 100644 mm/swap_slots.c

-- 
2.48.1




* [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
@ 2025-02-14 17:57 ` Kairui Song
  2025-02-19  2:11   ` Baoquan He
  2025-02-14 17:57 ` [PATCH 2/7] mm, swap: drop the flag TTRS_DIRECT Kairui Song
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The swap allocator does swap cache reclaim to recycle HAS_CACHE slots
for allocation. It initiates the reclaim from the offset to be reclaimed
and looks up the corresponding folio. The lookup process is lockless, so
it's possible that the folio is removed from the swap cache and given a
different swap entry before the reclaim locks the folio. If that
happens, the reclaim ends up reclaiming an irrelevant folio and
returning a wrong return value.

This shouldn't cause any problem with correctness or stability, but it
is indeed confusing and unexpected, and will increase fragmentation and
decrease performance.

Fix this by checking whether the folio is still pointing to the offset
the allocator wants to reclaim before reclaiming it.
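
To make the window concrete, the interleaving being closed is roughly
the following (an illustration only, not a hunk from this patch):

    CPU A (reclaim)                      CPU B
    filemap_get_folio(offset)
                                         folio removed from swap cache,
                                         reused for a different swap entry
    folio_trylock(folio)
    folio->swap no longer covers 'offset' -> redo the lookup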

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 34baefb000b5..c77ffee4af86 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -210,6 +210,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	int ret, nr_pages;
 	bool need_reclaim;
 
+again:
 	folio = filemap_get_folio(address_space, swap_cache_index(entry));
 	if (IS_ERR(folio))
 		return 0;
@@ -227,8 +228,16 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	if (!folio_trylock(folio))
 		goto out;
 
-	/* offset could point to the middle of a large folio */
+	/*
+	 * Offset could point to the middle of a large folio, or folio
+	 * may no longer point to the expected offset before it's locked.
+	 */
 	entry = folio->swap;
+	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
+		folio_unlock(folio);
+		folio_put(folio);
+		goto again;
+	}
 	offset = swp_offset(entry);
 
 	need_reclaim = ((flags & TTRS_ANYWAY) ||
-- 
2.48.1




* [PATCH 2/7] mm, swap: drop the flag TTRS_DIRECT
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
  2025-02-14 17:57 ` [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache Kairui Song
@ 2025-02-14 17:57 ` Kairui Song
  2025-02-19  2:42   ` Baoquan He
  2025-02-14 17:57 ` [PATCH 3/7] mm, swap: avoid redundant swap device pinning Kairui Song
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This flag existed temporarily to allow the allocator to bypass the slot
cache during freeing, so reclaiming one slot would free the slot
immediately.

But slot cache usage on the freeing path has already been removed, so
this flag no longer has any effect.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index c77ffee4af86..449e388a6fec 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -158,8 +158,6 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
 #define TTRS_UNMAPPED		0x2
 /* Reclaim the swap entry if swap is getting full */
 #define TTRS_FULL		0x4
-/* Reclaim directly, bypass the slot cache and don't touch device lock */
-#define TTRS_DIRECT		0x8
 
 static bool swap_only_has_cache(struct swap_info_struct *si,
 			      unsigned long offset, int nr_pages)
@@ -257,23 +255,8 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	if (!need_reclaim)
 		goto out_unlock;
 
-	if (!(flags & TTRS_DIRECT)) {
-		/* Free through slot cache */
-		delete_from_swap_cache(folio);
-		folio_set_dirty(folio);
-		ret = nr_pages;
-		goto out_unlock;
-	}
-
-	xa_lock_irq(&address_space->i_pages);
-	__delete_from_swap_cache(folio, entry, NULL);
-	xa_unlock_irq(&address_space->i_pages);
-	folio_ref_sub(folio, nr_pages);
+	delete_from_swap_cache(folio);
 	folio_set_dirty(folio);
-
-	ci = lock_cluster(si, offset);
-	swap_entry_range_free(si, ci, entry, nr_pages);
-	unlock_cluster(ci);
 	ret = nr_pages;
 out_unlock:
 	folio_unlock(folio);
@@ -707,7 +690,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 			offset++;
 			break;
 		case SWAP_HAS_CACHE:
-			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
+			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
 			if (nr_reclaim > 0)
 				offset += nr_reclaim;
 			else
@@ -860,7 +843,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
 				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
-								   TTRS_ANYWAY | TTRS_DIRECT);
+								   TTRS_ANYWAY);
 				spin_lock(&ci->lock);
 				if (nr_reclaim) {
 					offset += abs(nr_reclaim);
-- 
2.48.1




* [PATCH 3/7] mm, swap: avoid redundant swap device pinning
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
  2025-02-14 17:57 ` [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache Kairui Song
  2025-02-14 17:57 ` [PATCH 2/7] mm, swap: drop the flag TTRS_DIRECT Kairui Song
@ 2025-02-14 17:57 ` Kairui Song
  2025-02-19  3:35   ` Baoquan He
  2025-02-14 17:57 ` [PATCH 4/7] mm, swap: don't update the counter up-front Kairui Song
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

There are only two callers of __read_swap_cache_async that do not hold
a swap device reference, so make them hold a reference instead, and drop
the get/put_swap_device calls in __read_swap_cache_async. This should
slightly reduce the swap-in overhead during page faults.
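
The resulting calling convention, summarized from the
read_swap_cache_async() hunk below (a summary only, not an extra change):

    si = get_swap_device(entry);
    if (!si)
        return NULL;
    ...
    folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
                                    &page_allocated, false);
    ...
    put_swap_device(si);
    return folio;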

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 14 ++++++++------
 mm/zswap.c      |  6 ++++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index a54b035d6a6c..50840a2887a5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -426,17 +426,13 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists)
 {
-	struct swap_info_struct *si;
+	struct swap_info_struct *si = swp_swap_info(entry);
 	struct folio *folio;
 	struct folio *new_folio = NULL;
 	struct folio *result = NULL;
 	void *shadow = NULL;
 
 	*new_page_allocated = false;
-	si = get_swap_device(entry);
-	if (!si)
-		return NULL;
-
 	for (;;) {
 		int err;
 		/*
@@ -532,7 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	put_swap_folio(new_folio, entry);
 	folio_unlock(new_folio);
 put_and_return:
-	put_swap_device(si);
 	if (!(*new_page_allocated) && new_folio)
 		folio_put(new_folio);
 	return result;
@@ -552,11 +547,16 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug)
 {
+	struct swap_info_struct *si;
 	bool page_allocated;
 	struct mempolicy *mpol;
 	pgoff_t ilx;
 	struct folio *folio;
 
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
+
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
@@ -564,6 +564,8 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	if (page_allocated)
 		swap_read_folio(folio, plug);
+
+	put_swap_device(si);
 	return folio;
 }
 
diff --git a/mm/zswap.c b/mm/zswap.c
index ac9d299e7d0c..83dfa1f9e689 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1051,14 +1051,20 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct folio *folio;
 	struct mempolicy *mpol;
 	bool folio_was_allocated;
+	struct swap_info_struct *si;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
 
 	/* try to allocate swap cache folio */
+	si = get_swap_device(swpentry);
+	if (!si)
+		return -EEXIST;
+
 	mpol = get_task_policy(current);
 	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+	put_swap_device(si);
 	if (!folio)
 		return -ENOMEM;
 
-- 
2.48.1




* [PATCH 4/7] mm, swap: don't update the counter up-front
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
                   ` (2 preceding siblings ...)
  2025-02-14 17:57 ` [PATCH 3/7] mm, swap: avoid redundant swap device pinning Kairui Song
@ 2025-02-14 17:57 ` Kairui Song
  2025-02-14 17:57 ` [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path Kairui Song
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The counter update before allocation was useful to avoid an unnecessary
scan when the device is full: allocation could abort early if the
counter indicated the device was full. But that is an uncommon case, and
scanning a full device is now very fast, so the up-front update is no
longer helpful.

Remove it and simplify the slot allocation logic.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 449e388a6fec..ae3bd0a862fc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1208,22 +1208,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 	int order = swap_entry_order(entry_order);
 	unsigned long size = 1 << order;
 	struct swap_info_struct *si, *next;
-	long avail_pgs;
 	int n_ret = 0;
 	int node;
 
 	spin_lock(&swap_avail_lock);
-
-	avail_pgs = atomic_long_read(&nr_swap_pages) / size;
-	if (avail_pgs <= 0) {
-		spin_unlock(&swap_avail_lock);
-		goto noswap;
-	}
-
-	n_goal = min3((long)n_goal, (long)SWAP_BATCH, avail_pgs);
-
-	atomic_long_sub(n_goal * size, &nr_swap_pages);
-
 start_over:
 	node = numa_node_id();
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
@@ -1257,10 +1245,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 	spin_unlock(&swap_avail_lock);
 
 check_out:
-	if (n_ret < n_goal)
-		atomic_long_add((long)(n_goal - n_ret) * size,
-				&nr_swap_pages);
-noswap:
+	atomic_long_sub(n_ret * size, &nr_swap_pages);
+
 	return n_ret;
 }
 
-- 
2.48.1




* [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
                   ` (3 preceding siblings ...)
  2025-02-14 17:57 ` [PATCH 4/7] mm, swap: don't update the counter up-front Kairui Song
@ 2025-02-14 17:57 ` Kairui Song
  2025-02-19  7:53   ` Baoquan He
  2025-02-14 17:57 ` [PATCH 6/7] mm, swap: remove swap slot cache Kairui Song
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The current allocation workflow first traverses the plist with a global
lock held, and after choosing a device, it uses the percpu cluster on
that swap device. This commit moves the percpu cluster variable out of
individual swap devices, making it a global percpu variable that is used
directly for allocation as a fast path.

The global percpu cluster variable will never point to an HDD device,
and allocation on HDD devices is still globally serialized.

This improves the allocator performance and prepares for removal of the
slot cache in later commits. There shouldn't be much observable behavior
change, except one thing: this changes how swap device allocation
rotation works.

Currently, each allocation rotates the plist, and because of the slot
cache (64 entries), swap devices of the same priority are rotated for
every 64 entries consumed. High order allocations are different: they
bypass the slot cache, so the swap device is rotated for every 16K, 32K,
or up to 2M allocation.

The rotation rule was never clearly defined or documented; it has been
changed several times without being mentioned either.

After this commit, once the slot cache is gone in later commits, swap
device rotation will happen for every consumed cluster. Ideally, non-HDD
devices will be rotated once 2M of space has been consumed for each
order, which seems reasonable. HDD devices are rotated for every
allocation regardless of the allocation order, which should be OK and
trivial.
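
For a concrete comparison, assuming 4K base pages and 2M clusters (as on
x86_64):

    Before: order-0 rotates after 64 slots (slot cache size)  = 256K
            higher orders rotate after every allocation       = 16K .. 2M
    After:  non-HDD rotates after one cluster, for any order  = 2M
            HDD rotates after every allocation, for any order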

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  11 ++--
 mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
 2 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe91c293636..a8d84f22357e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -284,12 +284,10 @@ enum swap_cluster_flags {
 #endif
 
 /*
- * We assign a cluster to each CPU, so each CPU can allocate swap entry from
- * its own cluster and swapout sequentially. The purpose is to optimize swapout
- * throughput.
+ * We keep using same cluster for rotating device so swapout will be sequential.
+ * The purpose is to optimize swapout throughput on rotating device.
  */
-struct percpu_cluster {
-	local_lock_t lock; /* Protect the percpu_cluster above */
+struct swap_sequential_cluster {
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -315,8 +313,7 @@ struct swap_info_struct {
 	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
-	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
-	struct percpu_cluster *global_cluster; /* Use one global cluster for rotating device */
+	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
 	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ae3bd0a862fc..791cd7ed5bdf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
+struct percpu_swap_cluster {
+	struct swap_info_struct *si;
+	unsigned long offset[SWAP_NR_ORDERS];
+	local_lock_t lock;
+};
+
+static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
+	.si = NULL,
+	.offset = { SWAP_ENTRY_INVALID },
+	.lock = INIT_LOCAL_LOCK(),
+};
+
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
 	if (type >= MAX_SWAPFILES)
@@ -548,7 +560,7 @@ static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 		/*
 		 * Delete the cluster from list to prepare for discard, but keep
-		 * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster
+		 * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be
 		 * pointing to it, or ran into by relocate_cluster.
 		 */
 		list_del(&ci->list);
@@ -815,10 +827,12 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	if (si->flags & SWP_SOLIDSTATE)
-		__this_cpu_write(si->percpu_cluster->next[order], next);
-	else
+	if (si->flags & SWP_SOLIDSTATE) {
+		__this_cpu_write(percpu_swap_cluster.si, si);
+		__this_cpu_write(percpu_swap_cluster.offset[order], next);
+	} else {
 		si->global_cluster->next[order] = next;
+	}
 	return found;
 }
 
@@ -869,9 +883,8 @@ static void swap_reclaim_work(struct work_struct *work)
 }
 
 /*
- * Try to get swap entries with specified order from current cpu's swap entry
- * pool (a cluster). This might involve allocating a new cluster for current CPU
- * too.
+ * Try to allocate swap entries with specified order and try set a new
+ * cluster for current CPU too.
  */
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
@@ -879,18 +892,12 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-	if (si->flags & SWP_SOLIDSTATE) {
-		/* Fast path using per CPU cluster */
-		local_lock(&si->percpu_cluster->lock);
-		offset = __this_cpu_read(si->percpu_cluster->next[order]);
-	} else {
+	if (!(si->flags & SWP_SOLIDSTATE)) {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
 		offset = si->global_cluster->next[order];
-	}
-
-	if (offset) {
 		ci = lock_cluster(si, offset);
+
 		/* Cluster could have been used by another order */
 		if (cluster_is_usable(ci, order)) {
 			if (cluster_is_empty(ci))
@@ -980,9 +987,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		}
 	}
 done:
-	if (si->flags & SWP_SOLIDSTATE)
-		local_unlock(&si->percpu_cluster->lock);
-	else
+	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
 	return found;
 }
@@ -1203,6 +1208,41 @@ static bool get_swap_device_info(struct swap_info_struct *si)
 	return true;
 }
 
+/*
+ * Fast path try to get swap entries with specified order from current
+ * CPU's swap entry pool (a cluster).
+ */
+static int swap_alloc_fast(swp_entry_t entries[],
+			   unsigned char usage,
+			   int order, int n_goal)
+{
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si;
+	unsigned int offset, found;
+	int n_ret = 0;
+
+	n_goal = min(n_goal, SWAP_BATCH);
+
+	si = __this_cpu_read(percpu_swap_cluster.si);
+	offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
+	if (!si || !offset || !get_swap_device_info(si))
+		return 0;
+
+	while (offset) {
+		ci = lock_cluster(si, offset);
+		found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
+		if (!found)
+			break;
+		entries[n_ret++] = swp_entry(si->type, found);
+		if (n_ret == n_goal)
+			break;
+		offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
+	}
+
+	put_swap_device(si);
+	return n_ret;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 	int n_ret = 0;
 	int node;
 
+	/* Fast path using percpu cluster */
+	local_lock(&percpu_swap_cluster.lock);
+	n_ret = swap_alloc_fast(swp_entries,
+				SWAP_HAS_CACHE,
+				order, n_goal);
+	if (n_ret == n_goal)
+		goto out;
+
+	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
+	/* Rotate the device and switch to a new cluster */
 	spin_lock(&swap_avail_lock);
 start_over:
 	node = numa_node_id();
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
-		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					n_goal, swp_entries, order);
+			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
+					swp_entries + n_ret, order);
 			put_swap_device(si);
 			if (n_ret || size > 1)
-				goto check_out;
+				goto out;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1241,12 +1290,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		if (plist_node_empty(&next->avail_lists[node]))
 			goto start_over;
 	}
-
 	spin_unlock(&swap_avail_lock);
-
-check_out:
+out:
+	local_unlock(&percpu_swap_cluster.lock);
 	atomic_long_sub(n_ret * size, &nr_swap_pages);
-
 	return n_ret;
 }
 
@@ -2733,8 +2780,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -3133,7 +3178,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
 	unsigned long i, j, idx;
-	int cpu, err = -ENOMEM;
+	int err = -ENOMEM;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
@@ -3142,20 +3187,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (si->flags & SWP_SOLIDSTATE) {
-		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-		if (!si->percpu_cluster)
-			goto err_free;
-
-		for_each_possible_cpu(cpu) {
-			struct percpu_cluster *cluster;
-
-			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
-			for (i = 0; i < SWAP_NR_ORDERS; i++)
-				cluster->next[i] = SWAP_ENTRY_INVALID;
-			local_lock_init(&cluster->lock);
-		}
-	} else {
+	if (!(si->flags & SWP_SOLIDSTATE)) {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 				     GFP_KERNEL);
 		if (!si->global_cluster)
@@ -3432,8 +3464,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
-	free_percpu(si->percpu_cluster);
-	si->percpu_cluster = NULL;
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.48.1




* [PATCH 6/7] mm, swap: remove swap slot cache
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
                   ` (4 preceding siblings ...)
  2025-02-14 17:57 ` [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path Kairui Song
@ 2025-02-14 17:57 ` Kairui Song
  2025-02-15 16:23   ` kernel test robot
  2025-02-20  7:55   ` Baoquan He
  2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
  2025-02-15 10:27 ` [PATCH 0/7] mm, swap: remove swap slot cache Baoquan He
  7 siblings, 2 replies; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The slot cache is no longer needed now; remove it and all related code.

- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`,
12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs),
16 test runs for each case, measuring the total throughput:

                      Before (KB/s) (stdev)  After (KB/s) (stdev)
Random (4K):          424907.60 (24410.78)   414745.92  (34554.78)
Random (64K):         163308.82 (11635.72)   167314.50  (18434.99)
Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16)

The performance changes are below noise level.

- Build the Linux kernel with make -j96, using 4K folios with a 1.5G memory
cgroup limit and 64K folios with a 2G memory cgroup limit, on top of tmpfs,
12 test runs, measuring the system time:

                  Before (s) (stdev)  After (s) (stdev)
make -j96 (4K):   6445.69 (61.95)     6408.80 (69.46)
make -j96 (64K):  6841.71 (409.04)    6437.99 (435.55)

Similar to the above, the 64K mTHP case showed a slight improvement.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h       |   2 -
 include/linux/swap_slots.h |  28 ----
 mm/Makefile                |   2 +-
 mm/swap_slots.c            | 295 -------------------------------------
 mm/swap_state.c            |   8 +-
 mm/swapfile.c              | 173 ++++++++--------------
 6 files changed, 64 insertions(+), 444 deletions(-)
 delete mode 100644 include/linux/swap_slots.h
 delete mode 100644 mm/swap_slots.c

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a8d84f22357e..456833705ea0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -465,7 +465,6 @@ void free_pages_and_swap_cache(struct encoded_page **, int);
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
 extern atomic_t nr_rotate_swap;
-extern bool has_usable_swap(void);
 
 /* Swap 50% full? Release swapcache more aggressively.. */
 static inline bool vm_swap_full(void)
@@ -489,7 +488,6 @@ extern void swap_shmem_alloc(swp_entry_t, int);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t entry, int nr);
 extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
deleted file mode 100644
index 840aec3523b2..000000000000
--- a/include/linux/swap_slots.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SWAP_SLOTS_H
-#define _LINUX_SWAP_SLOTS_H
-
-#include <linux/swap.h>
-#include <linux/spinlock.h>
-#include <linux/mutex.h>
-
-#define SWAP_SLOTS_CACHE_SIZE			SWAP_BATCH
-#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(5*SWAP_SLOTS_CACHE_SIZE)
-#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(2*SWAP_SLOTS_CACHE_SIZE)
-
-struct swap_slots_cache {
-	bool		lock_initialized;
-	struct mutex	alloc_lock; /* protects slots, nr, cur */
-	swp_entry_t	*slots;
-	int		nr;
-	int		cur;
-	int		n_ret;
-};
-
-void disable_swap_slots_cache_lock(void);
-void reenable_swap_slots_cache_unlock(void);
-void enable_swap_slots_cache(void);
-
-extern bool swap_slot_cache_enabled;
-
-#endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 53392d2af3a5..ea16e472b294 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
 
-obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_slots.o
+obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
deleted file mode 100644
index 9c7c171df7ba..000000000000
--- a/mm/swap_slots.c
+++ /dev/null
@@ -1,295 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Manage cache of swap slots to be used for and returned from
- * swap.
- *
- * Copyright(c) 2016 Intel Corporation.
- *
- * Author: Tim Chen <tim.c.chen@linux.intel.com>
- *
- * We allocate the swap slots from the global pool and put
- * it into local per cpu caches.  This has the advantage
- * of no needing to acquire the swap_info lock every time
- * we need a new slot.
- *
- * There is also opportunity to simply return the slot
- * to local caches without needing to acquire swap_info
- * lock.  We do not reuse the returned slots directly but
- * move them back to the global pool in a batch.  This
- * allows the slots to coalesce and reduce fragmentation.
- *
- * The swap entry allocated is marked with SWAP_HAS_CACHE
- * flag in map_count that prevents it from being allocated
- * again from the global pool.
- *
- * The swap slots cache is protected by a mutex instead of
- * a spin lock as when we search for slots with scan_swap_map,
- * we can possibly sleep.
- */
-
-#include <linux/swap_slots.h>
-#include <linux/cpu.h>
-#include <linux/cpumask.h>
-#include <linux/slab.h>
-#include <linux/vmalloc.h>
-#include <linux/mutex.h>
-#include <linux/mm.h>
-
-static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
-static bool	swap_slot_cache_active;
-bool	swap_slot_cache_enabled;
-static bool	swap_slot_cache_initialized;
-static DEFINE_MUTEX(swap_slots_cache_mutex);
-/* Serialize swap slots cache enable/disable operations */
-static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
-
-static void __drain_swap_slots_cache(void);
-
-#define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
-
-static void deactivate_swap_slots_cache(void)
-{
-	mutex_lock(&swap_slots_cache_mutex);
-	swap_slot_cache_active = false;
-	__drain_swap_slots_cache();
-	mutex_unlock(&swap_slots_cache_mutex);
-}
-
-static void reactivate_swap_slots_cache(void)
-{
-	mutex_lock(&swap_slots_cache_mutex);
-	swap_slot_cache_active = true;
-	mutex_unlock(&swap_slots_cache_mutex);
-}
-
-/* Must not be called with cpu hot plug lock */
-void disable_swap_slots_cache_lock(void)
-{
-	mutex_lock(&swap_slots_cache_enable_mutex);
-	swap_slot_cache_enabled = false;
-	if (swap_slot_cache_initialized) {
-		/* serialize with cpu hotplug operations */
-		cpus_read_lock();
-		__drain_swap_slots_cache();
-		cpus_read_unlock();
-	}
-}
-
-static void __reenable_swap_slots_cache(void)
-{
-	swap_slot_cache_enabled = has_usable_swap();
-}
-
-void reenable_swap_slots_cache_unlock(void)
-{
-	__reenable_swap_slots_cache();
-	mutex_unlock(&swap_slots_cache_enable_mutex);
-}
-
-static bool check_cache_active(void)
-{
-	long pages;
-
-	if (!swap_slot_cache_enabled)
-		return false;
-
-	pages = get_nr_swap_pages();
-	if (!swap_slot_cache_active) {
-		if (pages > num_online_cpus() *
-		    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE)
-			reactivate_swap_slots_cache();
-		goto out;
-	}
-
-	/* if global pool of slot caches too low, deactivate cache */
-	if (pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
-		deactivate_swap_slots_cache();
-out:
-	return swap_slot_cache_active;
-}
-
-static int alloc_swap_slot_cache(unsigned int cpu)
-{
-	struct swap_slots_cache *cache;
-	swp_entry_t *slots;
-
-	/*
-	 * Do allocation outside swap_slots_cache_mutex
-	 * as kvzalloc could trigger reclaim and folio_alloc_swap,
-	 * which can lock swap_slots_cache_mutex.
-	 */
-	slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t),
-			 GFP_KERNEL);
-	if (!slots)
-		return -ENOMEM;
-
-	mutex_lock(&swap_slots_cache_mutex);
-	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots) {
-		/* cache already allocated */
-		mutex_unlock(&swap_slots_cache_mutex);
-
-		kvfree(slots);
-
-		return 0;
-	}
-
-	if (!cache->lock_initialized) {
-		mutex_init(&cache->alloc_lock);
-		cache->lock_initialized = true;
-	}
-	cache->nr = 0;
-	cache->cur = 0;
-	cache->n_ret = 0;
-	/*
-	 * We initialized alloc_lock and free_lock earlier.  We use
-	 * !cache->slots or !cache->slots_ret to know if it is safe to acquire
-	 * the corresponding lock and use the cache.  Memory barrier below
-	 * ensures the assumption.
-	 */
-	mb();
-	cache->slots = slots;
-	mutex_unlock(&swap_slots_cache_mutex);
-	return 0;
-}
-
-static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots)
-{
-	struct swap_slots_cache *cache;
-
-	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots) {
-		mutex_lock(&cache->alloc_lock);
-		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
-		cache->cur = 0;
-		cache->nr = 0;
-		if (free_slots && cache->slots) {
-			kvfree(cache->slots);
-			cache->slots = NULL;
-		}
-		mutex_unlock(&cache->alloc_lock);
-	}
-}
-
-static void __drain_swap_slots_cache(void)
-{
-	unsigned int cpu;
-
-	/*
-	 * This function is called during
-	 *	1) swapoff, when we have to make sure no
-	 *	   left over slots are in cache when we remove
-	 *	   a swap device;
-	 *      2) disabling of swap slot cache, when we run low
-	 *	   on swap slots when allocating memory and need
-	 *	   to return swap slots to global pool.
-	 *
-	 * We cannot acquire cpu hot plug lock here as
-	 * this function can be invoked in the cpu
-	 * hot plug path:
-	 * cpu_up -> lock cpu_hotplug -> cpu hotplug state callback
-	 *   -> memory allocation -> direct reclaim -> folio_alloc_swap
-	 *   -> drain_swap_slots_cache
-	 *
-	 * Hence the loop over current online cpu below could miss cpu that
-	 * is being brought online but not yet marked as online.
-	 * That is okay as we do not schedule and run anything on a
-	 * cpu before it has been marked online. Hence, we will not
-	 * fill any swap slots in slots cache of such cpu.
-	 * There are no slots on such cpu that need to be drained.
-	 */
-	for_each_online_cpu(cpu)
-		drain_slots_cache_cpu(cpu, false);
-}
-
-static int free_slot_cache(unsigned int cpu)
-{
-	mutex_lock(&swap_slots_cache_mutex);
-	drain_slots_cache_cpu(cpu, true);
-	mutex_unlock(&swap_slots_cache_mutex);
-	return 0;
-}
-
-void enable_swap_slots_cache(void)
-{
-	mutex_lock(&swap_slots_cache_enable_mutex);
-	if (!swap_slot_cache_initialized) {
-		int ret;
-
-		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "swap_slots_cache",
-					alloc_swap_slot_cache, free_slot_cache);
-		if (WARN_ONCE(ret < 0, "Cache allocation failed (%s), operating "
-				       "without swap slots cache.\n", __func__))
-			goto out_unlock;
-
-		swap_slot_cache_initialized = true;
-	}
-
-	__reenable_swap_slots_cache();
-out_unlock:
-	mutex_unlock(&swap_slots_cache_enable_mutex);
-}
-
-/* called with swap slot cache's alloc lock held */
-static int refill_swap_slots_cache(struct swap_slots_cache *cache)
-{
-	if (!use_swap_slot_cache)
-		return 0;
-
-	cache->cur = 0;
-	if (swap_slot_cache_active)
-		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
-					   cache->slots, 0);
-
-	return cache->nr;
-}
-
-swp_entry_t folio_alloc_swap(struct folio *folio)
-{
-	swp_entry_t entry;
-	struct swap_slots_cache *cache;
-
-	entry.val = 0;
-
-	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP))
-			get_swap_pages(1, &entry, folio_order(folio));
-		goto out;
-	}
-
-	/*
-	 * Preemption is allowed here, because we may sleep
-	 * in refill_swap_slots_cache().  But it is safe, because
-	 * accesses to the per-CPU data structure are protected by the
-	 * mutex cache->alloc_lock.
-	 *
-	 * The alloc path here does not touch cache->slots_ret
-	 * so cache->free_lock is not taken.
-	 */
-	cache = raw_cpu_ptr(&swp_slots);
-
-	if (likely(check_cache_active() && cache->slots)) {
-		mutex_lock(&cache->alloc_lock);
-		if (cache->slots) {
-repeat:
-			if (cache->nr) {
-				entry = cache->slots[cache->cur];
-				cache->slots[cache->cur++].val = 0;
-				cache->nr--;
-			} else if (refill_swap_slots_cache(cache)) {
-				goto repeat;
-			}
-		}
-		mutex_unlock(&cache->alloc_lock);
-		if (entry.val)
-			goto out;
-	}
-
-	get_swap_pages(1, &entry, 0);
-out:
-	if (mem_cgroup_try_charge_swap(folio, entry)) {
-		put_swap_folio(folio, entry);
-		entry.val = 0;
-	}
-	return entry;
-}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 50840a2887a5..2b5744e211cd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -20,7 +20,6 @@
 #include <linux/blkdev.h>
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
-#include <linux/swap_slots.h>
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
@@ -447,13 +446,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 		/*
 		 * Just skip read ahead for unused swap slot.
-		 * During swap_off when swap_slot_cache is disabled,
-		 * we have to handle the race between putting
-		 * swap entry in swap cache and marking swap slot
-		 * as SWAP_HAS_CACHE.  That's done in later part of code or
-		 * else swap_off will be aborted if we return NULL.
 		 */
-		if (!swap_entry_swapped(si, entry) && swap_slot_cache_enabled)
+		if (!swap_entry_swapped(si, entry))
 			goto put_and_return;
 
 		/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 791cd7ed5bdf..66c8869ef346 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -37,7 +37,6 @@
 #include <linux/oom.h>
 #include <linux/swapfile.h>
 #include <linux/export.h>
-#include <linux/swap_slots.h>
 #include <linux/sort.h>
 #include <linux/completion.h>
 #include <linux/suspend.h>
@@ -892,6 +891,13 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
+	/*
+	 * Swapfile is not block device so unable
+	 * to allocate large entries.
+	 */
+	if (order && !(si->flags & SWP_BLKDEV))
+		return 0;
+
 	if (!(si->flags & SWP_SOLIDSTATE)) {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
@@ -1155,43 +1161,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	swap_usage_sub(si, nr_entries);
 }
 
-static int scan_swap_map_slots(struct swap_info_struct *si,
-			       unsigned char usage, int nr,
-			       swp_entry_t slots[], int order)
-{
-	unsigned int nr_pages = 1 << order;
-	int n_ret = 0;
-
-	if (order > 0) {
-		/*
-		 * Should not even be attempting large allocations when huge
-		 * page swap is disabled.  Warn and fail the allocation.
-		 */
-		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
-		    nr_pages > SWAPFILE_CLUSTER) {
-			VM_WARN_ON_ONCE(1);
-			return 0;
-		}
-
-		/*
-		 * Swapfile is not block device so unable
-		 * to allocate large entries.
-		 */
-		if (!(si->flags & SWP_BLKDEV))
-			return 0;
-	}
-
-	while (n_ret < nr) {
-		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
-
-		if (!offset)
-			break;
-		slots[n_ret++] = swp_entry(si->type, offset);
-	}
-
-	return n_ret;
-}
-
 static bool get_swap_device_info(struct swap_info_struct *si)
 {
 	if (!percpu_ref_tryget_live(&si->users))
@@ -1212,54 +1181,53 @@ static bool get_swap_device_info(struct swap_info_struct *si)
  * Fast path try to get swap entries with specified order from current
  * CPU's swap entry pool (a cluster).
  */
-static int swap_alloc_fast(swp_entry_t entries[],
+static int swap_alloc_fast(swp_entry_t *entry,
 			   unsigned char usage,
-			   int order, int n_goal)
+			   int order)
 {
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
-	unsigned int offset, found;
-	int n_ret = 0;
-
-	n_goal = min(n_goal, SWAP_BATCH);
+	unsigned int offset, found = SWAP_ENTRY_INVALID;
 
 	si = __this_cpu_read(percpu_swap_cluster.si);
 	offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
 	if (!si || !offset || !get_swap_device_info(si))
-		return 0;
+		return false;
 
-	while (offset) {
-		ci = lock_cluster(si, offset);
-		found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
-		if (!found)
-			break;
-		entries[n_ret++] = swp_entry(si->type, found);
-		if (n_ret == n_goal)
-			break;
-		offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
-	}
+	ci = lock_cluster(si, offset);
+	found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
+	if (found)
+		*entry = swp_entry(si->type, found);
 
 	put_swap_device(si);
-	return n_ret;
+	return !!found;
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
+swp_entry_t folio_alloc_swap(struct folio *folio)
 {
-	int order = swap_entry_order(entry_order);
-	unsigned long size = 1 << order;
+	unsigned int order = folio_order(folio);
+	unsigned int size = 1 << order;
 	struct swap_info_struct *si, *next;
-	int n_ret = 0;
+	swp_entry_t entry = {};
+	unsigned long offset;
 	int node;
 
+	if (order) {
+		/*
+		 * Should not even be attempting large allocations when huge
+		 * page swap is disabled. Warn and fail the allocation.
+		 */
+		if (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER) {
+			VM_WARN_ON_ONCE(1);
+			return entry;
+		}
+	}
+
 	/* Fast path using percpu cluster */
 	local_lock(&percpu_swap_cluster.lock);
-	n_ret = swap_alloc_fast(swp_entries,
-				SWAP_HAS_CACHE,
-				order, n_goal);
-	if (n_ret == n_goal)
-		goto out;
+	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
+		goto out_alloced;
 
-	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
 	/* Rotate the device and switch to a new cluster */
 	spin_lock(&swap_avail_lock);
 start_over:
@@ -1268,11 +1236,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
-					swp_entries + n_ret, order);
+			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
 			put_swap_device(si);
-			if (n_ret || size > 1)
-				goto out;
+			if (offset) {
+				entry = swp_entry(si->type, offset);
+				goto out_alloced;
+			}
+			if (order)
+				goto out_failed;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1291,10 +1262,20 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 			goto start_over;
 	}
 	spin_unlock(&swap_avail_lock);
-out:
+out_failed:
+	local_unlock(&percpu_swap_cluster.lock);
+	return entry;
+
+out_alloced:
 	local_unlock(&percpu_swap_cluster.lock);
-	atomic_long_sub(n_ret * size, &nr_swap_pages);
-	return n_ret;
+	if (mem_cgroup_try_charge_swap(folio, entry)) {
+		put_swap_folio(folio, entry);
+		entry.val = 0;
+	} else {
+		atomic_long_sub(size, &nr_swap_pages);
+	}
+
+	return entry;
 }
 
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
@@ -1590,25 +1571,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster(ci);
 }
 
-void swapcache_free_entries(swp_entry_t *entries, int n)
-{
-	int i;
-	struct swap_cluster_info *ci;
-	struct swap_info_struct *si = NULL;
-
-	if (n <= 0)
-		return;
-
-	for (i = 0; i < n; ++i) {
-		si = _swap_info_get(entries[i]);
-		if (si) {
-			ci = lock_cluster(si, swp_offset(entries[i]));
-			swap_entry_range_free(si, ci, entries[i], 1);
-			unlock_cluster(ci);
-		}
-	}
-}
-
 int __swap_count(swp_entry_t entry)
 {
 	struct swap_info_struct *si = swp_swap_info(entry);
@@ -1849,6 +1811,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 swp_entry_t get_swap_page_of_type(int type)
 {
 	struct swap_info_struct *si = swap_type_to_swap_info(type);
+	unsigned long offset;
 	swp_entry_t entry = {0};
 
 	if (!si)
@@ -1856,8 +1819,13 @@ swp_entry_t get_swap_page_of_type(int type)
 
 	/* This is called for allocating swap entry, not cache */
 	if (get_swap_device_info(si)) {
-		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-			atomic_long_dec(&nr_swap_pages);
+		if (si->flags & SWP_WRITEOK) {
+			offset = cluster_alloc_swap_entry(si, 0, 1);
+			if (offset) {
+				entry = swp_entry(si->type, offset);
+				atomic_long_dec(&nr_swap_pages);
+			}
+		}
 		put_swap_device(si);
 	}
 fail:
@@ -2623,16 +2591,6 @@ static bool __has_usable_swap(void)
 	return !plist_head_empty(&swap_active_head);
 }
 
-bool has_usable_swap(void)
-{
-	bool ret;
-
-	spin_lock(&swap_lock);
-	ret = __has_usable_swap();
-	spin_unlock(&swap_lock);
-	return ret;
-}
-
 /*
  * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
  * see the updated flags, so there will be no more allocations.
@@ -2724,8 +2682,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	wait_for_allocation(p);
 
-	disable_swap_slots_cache_lock();
-
 	set_current_oom_origin();
 	err = try_to_unuse(p->type);
 	clear_current_oom_origin();
@@ -2733,12 +2689,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (err) {
 		/* re-insert swap space back into swap_list */
 		reinsert_swap_info(p);
-		reenable_swap_slots_cache_unlock();
 		goto out_dput;
 	}
 
-	reenable_swap_slots_cache_unlock();
-
 	/*
 	 * Wait for swap operations protected by get/put_swap_device()
 	 * to complete.  Because of synchronize_rcu() here, all swap
@@ -3487,8 +3440,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		putname(name);
 	if (inode)
 		inode_unlock(inode);
-	if (!error)
-		enable_swap_slots_cache();
 	return error;
 }
 
-- 
2.48.1




* [PATCH 7/7] mm, swap: simplify folio swap allocation
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
                   ` (5 preceding siblings ...)
  2025-02-14 17:57 ` [PATCH 6/7] mm, swap: remove swap slot cache Kairui Song
@ 2025-02-14 17:57 ` Kairui Song
  2025-02-14 20:13   ` Matthew Wilcox
                     ` (3 more replies)
  2025-02-15 10:27 ` [PATCH 0/7] mm, swap: remove swap slot cache Baoquan He
  7 siblings, 4 replies; 30+ messages in thread
From: Kairui Song @ 2025-02-14 17:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
	Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

With the slot cache gone, clean up the allocation helpers even more.
folio_alloc_swap will be the only entry point for allocating swap and
adding the folio to the swap cache (except for suspend), making it the
opposite of folio_free_swap.
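
After this patch, a writeout path allocates swap and inserts the folio
into the swap cache in a single call, roughly (a sketch only; 'fallback'
stands for whatever the caller does on failure, e.g. splitting the folio
or redirtying it):

    if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
        goto fallback;
    /* folio->swap is valid and the folio is in the swap cache */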

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   8 ++--
 mm/shmem.c           |  21 +++------
 mm/swap.h            |   6 ---
 mm/swap_state.c      |  57 ----------------------
 mm/swapfile.c        | 110 ++++++++++++++++++++++++++++---------------
 mm/vmscan.c          |  16 ++++++-
 6 files changed, 94 insertions(+), 124 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 456833705ea0..e799e965dac8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -478,7 +478,7 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-swp_entry_t folio_alloc_swap(struct folio *folio);
+bool folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
 bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
@@ -587,11 +587,9 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-static inline swp_entry_t folio_alloc_swap(struct folio *folio)
+static bool folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
 {
-	swp_entry_t entry;
-	entry.val = 0;
-	return entry;
+	return false;
 }
 
 static inline bool folio_free_swap(struct folio *folio)
diff --git a/mm/shmem.c b/mm/shmem.c
index b35ba250c53d..2aa206b52ff2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1546,7 +1546,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	swp_entry_t swap;
 	pgoff_t index;
 	int nr_pages;
 	bool split = false;
@@ -1628,14 +1627,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		folio_mark_uptodate(folio);
 	}
 
-	swap = folio_alloc_swap(folio);
-	if (!swap.val) {
-		if (nr_pages > 1)
-			goto try_split;
-
-		goto redirty;
-	}
-
 	/*
 	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
 	 * if it's not already there.  Do it now before the folio is
@@ -1648,20 +1639,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (list_empty(&info->swaplist))
 		list_add(&info->swaplist, &shmem_swaplist);
 
-	if (add_to_swap_cache(folio, swap,
-			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
-			NULL) == 0) {
+	if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
 		shmem_recalc_inode(inode, 0, nr_pages);
-		swap_shmem_alloc(swap, nr_pages);
-		shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
+		swap_shmem_alloc(folio->swap, nr_pages);
+		shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
 
 		mutex_unlock(&shmem_swaplist_mutex);
 		BUG_ON(folio_mapped(folio));
 		return swap_writepage(&folio->page, wbc);
 	}
 
+	list_del_init(&info->swaplist);
 	mutex_unlock(&shmem_swaplist_mutex);
-	put_swap_folio(folio, swap);
+	if (nr_pages > 1)
+		goto try_split;
 redirty:
 	folio_mark_dirty(folio);
 	if (wbc->for_reclaim)
diff --git a/mm/swap.h b/mm/swap.h
index ad2f121de970..0abb68091b4f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -50,7 +50,6 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
 }
 
 void show_swap_cache_info(void);
-bool add_to_swap(struct folio *folio);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
 int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 		      gfp_t gfp, void **shadowp);
@@ -163,11 +162,6 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 	return filemap_get_folio(mapping, index);
 }
 
-static inline bool add_to_swap(struct folio *folio)
-{
-	return false;
-}
-
 static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
 {
 	return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2b5744e211cd..68fd981b514f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -166,63 +166,6 @@ void __delete_from_swap_cache(struct folio *folio,
 	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
 }
 
-/**
- * add_to_swap - allocate swap space for a folio
- * @folio: folio we want to move to swap
- *
- * Allocate swap space for the folio and add the folio to the
- * swap cache.
- *
- * Context: Caller needs to hold the folio lock.
- * Return: Whether the folio was added to the swap cache.
- */
-bool add_to_swap(struct folio *folio)
-{
-	swp_entry_t entry;
-	int err;
-
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
-
-	entry = folio_alloc_swap(folio);
-	if (!entry.val)
-		return false;
-
-	/*
-	 * XArray node allocations from PF_MEMALLOC contexts could
-	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
-	 * stops emergency reserves from being allocated.
-	 *
-	 * TODO: this could cause a theoretical memory reclaim
-	 * deadlock in the swap out path.
-	 */
-	/*
-	 * Add it to the swap cache.
-	 */
-	err = add_to_swap_cache(folio, entry,
-			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
-	if (err)
-		goto fail;
-	/*
-	 * Normally the folio will be dirtied in unmap because its
-	 * pte should be dirty. A special case is MADV_FREE page. The
-	 * page's pte could have dirty bit cleared but the folio's
-	 * SwapBacked flag is still set because clearing the dirty bit
-	 * and SwapBacked flag has no lock protected. For such folio,
-	 * unmap will not set dirty bit for it, so folio reclaim will
-	 * not write the folio out. This can cause data corruption when
-	 * the folio is swapped in later. Always setting the dirty flag
-	 * for the folio solves the problem.
-	 */
-	folio_mark_dirty(folio);
-
-	return true;
-
-fail:
-	put_swap_folio(folio, entry);
-	return false;
-}
-
 /*
  * This must be called only on folios that have
  * been verified to be in the swap cache and locked.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 66c8869ef346..8449bd703bd8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1181,9 +1181,9 @@ static bool get_swap_device_info(struct swap_info_struct *si)
  * Fast path try to get swap entries with specified order from current
  * CPU's swap entry pool (a cluster).
  */
-static int swap_alloc_fast(swp_entry_t *entry,
-			   unsigned char usage,
-			   int order)
+static bool swap_alloc_fast(swp_entry_t *entry,
+			    unsigned char usage,
+			    int order)
 {
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
@@ -1203,47 +1203,31 @@ static int swap_alloc_fast(swp_entry_t *entry,
 	return !!found;
 }
 
-swp_entry_t folio_alloc_swap(struct folio *folio)
+/* Rotate the device and switch to a new cluster */
+static bool swap_alloc_rotate(swp_entry_t *entry,
+			      unsigned char usage,
+			      int order)
 {
-	unsigned int order = folio_order(folio);
-	unsigned int size = 1 << order;
-	struct swap_info_struct *si, *next;
-	swp_entry_t entry = {};
-	unsigned long offset;
 	int node;
+	unsigned long offset;
+	struct swap_info_struct *si, *next;
 
-	if (order) {
-		/*
-		 * Should not even be attempting large allocations when huge
-		 * page swap is disabled. Warn and fail the allocation.
-		 */
-		if (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER) {
-			VM_WARN_ON_ONCE(1);
-			return entry;
-		}
-	}
-
-	/* Fast path using percpu cluster */
-	local_lock(&percpu_swap_cluster.lock);
-	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
-		goto out_alloced;
-
-	/* Rotate the device and switch to a new cluster */
+	node = numa_node_id();
 	spin_lock(&swap_avail_lock);
 start_over:
-	node = numa_node_id();
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
+		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
 			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
 			put_swap_device(si);
 			if (offset) {
-				entry = swp_entry(si->type, offset);
-				goto out_alloced;
+				*entry = swp_entry(si->type, offset);
+				return true;
 			}
 			if (order)
-				goto out_failed;
+				return false;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1262,20 +1246,68 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 			goto start_over;
 	}
 	spin_unlock(&swap_avail_lock);
-out_failed:
+	return false;
+}
+
+/**
+ * folio_alloc_swap - allocate swap space for a folio
+ * @folio: folio we want to move to swap
+ * @gfp: gfp mask for shadow nodes
+ *
+ * Allocate swap space for the folio and add the folio to the
+ * swap cache.
+ *
+ * Context: Caller needs to hold the folio lock.
+ * Return: Whether the folio was added to the swap cache.
+ */
+bool folio_alloc_swap(struct folio *folio, gfp_t gfp)
+{
+	unsigned int order = folio_order(folio);
+	unsigned int size = 1 << order;
+	swp_entry_t entry = {};
+
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
+
+	/*
+	 * Should not even be attempting large allocations when huge
+	 * page swap is disabled. Warn and fail the allocation.
+	 */
+	if (order && (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER)) {
+		VM_WARN_ON_ONCE(1);
+		return false;
+	}
+
+	local_lock(&percpu_swap_cluster.lock);
+	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
+		goto out_alloced;
+	if (swap_alloc_rotate(&entry, SWAP_HAS_CACHE, order))
+		goto out_alloced;
 	local_unlock(&percpu_swap_cluster.lock);
-	return entry;
+	return false;
 
 out_alloced:
 	local_unlock(&percpu_swap_cluster.lock);
-	if (mem_cgroup_try_charge_swap(folio, entry)) {
-		put_swap_folio(folio, entry);
-		entry.val = 0;
-	} else {
-		atomic_long_sub(size, &nr_swap_pages);
-	}
+	if (mem_cgroup_try_charge_swap(folio, entry))
+		goto out_free;
 
-	return entry;
+	/*
+	 * XArray node allocations from PF_MEMALLOC contexts could
+	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
+	 * stops emergency reserves from being allocated.
+	 *
+	 * TODO: this could cause a theoretical memory reclaim
+	 * deadlock in the swap out path.
+	 */
+	if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
+		goto out_free;
+
+	atomic_long_sub(size, &nr_swap_pages);
+	return true;
+
+out_free:
+	put_swap_folio(folio, entry);
+	return false;
 }
 
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fcca38bc640f..71a6b597e469 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1289,7 +1289,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					    split_folio_to_list(folio, folio_list))
 						goto activate_locked;
 				}
-				if (!add_to_swap(folio)) {
+				if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN)) {
 					int __maybe_unused order = folio_order(folio);
 
 					if (!folio_test_large(folio))
@@ -1305,9 +1305,21 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					}
 #endif
 					count_mthp_stat(order, MTHP_STAT_SWPOUT_FALLBACK);
-					if (!add_to_swap(folio))
+					if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
 						goto activate_locked_split;
 				}
+				/*
+				 * Normally the folio will be dirtied in unmap because its
+				 * pte should be dirty. A special case is MADV_FREE page. The
+				 * page's pte could have dirty bit cleared but the folio's
+				 * SwapBacked flag is still set because clearing the dirty bit
+				 * and SwapBacked flag has no lock protected. For such folio,
+				 * unmap will not set dirty bit for it, so folio reclaim will
+				 * not write the folio out. This can cause data corruption when
+				 * the folio is swapped in later. Always setting the dirty flag
+				 * for the folio solves the problem.
+				 */
+				folio_mark_dirty(folio);
 			}
 		}
 
-- 
2.48.1



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] mm, swap: simplify folio swap allocation
  2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
@ 2025-02-14 20:13   ` Matthew Wilcox
  2025-02-15  6:40     ` Kairui Song
  2025-02-15 16:43   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-14 20:13 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On Sat, Feb 15, 2025 at 01:57:09AM +0800, Kairui Song wrote:
> @@ -1648,20 +1639,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	if (list_empty(&info->swaplist))
>  		list_add(&info->swaplist, &shmem_swaplist);
>  
> -	if (add_to_swap_cache(folio, swap,
> -			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
> -			NULL) == 0) {
> +	if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {

add_to_swap_cache() returns 0 on success or -errno.

folio_alloc_swap returns true on success.

That would seem to indicate you should change the polarity of this test?

Or should folio_alloc_swap() return an errno?  Is there value in
distinguishing why we couldn't alloc swap (ENOMEM vs ENOSPC, perhaps?)
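
[A minimal sketch, for illustration only, of what an errno-returning
variant could look like. swap_try_alloc_entry() is a hypothetical
stand-in for the fast-path/rotation allocation steps, and the specific
error codes are assumptions rather than anything from the posted series:

int folio_alloc_swap(struct folio *folio, gfp_t gfp)
{
	swp_entry_t entry = {};

	/* hypothetical helper wrapping the fast path + device rotation */
	if (!swap_try_alloc_entry(&entry, folio_order(folio)))
		return -ENOSPC;		/* no usable swap space left */

	if (mem_cgroup_try_charge_swap(folio, entry)) {
		put_swap_folio(folio, entry);
		return -ENOMEM;		/* memcg swap limit hit */
	}

	if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL)) {
		put_swap_folio(folio, entry);
		return -ENOMEM;		/* xarray node allocation failed */
	}

	/* nr_swap_pages accounting omitted for brevity */
	return 0;
}

Callers would then check for 0 (or a specific -errno) instead of a bool.]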



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] mm, swap: simplify folio swap allocation
  2025-02-14 20:13   ` Matthew Wilcox
@ 2025-02-15  6:40     ` Kairui Song
  0 siblings, 0 replies; 30+ messages in thread
From: Kairui Song @ 2025-02-15  6:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Baoquan He, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On Sat, Feb 15, 2025 at 4:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Feb 15, 2025 at 01:57:09AM +0800, Kairui Song wrote:
> > @@ -1648,20 +1639,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> >       if (list_empty(&info->swaplist))
> >               list_add(&info->swaplist, &shmem_swaplist);
> >
> > -     if (add_to_swap_cache(folio, swap,
> > -                     __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
> > -                     NULL) == 0) {
> > +     if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
>
> add_to_swap_cache() returns 0 on success or -errno.
>
> folio_alloc_swap returns true on success.
>
> That would seem to indicate you should change the polarity of this test?

I think I already did? It was (add_to_swap_cache(...) == 0), now it's
(folio_alloc_swap(...))

>
> Or should folio_alloc_swap() return an errno?  Is there value in
> distinguishing why we couldn't alloc swap (ENOMEM vs ENOSPC, perhaps?)
>

Good idea, returning an error value might be more helpful in the future;
I will update this part.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] mm, swap: remove swap slot cache
  2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
                   ` (6 preceding siblings ...)
  2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
@ 2025-02-15 10:27 ` Baoquan He
  2025-02-15 13:34   ` Kairui Song
  7 siblings, 1 reply; 30+ messages in thread
From: Baoquan He @ 2025-02-15 10:27 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

Hi Kairui,

On 02/15/25 at 01:57am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Slot cache was initially introduced by commit 67afa38e012e ("mm/swap:
> add cache for swap slots allocation") to reduce the lock contention
> of si->lock.

Thanks for adding me to CC. However, I failed to apply this series to
the latest mainline kernel; could you tell me the base commit of this
patchset?

Thanks
Baoquan



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] mm, swap: remove swap slot cache
  2025-02-15 10:27 ` [PATCH 0/7] mm, swap: remove swap slot cache Baoquan He
@ 2025-02-15 13:34   ` Kairui Song
  2025-02-15 15:07     ` Baoquan He
  0 siblings, 1 reply; 30+ messages in thread
From: Kairui Song @ 2025-02-15 13:34 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On Sat, Feb 15, 2025 at 6:28 PM Baoquan He <bhe@redhat.com> wrote:
>
> Hi Kairui,
>
> On 02/15/25 at 01:57am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Slot cache was initially introduced by commit 67afa38e012e ("mm/swap:
> > add cache for swap slots allocation") to reduce the lock contention
> > of si->lock.
>
> Thanks for adding me in CC. While I got failure to apply this series
> to the latest mainline kernel, could you tell what is the base commit
> of this pathcset?
>
> Thanks
> Baoquan
>

Hi Baoquan,

It's based on Andrew's mm-unstable here:
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git

I've just re-checked, and there should be no conflict. Sorry I didn't
include this info in the cover letter; mm development is rapid, so I
usually send patches based on mm-unstable.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] mm, swap: remove swap slot cache
  2025-02-15 13:34   ` Kairui Song
@ 2025-02-15 15:07     ` Baoquan He
  0 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-02-15 15:07 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/15/25 at 09:34pm, Kairui Song wrote:
> On Sat, Feb 15, 2025 at 6:28 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > Hi Kairui,
> >
> > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Slot cache was initially introduced by commit 67afa38e012e ("mm/swap:
> > > add cache for swap slots allocation") to reduce the lock contention
> > > of si->lock.
> >
> > Thanks for adding me in CC. While I got failure to apply this series
> > to the latest mainline kernel, could you tell what is the base commit
> > of this pathcset?
> >
> > Thanks
> > Baoquan
> >
> 
> Hi Baoquan,
> 
> It's based on Andrews's mm-unstable here:
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
> 
> I've just re-checked, there should be no conflict. Sorry I didn't
> include this info in the cover letter, mm development is rapid so
> usually I send patch based on mm-unstable.

Thanks, it applied cleanly to the akpm-mm/mm-unstable branch.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] mm, swap: remove swap slot cache
  2025-02-14 17:57 ` [PATCH 6/7] mm, swap: remove swap slot cache Kairui Song
@ 2025-02-15 16:23   ` kernel test robot
  2025-02-20  7:55   ` Baoquan He
  1 sibling, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-02-15 16:23 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed, Huang, Ying,
	Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh,
	linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-avoid-reclaiming-irrelevant-swap-cache/20250215-020239
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250214175709.76029-7-ryncsn%40gmail.com
patch subject: [PATCH 6/7] mm, swap: remove swap slot cache
config: x86_64-buildonly-randconfig-003-20250215 (https://download.01.org/0day-ci/archive/20250216/202502160045.CBhqtRFf-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250216/202502160045.CBhqtRFf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502160045.CBhqtRFf-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/swapfile.c:2589:13: warning: '__has_usable_swap' defined but not used [-Wunused-function]
    2589 | static bool __has_usable_swap(void)
         |             ^~~~~~~~~~~~~~~~~


vim +/__has_usable_swap +2589 mm/swapfile.c

40531542e28324 Cesar Eduardo Barros 2011-03-22  2588  
80e75021486bd2 Kefeng Wang          2024-04-18 @2589  static bool __has_usable_swap(void)
80e75021486bd2 Kefeng Wang          2024-04-18  2590  {
80e75021486bd2 Kefeng Wang          2024-04-18  2591  	return !plist_head_empty(&swap_active_head);
80e75021486bd2 Kefeng Wang          2024-04-18  2592  }
80e75021486bd2 Kefeng Wang          2024-04-18  2593  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] mm, swap: simplify folio swap allocation
  2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
  2025-02-14 20:13   ` Matthew Wilcox
@ 2025-02-15 16:43   ` kernel test robot
  2025-02-15 16:54   ` kernel test robot
  2025-02-20 10:41   ` Baoquan He
  3 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-02-15 16:43 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed, Huang, Ying,
	Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh,
	linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-avoid-reclaiming-irrelevant-swap-cache/20250215-020239
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250214175709.76029-8-ryncsn%40gmail.com
patch subject: [PATCH 7/7] mm, swap: simplify folio swap allocation
config: x86_64-buildonly-randconfig-002-20250215 (https://download.01.org/0day-ci/archive/20250216/202502160013.l8ZYewQK-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250216/202502160013.l8ZYewQK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502160013.l8ZYewQK-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:21:
   In file included from include/linux/mm.h:2208:
   include/linux/vmstat.h:504:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     504 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     505 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:511:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     511 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     512 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:524:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     524 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     525 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
>> include/linux/swap.h:591:1: error: expected identifier or '('
     591 | {
         | ^
   4 warnings and 1 error generated.
   make[3]: *** [scripts/Makefile.build:102: arch/x86/kernel/asm-offsets.s] Error 1 shuffle=2496405435
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1264: prepare0] Error 2 shuffle=2496405435
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=2496405435
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:251: __sub-make] Error 2 shuffle=2496405435
   make: Target 'prepare' not remade because of errors.


vim +591 include/linux/swap.h

8334b96221ff0d Minchan Kim    2015-09-08  589  
f8d9ff0d052908 Kairui Song    2025-02-15  590  static bool folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
^1da177e4c3f41 Linus Torvalds 2005-04-16 @591  {
f8d9ff0d052908 Kairui Song    2025-02-15  592  	return false;
^1da177e4c3f41 Linus Torvalds 2005-04-16  593  }
^1da177e4c3f41 Linus Torvalds 2005-04-16  594  
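
[The excerpt suggests the !CONFIG_SWAP stub kept a stray semicolon after
the new declaration. A sketch of the likely shape of the fix, assuming
the stub is simply meant to report failure, not taken from a posted v2:

static inline bool folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
{
	return false;
}
]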

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] mm, swap: simplify folio swap allocation
  2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
  2025-02-14 20:13   ` Matthew Wilcox
  2025-02-15 16:43   ` kernel test robot
@ 2025-02-15 16:54   ` kernel test robot
  2025-02-20 10:41   ` Baoquan He
  3 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-02-15 16:54 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed, Huang, Ying,
	Baoquan He, Nhat Pham, Johannes Weiner, Kalesh Singh,
	linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-avoid-reclaiming-irrelevant-swap-cache/20250215-020239
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250214175709.76029-8-ryncsn%40gmail.com
patch subject: [PATCH 7/7] mm, swap: simplify folio swap allocation
config: x86_64-buildonly-randconfig-001-20250215 (https://download.01.org/0day-ci/archive/20250216/202502160040.5ULBvBsP-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250216/202502160040.5ULBvBsP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502160040.5ULBvBsP-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/suspend.h:5,
                    from arch/x86/kernel/asm-offsets.c:14:
>> include/linux/swap.h:591:1: error: expected identifier or '(' before '{' token
     591 | {
         | ^
>> include/linux/swap.h:590:13: warning: 'folio_alloc_swap' declared 'static' but never defined [-Wunused-function]
     590 | static bool folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
         |             ^~~~~~~~~~~~~~~~
   make[3]: *** [scripts/Makefile.build:102: arch/x86/kernel/asm-offsets.s] Error 1 shuffle=2631350961
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1264: prepare0] Error 2 shuffle=2631350961
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=2631350961
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:251: __sub-make] Error 2 shuffle=2631350961
   make: Target 'prepare' not remade because of errors.


vim +591 include/linux/swap.h

8334b96221ff0d Minchan Kim    2015-09-08  589  
f8d9ff0d052908 Kairui Song    2025-02-15 @590  static bool folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
^1da177e4c3f41 Linus Torvalds 2005-04-16 @591  {
f8d9ff0d052908 Kairui Song    2025-02-15  592  	return false;
^1da177e4c3f41 Linus Torvalds 2005-04-16  593  }
^1da177e4c3f41 Linus Torvalds 2005-04-16  594  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache
  2025-02-14 17:57 ` [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache Kairui Song
@ 2025-02-19  2:11   ` Baoquan He
  0 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-02-19  2:11 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/15/25 at 01:57am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Swap allocator will do swap cache reclaim to recycle HAS_CACHE slots for
> allocation. It initiates the reclaim from the offset to be reclaimed and
> looks up the corresponding folio. The lookup process is lockless, so it's
> possible the folio will be removed from the swap cache and given
> a different swap entry before the reclaim locks the folio. If
> it happens, the reclaim will end up reclaiming an irrelevant folio, and
> return wrong return value.
> 
> This shouldn't cause any problem with correctness or stability, but
> it is indeed confusing and unexpected, and will increase fragmentation,
> decrease performance.
> 
> Fix this by checking whether the folio is still pointing to the offset
> the allocator want to reclaim before reclaiming it.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 34baefb000b5..c77ffee4af86 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -210,6 +210,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  	int ret, nr_pages;
>  	bool need_reclaim;
>  
> +again:
>  	folio = filemap_get_folio(address_space, swap_cache_index(entry));
>  	if (IS_ERR(folio))
>  		return 0;
> @@ -227,8 +228,16 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  	if (!folio_trylock(folio))
>  		goto out;
>  
> -	/* offset could point to the middle of a large folio */
> +	/*
> +	 * Offset could point to the middle of a large folio, or folio
> +	 * may no longer point to the expected offset before it's locked.
> +	 */
>  	entry = folio->swap;
> +	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		goto again;
> +	}
>  	offset = swp_offset(entry);

LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

While reading the code in __try_to_reclaim_swap(), I am always worried
that the entry indexed by offset could be accessed by other users so that
it doesn't only have cache, because we released the ci->lock and don't
hold any lock during that period. It could be me thinking too much.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] mm, swap: drop the flag TTRS_DIRECT
  2025-02-14 17:57 ` [PATCH 2/7] mm, swap: drop the flag TTRS_DIRECT Kairui Song
@ 2025-02-19  2:42   ` Baoquan He
  0 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-02-19  2:42 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/15/25 at 01:57am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> This flag exists temporarily to allow the allocator to bypass the slot
> cache during freeing, so reclaiming one slot will free the slot
> immediately.
> 
> But now we have already removed slot cache usage on freeing, so this
> flag has no effect now.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 23 +++--------------------
>  1 file changed, 3 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c77ffee4af86..449e388a6fec 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c

Reviewed-by: Baoquan He <bhe@redhat.com>

> @@ -158,8 +158,6 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
>  #define TTRS_UNMAPPED		0x2
>  /* Reclaim the swap entry if swap is getting full */
>  #define TTRS_FULL		0x4
> -/* Reclaim directly, bypass the slot cache and don't touch device lock */
> -#define TTRS_DIRECT		0x8
>  
>  static bool swap_only_has_cache(struct swap_info_struct *si,
>  			      unsigned long offset, int nr_pages)
> @@ -257,23 +255,8 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  	if (!need_reclaim)
>  		goto out_unlock;
>  
> -	if (!(flags & TTRS_DIRECT)) {
> -		/* Free through slot cache */
> -		delete_from_swap_cache(folio);
> -		folio_set_dirty(folio);
> -		ret = nr_pages;
> -		goto out_unlock;
> -	}
> -
> -	xa_lock_irq(&address_space->i_pages);
> -	__delete_from_swap_cache(folio, entry, NULL);
> -	xa_unlock_irq(&address_space->i_pages);
> -	folio_ref_sub(folio, nr_pages);
> +	delete_from_swap_cache(folio);
>  	folio_set_dirty(folio);
> -
> -	ci = lock_cluster(si, offset);
> -	swap_entry_range_free(si, ci, entry, nr_pages);
> -	unlock_cluster(ci);
>  	ret = nr_pages;
>  out_unlock:
>  	folio_unlock(folio);
> @@ -707,7 +690,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
>  			offset++;
>  			break;
>  		case SWAP_HAS_CACHE:
> -			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
> +			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>  			if (nr_reclaim > 0)
>  				offset += nr_reclaim;
>  			else
> @@ -860,7 +843,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
>  			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
>  				spin_unlock(&ci->lock);
>  				nr_reclaim = __try_to_reclaim_swap(si, offset,
> -								   TTRS_ANYWAY | TTRS_DIRECT);
> +								   TTRS_ANYWAY);
>  				spin_lock(&ci->lock);
>  				if (nr_reclaim) {
>  					offset += abs(nr_reclaim);
> -- 
> 2.48.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/7] mm, swap: avoid redundant swap device pinning
  2025-02-14 17:57 ` [PATCH 3/7] mm, swap: avoid redundant swap device pinning Kairui Song
@ 2025-02-19  3:35   ` Baoquan He
  0 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-02-19  3:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/15/25 at 01:57am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> There are only two callers of __read_swap_cache_async not holding a swap
> device reference, so make them hold a reference instead, and drop the
> get/put_swap_device calls in __read_swap_cache_async. This should reduce
> the overhead for swap in during page fault slightly.

This looks good to me, though the log took me a little longer to
understand. Maybe rephrasing it a bit would make the log easier to read.
Not sure if my understanding is correct.

=======
Currently, __read_swap_cache_async() has get/put_swap_device() calls to
increase/decrease a swap device reference. Some of its callers already
hold the swap device reference, e.g. do_swap_page() and
shmem_swapin_folio(), from which __read_swap_cache_async() is finally
called. Now there are only two exceptional callers not holding a swap
device reference, so make them hold a reference instead, and drop the
get/put_swap_device calls in __read_swap_cache_async(). This should
reduce the overhead for swap in during page fault slightly.
========

Other than the nit in the log, this looks good to me.

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 14 ++++++++------
>  mm/zswap.c      |  6 ++++++
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index a54b035d6a6c..50840a2887a5 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -426,17 +426,13 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
>  		bool skip_if_exists)
>  {
> -	struct swap_info_struct *si;
> +	struct swap_info_struct *si = swp_swap_info(entry);
>  	struct folio *folio;
>  	struct folio *new_folio = NULL;
>  	struct folio *result = NULL;
>  	void *shadow = NULL;
>  
>  	*new_page_allocated = false;
> -	si = get_swap_device(entry);
> -	if (!si)
> -		return NULL;
> -
>  	for (;;) {
>  		int err;
>  		/*
> @@ -532,7 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  	put_swap_folio(new_folio, entry);
>  	folio_unlock(new_folio);
>  put_and_return:
> -	put_swap_device(si);
>  	if (!(*new_page_allocated) && new_folio)
>  		folio_put(new_folio);
>  	return result;
> @@ -552,11 +547,16 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		struct vm_area_struct *vma, unsigned long addr,
>  		struct swap_iocb **plug)
>  {
> +	struct swap_info_struct *si;
>  	bool page_allocated;
>  	struct mempolicy *mpol;
>  	pgoff_t ilx;
>  	struct folio *folio;
>  
> +	si = get_swap_device(entry);
> +	if (!si)
> +		return NULL;
> +
>  	mpol = get_vma_policy(vma, addr, 0, &ilx);
>  	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>  					&page_allocated, false);
> @@ -564,6 +564,8 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  
>  	if (page_allocated)
>  		swap_read_folio(folio, plug);
> +
> +	put_swap_device(si);
>  	return folio;
>  }
>  
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ac9d299e7d0c..83dfa1f9e689 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1051,14 +1051,20 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	struct folio *folio;
>  	struct mempolicy *mpol;
>  	bool folio_was_allocated;
> +	struct swap_info_struct *si;
>  	struct writeback_control wbc = {
>  		.sync_mode = WB_SYNC_NONE,
>  	};
>  
>  	/* try to allocate swap cache folio */
> +	si = get_swap_device(swpentry);
> +	if (!si)
> +		return -EEXIST;
> +
>  	mpol = get_task_policy(current);
>  	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>  				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> +	put_swap_device(si);
>  	if (!folio)
>  		return -ENOMEM;
>  
> -- 
> 2.48.1
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-14 17:57 ` [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path Kairui Song
@ 2025-02-19  7:53   ` Baoquan He
  2025-02-19  8:34     ` Kairui Song
  0 siblings, 1 reply; 30+ messages in thread
From: Baoquan He @ 2025-02-19  7:53 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/15/25 at 01:57am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Current allocation workflow first traverses the plist with a global lock
> held, after choosing a device, it uses the percpu cluster on that swap
> device. This commit moves the percpu cluster variable out of being tied
> to individual swap devices, making it a global percpu variable, and will
> be used directly for allocation as a fast path.
> 
> The global percpu cluster variable will never point to a HDD device, and
> allocation on HDD devices is still globally serialized.
> 
> This improves the allocator performance and prepares for removal of the
> slot cache in later commits. There shouldn't be much observable behavior
> change, except one thing: this changes how swap device allocation
> rotation works.
> 
> Currently, each allocation will rotate the plist, and because of the
> existence of slot cache (64 entries), swap devices of the same priority
> are rotated for every 64 entries consumed. And, high order allocations
> are different, they will bypass the slot cache, and so swap device is
> rotated for every 16K, 32K, or up to 2M allocation.
> 
> The rotation rule was never clearly defined or documented, it was changed
> several times without mentioning too.
> 
> After this commit, once slot cache is gone in later commits, swap device
> rotation will happen for every consumed cluster. Ideally non-HDD devices
> will be rotated if 2M space has been consumed for each order, this seems

This breaks the rule where the high priority swap device is always taken
to allocate as long as there's free space in the device. After this patch,
it will try the percpu cluster firstly which is lower priority even though
the higher priority device has free space. However, this only happens when
the higher priority device is exhausted, not a generic case. If this is
expected, it may need be mentioned in log or doc somewhere at least.

> reasonable. HDD devices is rotated for every allocation regardless of the
> allocation order, which should be OK and trivial.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h |  11 ++--
>  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
>  2 files changed, 79 insertions(+), 52 deletions(-)
......
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ae3bd0a862fc..791cd7ed5bdf 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
>  
......snip....
>  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  {
>  	int order = swap_entry_order(entry_order);
> @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  	int n_ret = 0;
>  	int node;
>  
> +	/* Fast path using percpu cluster */
> +	local_lock(&percpu_swap_cluster.lock);
> +	n_ret = swap_alloc_fast(swp_entries,
> +				SWAP_HAS_CACHE,
> +				order, n_goal);
> +	if (n_ret == n_goal)
> +		goto out;
> +
> +	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);

Here, the behaviour is changed too. In the old allocation path, a
partial allocation would jump out and return. In this patch, you try the
percpu cluster first, then call scan_swap_map_slots() to try its best,
and jump out even if only a partial allocation succeeded. But the
allocation from scan_swap_map_slots() could happen on a different si
device, which looks bizarre. Do you think we need to reconsider the
design?

> +	/* Rotate the device and switch to a new cluster */
>  	spin_lock(&swap_avail_lock);
>  start_over:
>  	node = numa_node_id();
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> -		/* requeue si to after same-priority siblings */
>  		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  		spin_unlock(&swap_avail_lock);
>  		if (get_swap_device_info(si)) {
> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -					n_goal, swp_entries, order);
> +			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
> +					swp_entries + n_ret, order);
>  			put_swap_device(si);
>  			if (n_ret || size > 1)
> -				goto check_out;
> +				goto out;
>  		}
>  
>  		spin_lock(&swap_avail_lock);
> @@ -1241,12 +1290,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  		if (plist_node_empty(&next->avail_lists[node]))
>  			goto start_over;
>  	}
> -
>  	spin_unlock(&swap_avail_lock);
> -
> -check_out:
> +out:
> +	local_unlock(&percpu_swap_cluster.lock);
>  	atomic_long_sub(n_ret * size, &nr_swap_pages);
> -
>  	return n_ret;
>  }
>  
......snip...



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-19  7:53   ` Baoquan He
@ 2025-02-19  8:34     ` Kairui Song
  2025-02-19  9:26       ` Baoquan He
  2025-02-19 10:55       ` Baoquan He
  0 siblings, 2 replies; 30+ messages in thread
From: Kairui Song @ 2025-02-19  8:34 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On Wed, Feb 19, 2025 at 3:54 PM Baoquan He <bhe@redhat.com> wrote:

Hi Baoquan,

Thanks for the review!

>
> On 02/15/25 at 01:57am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Current allocation workflow first traverses the plist with a global lock
> > held, after choosing a device, it uses the percpu cluster on that swap
> > device. This commit moves the percpu cluster variable out of being tied
> > to individual swap devices, making it a global percpu variable, and will
> > be used directly for allocation as a fast path.
> >
> > The global percpu cluster variable will never point to a HDD device, and
> > allocation on HDD devices is still globally serialized.
> >
> > This improves the allocator performance and prepares for removal of the
> > slot cache in later commits. There shouldn't be much observable behavior
> > change, except one thing: this changes how swap device allocation
> > rotation works.
> >
> > Currently, each allocation will rotate the plist, and because of the
> > existence of slot cache (64 entries), swap devices of the same priority
> > are rotated for every 64 entries consumed. And, high order allocations
> > are different, they will bypass the slot cache, and so swap device is
> > rotated for every 16K, 32K, or up to 2M allocation.
> >
> > The rotation rule was never clearly defined or documented, it was changed
> > several times without mentioning too.
> >
> > After this commit, once slot cache is gone in later commits, swap device
> > rotation will happen for every consumed cluster. Ideally non-HDD devices
> > will be rotated if 2M space has been consumed for each order, this seems
>
> This breaks the rule where the high priority swap device is always taken
> to allocate as long as there's free space in the device. After this patch,
> it will try the percpu cluster firstly which is lower priority even though
> the higher priority device has free space. However, this only happens when
> the higher priority device is exhausted, not a generic case. If this is
> expected, it may need be mentioned in log or doc somewhere at least.

Hmm, actually this rule was already broken if you are very strict
about it. The current percpu slot cache does a pre-allocation, so the
high priority device will be removed from the plist while some CPUs'
slot caches are still holding usable entries.

If the high priority device is exhausted, some CPUs' percpu clusters
will indeed point to a low priority device, and keep using it until
the percpu cluster is drained. I think this should be OK. The high
priority device is already full, so the amount of swapout that falls
back to the low priority device is only a performance issue; I think
it's a tiny change for a rare case.

>
> > reasonable. HDD devices is rotated for every allocation regardless of the
> > allocation order, which should be OK and trivial.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  include/linux/swap.h |  11 ++--
> >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
> >  2 files changed, 79 insertions(+), 52 deletions(-)
> ......
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index ae3bd0a862fc..791cd7ed5bdf 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> >
> ......snip....
> >  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> >  {
> >       int order = swap_entry_order(entry_order);
> > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> >       int n_ret = 0;
> >       int node;
> >
> > +     /* Fast path using percpu cluster */
> > +     local_lock(&percpu_swap_cluster.lock);
> > +     n_ret = swap_alloc_fast(swp_entries,
> > +                             SWAP_HAS_CACHE,
> > +                             order, n_goal);
> > +     if (n_ret == n_goal)
> > +             goto out;
> > +
> > +     n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
>
> Here, the behaviour is changed too. In old allocation, partial
> allocation will jump out to return. In this patch, you try the percpu
> cluster firstly, then call scan_swap_map_slots() to try best and will
> jump out even though partial allocation succeed. But the allocation from
> scan_swap_map_slots() could happen on different si device, this looks
> bizarre. Do you think we need reconsider the design?

Right, that's a behavior change, but it only temporarily affects the
slot cache. get_swap_pages will only be called with size > 1 when
order == 0, and only by the slot cache. (Large order allocations always
use size == 1, other users only use order == 0 && size == 1). So I
didn't notice it, as this series is removing the slot cache.

The partial side effect would be "returned slots will be from
different devices" and "slot_cache may get drained faster as
get_swap_pages may return fewer slots when the percpu cluster is
drained". It might be a performance issue but seems slight and trivial,
and the slot cache can still work. And the next commit will just remove
the slot cache, so the problem will be gone. I think I can add a
comment about it here?
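
[For example, something along these lines above the slow-path fallback
in get_swap_pages(); the wording is only illustrative and not taken from
the series:

	/*
	 * The fast path above may return fewer slots than requested, and
	 * the slow path below may then top up the batch from a different
	 * device.  Only the slot cache (order == 0 with n_goal > 1) can
	 * observe such a partial or mixed batch, and it is removed
	 * entirely by the next patch.
	 */
]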


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-19  8:34     ` Kairui Song
@ 2025-02-19  9:26       ` Baoquan He
  2025-02-19 10:55       ` Baoquan He
  1 sibling, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-02-19  9:26 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/19/25 at 04:34pm, Kairui Song wrote:
> On Wed, Feb 19, 2025 at 3:54 PM Baoquan He <bhe@redhat.com> wrote:
> 
> Hi Baoquan,
> 
> Thanks for the review!
> 
> >
> > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Current allocation workflow first traverses the plist with a global lock
> > > held, after choosing a device, it uses the percpu cluster on that swap
> > > device. This commit moves the percpu cluster variable out of being tied
> > > to individual swap devices, making it a global percpu variable, and will
> > > be used directly for allocation as a fast path.
> > >
> > > The global percpu cluster variable will never point to a HDD device, and
> > > allocation on HDD devices is still globally serialized.
> > >
> > > This improves the allocator performance and prepares for removal of the
> > > slot cache in later commits. There shouldn't be much observable behavior
> > > change, except one thing: this changes how swap device allocation
> > > rotation works.
> > >
> > > Currently, each allocation will rotate the plist, and because of the
> > > existence of slot cache (64 entries), swap devices of the same priority
> > > are rotated for every 64 entries consumed. And, high order allocations
> > > are different, they will bypass the slot cache, and so swap device is
> > > rotated for every 16K, 32K, or up to 2M allocation.
> > >
> > > The rotation rule was never clearly defined or documented, it was changed
> > > several times without mentioning too.
> > >
> > > After this commit, once slot cache is gone in later commits, swap device
> > > rotation will happen for every consumed cluster. Ideally non-HDD devices
> > > will be rotated if 2M space has been consumed for each order, this seems
> >
> > This breaks the rule where the high priority swap device is always taken
> > to allocate as long as there's free space in the device. After this patch,
> > it will try the percpu cluster firstly which is lower priority even though
> > the higher priority device has free space. However, this only happens when
> > the higher priority device is exhausted, not a generic case. If this is
> > expected, it may need be mentioned in log or doc somewhere at least.
> 
> Hmm, actually this rule was already broken if you are very strict
> about it. The current percpu slot cache does a pre-allocation, so the
> high priority device will be removed from the plist while some CPU's
> slot cache holding usable entries.

Ah, right, I didn't think about this point.

> 
> If the high priority device is exhausted, some CPU's percpu cluster
> will point to a low priority device indeed, and keep using it until
> the percpu cluster is drained. I think this should be
> OK. The high priority device is already full, so the amount of
> swapouts falls back to low priority device is only a performance
> issue, I think it's a tiny change for a rare case.

Agree, thanks for the explanation.

> 
> >
> > > reasonable. HDD devices is rotated for every allocation regardless of the
> > > allocation order, which should be OK and trivial.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  include/linux/swap.h |  11 ++--
> > >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
> > >  2 files changed, 79 insertions(+), 52 deletions(-)
> > ......
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index ae3bd0a862fc..791cd7ed5bdf 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> > >
> > ......snip....
> > >  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > >  {
> > >       int order = swap_entry_order(entry_order);
> > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > >       int n_ret = 0;
> > >       int node;
> > >
> > > +     /* Fast path using percpu cluster */
> > > +     local_lock(&percpu_swap_cluster.lock);
> > > +     n_ret = swap_alloc_fast(swp_entries,
> > > +                             SWAP_HAS_CACHE,
> > > +                             order, n_goal);
> > > +     if (n_ret == n_goal)
> > > +             goto out;
> > > +
> > > +     n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> >
> > Here, the behaviour is changed too. In old allocation, partial
> > allocation will jump out to return. In this patch, you try the percpu
> > cluster firstly, then call scan_swap_map_slots() to try best and will
> > jump out even though partial allocation succeed. But the allocation from
> > scan_swap_map_slots() could happen on different si device, this looks
> > bizarre. Do you think we need reconsider the design?
> 
> Right, that's a behavior change, but only temporarily affects slot cache.
> get_swap_pages will only be called with size > 1 when order == 0, and
> only by slot cache. (Large order allocation always use size == 1,
> other users only uses order == 0 && size == 1). So I didn't' notice
> it, as this series is removing slot cache.

Right, I am reviewing patch 6 and noticed this is temporary.

> 
> The partial side effect would be "returned slots will be from
> different devices" and "slot_cache may get drained faster as
> get_swap_pages may return less slots when percpu cluster is drained".
> Might be a performance issue but seems slight and trivial, slot cache
> can still work. And the next commit will just remove the slot cache,
> and the problem will be gone. I think I can add a comment about it
> here?

Sounds good. Adding a comment can avoid confusing other people when they
check patch 5. Thanks.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-19  8:34     ` Kairui Song
  2025-02-19  9:26       ` Baoquan He
@ 2025-02-19 10:55       ` Baoquan He
  2025-02-19 11:12         ` Kairui Song
  1 sibling, 1 reply; 30+ messages in thread
From: Baoquan He @ 2025-02-19 10:55 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/19/25 at 04:34pm, Kairui Song wrote:
> On Wed, Feb 19, 2025 at 3:54 PM Baoquan He <bhe@redhat.com> wrote:
> 
> Hi Baoquan,
> 
> Thanks for the review!
> 
> >
> > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Current allocation workflow first traverses the plist with a global lock
> > > held, after choosing a device, it uses the percpu cluster on that swap
> > > device. This commit moves the percpu cluster variable out of being tied
> > > to individual swap devices, making it a global percpu variable, and will
> > > be used directly for allocation as a fast path.
> > >
> > > The global percpu cluster variable will never point to a HDD device, and
> > > allocation on HDD devices is still globally serialized.
> > >
> > > This improves the allocator performance and prepares for removal of the
> > > slot cache in later commits. There shouldn't be much observable behavior
> > > change, except one thing: this changes how swap device allocation
> > > rotation works.
> > >
> > > Currently, each allocation will rotate the plist, and because of the
> > > existence of slot cache (64 entries), swap devices of the same priority
> > > are rotated for every 64 entries consumed. And, high order allocations
> > > are different, they will bypass the slot cache, and so swap device is
> > > rotated for every 16K, 32K, or up to 2M allocation.
> > >
> > > The rotation rule was never clearly defined or documented, it was changed
> > > several times without mentioning too.
> > >
> > > After this commit, once slot cache is gone in later commits, swap device
> > > rotation will happen for every consumed cluster. Ideally non-HDD devices
> > > will be rotated if 2M space has been consumed for each order, this seems
> >
> > This breaks the rule where the high priority swap device is always taken
> > to allocate as long as there's free space in the device. After this patch,
> > it will try the percpu cluster firstly which is lower priority even though
> > the higher priority device has free space. However, this only happens when
> > the higher priority device is exhausted, not a generic case. If this is
> > expected, it may need be mentioned in log or doc somewhere at least.
> 
> Hmm, actually this rule was already broken if you are very strict
> about it. The current percpu slot cache does a pre-allocation, so the
> high priority device will be removed from the plist while some CPU's
> slot cache holding usable entries.
> 
> If the high priority device is exhausted, some CPU's percpu cluster
> will point to a low priority device indeed, and keep using it until
> the percpu cluster is drained. I think this should be
> OK. The high priority device is already full, so the amount of
> swapouts falls back to low priority device is only a performance
> issue, I think it's a tiny change for a rare case.
> 
> >
> > > reasonable. HDD devices is rotated for every allocation regardless of the
> > > allocation order, which should be OK and trivial.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  include/linux/swap.h |  11 ++--
> > >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
> > >  2 files changed, 79 insertions(+), 52 deletions(-)
> > ......
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index ae3bd0a862fc..791cd7ed5bdf 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> > >
> > ......snip....
> > >  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > >  {
> > >       int order = swap_entry_order(entry_order);
> > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > >       int n_ret = 0;
> > >       int node;
> > >
> > > +     /* Fast path using percpu cluster */
> > > +     local_lock(&percpu_swap_cluster.lock);
> > > +     n_ret = swap_alloc_fast(swp_entries,
> > > +                             SWAP_HAS_CACHE,
> > > +                             order, n_goal);
> > > +     if (n_ret == n_goal)
> > > +             goto out;
> > > +
> > > +     n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> >
> > Here, the behaviour is changed too. In old allocation, partial
> > allocation will jump out to return. In this patch, you try the percpu
> > cluster firstly, then call scan_swap_map_slots() to try best and will
> > jump out even though partial allocation succeed. But the allocation from
> > scan_swap_map_slots() could happen on different si device, this looks
> > bizarre. Do you think we need reconsider the design?
> 
> Right, that's a behavior change, but only temporarily affects slot cache.
> get_swap_pages will only be called with size > 1 when order == 0, and
> only by slot cache. (Large order allocation always use size == 1,
> other users only uses order == 0 && size == 1). So I didn't' notice
> it, as this series is removing slot cache.
> 
> The partial side effect would be "returned slots will be from
> different devices" and "slot_cache may get drained faster as
> get_swap_pages may return less slots when percpu cluster is drained".
> Might be a performance issue but seems slight and trivial, slot cache
> can still work. And the next commit will just remove the slot cache,
> and the problem will be gone. I think I can add a comment about it
> here?

By the way, another thing that just occurred to me is the percpu
cluster becoming global over all devices. If one order on the si stored
in the percpu cluster is used up, then the whole percpu cluster is
drained and rewritten. Won't this impact performance compared with the
old way of embedding a percpu cluster in each si? In the log you said
"Ideally non-HDD devices will be rotated if 2M space has been consumed
for each order, this seems reasonable.", while in reality it may be
very difficult to reach the point where 2M space has been consumed for
each order; more often the percpu cluster will be rewritten when only a
little of one order's space has been consumed. I wonder what I have
missed about this point.

> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-19 10:55       ` Baoquan He
@ 2025-02-19 11:12         ` Kairui Song
  2025-02-20  2:35           ` Baoquan He
  0 siblings, 1 reply; 30+ messages in thread
From: Kairui Song @ 2025-02-19 11:12 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On Wed, Feb 19, 2025 at 6:55 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 02/19/25 at 04:34pm, Kairui Song wrote:
> > On Wed, Feb 19, 2025 at 3:54 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > Hi Baoquan,
> >
> > Thanks for the review!
> >
> > >
> > > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > Current allocation workflow first traverses the plist with a global lock
> > > > held, after choosing a device, it uses the percpu cluster on that swap
> > > > device. This commit moves the percpu cluster variable out of being tied
> > > > to individual swap devices, making it a global percpu variable, and will
> > > > be used directly for allocation as a fast path.
> > > >
> > > > The global percpu cluster variable will never point to a HDD device, and
> > > > allocation on HDD devices is still globally serialized.
> > > >
> > > > This improves the allocator performance and prepares for removal of the
> > > > slot cache in later commits. There shouldn't be much observable behavior
> > > > change, except one thing: this changes how swap device allocation
> > > > rotation works.
> > > >
> > > > Currently, each allocation will rotate the plist, and because of the
> > > > existence of slot cache (64 entries), swap devices of the same priority
> > > > are rotated for every 64 entries consumed. And, high order allocations
> > > > are different, they will bypass the slot cache, and so swap device is
> > > > rotated for every 16K, 32K, or up to 2M allocation.
> > > >
> > > > The rotation rule was never clearly defined or documented, it was changed
> > > > several times without mentioning too.
> > > >
> > > > After this commit, once slot cache is gone in later commits, swap device
> > > > rotation will happen for every consumed cluster. Ideally non-HDD devices
> > > > will be rotated if 2M space has been consumed for each order, this seems
> > >
> > > This breaks the rule where the high priority swap device is always taken
> > > to allocate as long as there's free space in the device. After this patch,
> > > it will try the percpu cluster firstly which is lower priority even though
> > > the higher priority device has free space. However, this only happens when
> > > the higher priority device is exhausted, not a generic case. If this is
> > > expected, it may need be mentioned in log or doc somewhere at least.
> >
> > Hmm, actually this rule was already broken if you are very strict
> > about it. The current percpu slot cache does a pre-allocation, so the
> > high priority device will be removed from the plist while some CPU's
> > slot cache holding usable entries.
> >
> > If the high priority device is exhausted, some CPU's percpu cluster
> > will point to a low priority device indeed, and keep using it until
> > the percpu cluster is drained. I think this should be
> > OK. The high priority device is already full, so the amount of
> > swapouts falls back to low priority device is only a performance
> > issue, I think it's a tiny change for a rare case.
> >
> > >
> > > > reasonable. HDD devices is rotated for every allocation regardless of the
> > > > allocation order, which should be OK and trivial.
> > > >
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > ---
> > > >  include/linux/swap.h |  11 ++--
> > > >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
> > > >  2 files changed, 79 insertions(+), 52 deletions(-)
> > > ......
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index ae3bd0a862fc..791cd7ed5bdf 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> > > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> > > >
> > > ......snip....
> > > >  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > >  {
> > > >       int order = swap_entry_order(entry_order);
> > > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > >       int n_ret = 0;
> > > >       int node;
> > > >
> > > > +     /* Fast path using percpu cluster */
> > > > +     local_lock(&percpu_swap_cluster.lock);
> > > > +     n_ret = swap_alloc_fast(swp_entries,
> > > > +                             SWAP_HAS_CACHE,
> > > > +                             order, n_goal);
> > > > +     if (n_ret == n_goal)
> > > > +             goto out;
> > > > +
> > > > +     n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> > >
> > > Here, the behaviour is changed too. In old allocation, partial
> > > allocation will jump out to return. In this patch, you try the percpu
> > > cluster firstly, then call scan_swap_map_slots() to try best and will
> > > jump out even though partial allocation succeed. But the allocation from
> > > scan_swap_map_slots() could happen on different si device, this looks
> > > bizarre. Do you think we need reconsider the design?
> >
> > Right, that's a behavior change, but only temporarily affects slot cache.
> > get_swap_pages will only be called with size > 1 when order == 0, and
> > only by slot cache. (Large order allocation always use size == 1,
> > other users only uses order == 0 && size == 1). So I didn't' notice
> > it, as this series is removing slot cache.
> >
> > The partial side effect would be "returned slots will be from
> > different devices" and "slot_cache may get drained faster as
> > get_swap_pages may return less slots when percpu cluster is drained".
> > Might be a performance issue but seems slight and trivial, slot cache
> > can still work. And the next commit will just remove the slot cache,
> > and the problem will be gone. I think I can add a comment about it
> > here?
>
> By the way, another thing I suddenly think of is the percpu cluster
> becoming glober over all devices. If one order on the stored si of
> percpu cluster is used up, then the whole percpu cluster is drained and
> rewritten. Won't this impact performance compared with the old embedding
> percpu cluster in each si? In log you said "Ideally non-HDD devices
> will be rotated if 2M space has been consumed for each order, this seems
> reasonable.", while in reality it may be very difficult to achieve the
> 'each 2M space has been consumed for each order', but more often happen
> when very few of order's space has been consumed, then rewrite percpu.
> Wonder what I have missed about this point.

Hi Baoquan,

> then the whole percpu cluster is drained and rewritten

Not sure what you mean. SWAP IO doesn't happen in units of clusters;
a cluster is only a management unit for slots, so only allocated / freed
slots will be written. Discard is a different thing, and this should
have very little effect on that.

Or do you mean the percpu struct array getting updated? That should be
even faster than before: with a global percpu variable it is easier to
calculate the offset at runtime, using GS.
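
For illustration, a rough sketch of the two access patterns (names follow
the patch and the pre-series code, slightly simplified):

	/* before: per-device percpu cluster, reached via the si pointer */
	cluster = this_cpu_ptr(si->percpu_cluster);

	/* after: one global percpu variable, declared roughly as
	 *   static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster);
	 * its CPU-local address resolves directly from the GS-based percpu
	 * offset, with no extra pointer chase */
	cluster = this_cpu_ptr(&percpu_swap_cluster);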

> n reality it may be very difficult to achieve the 'each 2M space has been consumed for each order',

Very true, but notice that for order >= 1 the slot cache never worked
before. And for order == 0, it's very likely that a cluster will have
more than 64 slots usable. The test result I posted should be a good
example: the device was very full during the test, and performance is
basically identical to before. My only concern was about the device
rotation: as the slot cache never worked for order >= 1, devices were
rotated very frequently. But it still seems no one really cared about
it; mTHP swapout is a new thing and the previous rotation rule seems
even more confusing than this new idea.
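
(For rough numbers: with 4K pages a 2M cluster holds 512 order-0 slots,
so even a partially used cluster typically covers well more than a
64-entry slot cache batch; this assumes the usual x86 page and cluster
sizes.)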


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-19 11:12         ` Kairui Song
@ 2025-02-20  2:35           ` Baoquan He
  2025-02-20  2:48             ` Kairui Song
  0 siblings, 1 reply; 30+ messages in thread
From: Baoquan He @ 2025-02-20  2:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/19/25 at 07:12pm, Kairui Song wrote:
> On Wed, Feb 19, 2025 at 6:55 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 02/19/25 at 04:34pm, Kairui Song wrote:
> > > On Wed, Feb 19, 2025 at 3:54 PM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > Hi Baoquan,
> > >
> > > Thanks for the review!
> > >
> > > >
> > > > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > > > From: Kairui Song <kasong@tencent.com>
> > > > >
> > > > > Current allocation workflow first traverses the plist with a global lock
> > > > > held, after choosing a device, it uses the percpu cluster on that swap
> > > > > device. This commit moves the percpu cluster variable out of being tied
> > > > > to individual swap devices, making it a global percpu variable, and will
> > > > > be used directly for allocation as a fast path.
> > > > >
> > > > > The global percpu cluster variable will never point to a HDD device, and
> > > > > allocation on HDD devices is still globally serialized.
> > > > >
> > > > > This improves the allocator performance and prepares for removal of the
> > > > > slot cache in later commits. There shouldn't be much observable behavior
> > > > > change, except one thing: this changes how swap device allocation
> > > > > rotation works.
> > > > >
> > > > > Currently, each allocation will rotate the plist, and because of the
> > > > > existence of slot cache (64 entries), swap devices of the same priority
> > > > > are rotated for every 64 entries consumed. And, high order allocations
> > > > > are different, they will bypass the slot cache, and so swap device is
> > > > > rotated for every 16K, 32K, or up to 2M allocation.
> > > > >
> > > > > The rotation rule was never clearly defined or documented, it was changed
> > > > > several times without mentioning too.
> > > > >
> > > > > After this commit, once slot cache is gone in later commits, swap device
> > > > > rotation will happen for every consumed cluster. Ideally non-HDD devices
> > > > > will be rotated if 2M space has been consumed for each order, this seems
> > > >
> > > > This breaks the rule where the high priority swap device is always taken
> > > > to allocate as long as there's free space in the device. After this patch,
> > > > it will try the percpu cluster firstly which is lower priority even though
> > > > the higher priority device has free space. However, this only happens when
> > > > the higher priority device is exhausted, not a generic case. If this is
> > > > expected, it may need be mentioned in log or doc somewhere at least.
> > >
> > > Hmm, actually this rule was already broken if you are very strict
> > > about it. The current percpu slot cache does a pre-allocation, so the
> > > high priority device will be removed from the plist while some CPU's
> > > slot cache holding usable entries.
> > >
> > > If the high priority device is exhausted, some CPU's percpu cluster
> > > will point to a low priority device indeed, and keep using it until
> > > the percpu cluster is drained. I think this should be
> > > OK. The high priority device is already full, so the amount of
> > > swapouts falls back to low priority device is only a performance
> > > issue, I think it's a tiny change for a rare case.
> > >
> > > >
> > > > > reasonable. HDD devices is rotated for every allocation regardless of the
> > > > > allocation order, which should be OK and trivial.
> > > > >
> > > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > > ---
> > > > >  include/linux/swap.h |  11 ++--
> > > > >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
> > > > >  2 files changed, 79 insertions(+), 52 deletions(-)
> > > > ......
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index ae3bd0a862fc..791cd7ed5bdf 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > > > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> > > > >
> > > > ......snip....
> > > > >  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > > >  {
> > > > >       int order = swap_entry_order(entry_order);
> > > > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > > >       int n_ret = 0;
> > > > >       int node;
> > > > >
> > > > > +     /* Fast path using percpu cluster */
> > > > > +     local_lock(&percpu_swap_cluster.lock);
> > > > > +     n_ret = swap_alloc_fast(swp_entries,
> > > > > +                             SWAP_HAS_CACHE,
> > > > > +                             order, n_goal);
> > > > > +     if (n_ret == n_goal)
> > > > > +             goto out;
> > > > > +
> > > > > +     n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> > > >
> > > > Here, the behaviour is changed too. In old allocation, partial
> > > > allocation will jump out to return. In this patch, you try the percpu
> > > > cluster firstly, then call scan_swap_map_slots() to try best and will
> > > > jump out even though partial allocation succeed. But the allocation from
> > > > scan_swap_map_slots() could happen on different si device, this looks
> > > > bizarre. Do you think we need reconsider the design?
> > >
> > > Right, that's a behavior change, but only temporarily affects slot cache.
> > > get_swap_pages will only be called with size > 1 when order == 0, and
> > > only by slot cache. (Large order allocation always use size == 1,
> > > other users only uses order == 0 && size == 1). So I didn't' notice
> > > it, as this series is removing slot cache.
> > >
> > > The partial side effect would be "returned slots will be from
> > > different devices" and "slot_cache may get drained faster as
> > > get_swap_pages may return less slots when percpu cluster is drained".
> > > Might be a performance issue but seems slight and trivial, slot cache
> > > can still work. And the next commit will just remove the slot cache,
> > > and the problem will be gone. I think I can add a comment about it
> > > here?
> >
> > By the way, another thing I suddenly think of is the percpu cluster
> > becoming glober over all devices. If one order on the stored si of
> > percpu cluster is used up, then the whole percpu cluster is drained and
> > rewritten. Won't this impact performance compared with the old embedding
> > percpu cluster in each si? In log you said "Ideally non-HDD devices
> > will be rotated if 2M space has been consumed for each order, this seems
> > reasonable.", while in reality it may be very difficult to achieve the
> > 'each 2M space has been consumed for each order', but more often happen
> > when very few of order's space has been consumed, then rewrite percpu.
> > Wonder what I have missed about this point.
> 
> Hi Baoquan,
> 
> > then the whole percpu cluster is drained and rewritten
> 
> Not sure what you mean, SWAP IO doesn't happen in units of clusters,
> cluster is only a management unit for slots, so only allocated / freed
> slot will be written. Discard is a different thing, and this should
> have very little effect on that.
> 
> Or you mean the percpu struct array getting updated? It should be even
> faster than before, updating a global percpu variable is easier to
> calculate the offset at runtime, using GS.

Yes, I mean the global percpu array updating.

> 
> > n reality it may be very difficult to achieve the 'each 2M space has been consumed for each order',
> 
> Very true, but notice for order >= 1, slot cache never worked before.
> And for order == 0, it's very likely that a cluster will have more
> than 64 slots usable. The test result I posted should be a good
> example, and device is very full during the test, and performance is
> basically identical to before. My only concern was about the device

My worry is that the global percpu cluster may impact performance when
there are multiple swap devices. Before, the per-si percpu cluster would
cache the valid offset in one cluster for each order. For multiple swap
devices, this consumes a little bit more percpu memory. But the new
global percpu cluster could easily be updated to point at a different
swap device as soon as a single order runs out, and then the whole array
becomes invalid. That looks a little drastic compared with before.

Yeah, the example you showed looks good. Wonder how many swap devices are
simulated in your example.

> rotating, as slot cache never worked for order >= 1, so the device
> rotates was very frequently. But still seems no one really cared about
> it, mthp swapout is a new thing and the previous rotation rule seems
> even more confusing than this new idea.

I have never dealt with a real production environment with multiple
tiers and many swap devices. In reality, to my limited knowledge, usually
only one swap device is deployed. If that's true most of the time, the
old code or the new code is fine; otherwise, it seems we may need to
consider the impact.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-20  2:35           ` Baoquan He
@ 2025-02-20  2:48             ` Kairui Song
  2025-02-20  3:24               ` Baoquan He
  0 siblings, 1 reply; 30+ messages in thread
From: Kairui Song @ 2025-02-20  2:48 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On Thu, Feb 20, 2025 at 10:35 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 02/19/25 at 07:12pm, Kairui Song wrote:
> >
> > > n reality it may be very difficult to achieve the 'each 2M space has been consumed for each order',
> >
> > Very true, but notice for order >= 1, slot cache never worked before.
> > And for order == 0, it's very likely that a cluster will have more
> > than 64 slots usable. The test result I posted should be a good
> > example, and device is very full during the test, and performance is
> > basically identical to before. My only concern was about the device
>
> My worry is the global percpu cluster may impact performance among
> multiple swap devices. Before, per si percpu cluster will cache the
> valid offset in one cluster for each order. For multiple swap devices,
> this consumes a little bit more percpu memory. While the new global
> percpu cluster could be updated to a different swap device easily only
> of one order is available, then the whole array is invalid. That looks a
> little drastic cmpared with before.

Ah, now I get what you mean. That does seem like it could be a problem indeed.

I think I can change the

+struct percpu_swap_cluster {
+       struct swap_info_struct *si;

to

+struct percpu_swap_cluster {
+       struct swap_info_struct *si[SWAP_NR_ORDERS];

Or embed the swp type in the offset, so that orders won't affect
each other. What do you think?

Previously, high order allocations bypassed the slot cache, so allocations
could happen on different same-priority devices too. So the behaviour
of each order using a different device should be acceptable.

>
> Yeah, the example you shown looks good. Wonder how many swap devices are
> simulated in your example.
>
> > rotating, as slot cache never worked for order >= 1, so the device
> > rotates was very frequently. But still seems no one really cared about
> > it, mthp swapout is a new thing and the previous rotation rule seems
> > even more confusing than this new idea.
>
> I never contact a real product environment with multiple tier and
> many swap devices. In reality, with my shallow knowledge, usually only
> one swap device is deployed. If that's true in most of time, the old
> code or new code is fine, otherwise, seems we may need consider the
> impact.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path
  2025-02-20  2:48             ` Kairui Song
@ 2025-02-20  3:24               ` Baoquan He
  0 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-02-20  3:24 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/20/25 at 10:48am, Kairui Song wrote:
> On Thu, Feb 20, 2025 at 10:35 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 02/19/25 at 07:12pm, Kairui Song wrote:
> > >
> > > > n reality it may be very difficult to achieve the 'each 2M space has been consumed for each order',
> > >
> > > Very true, but notice for order >= 1, slot cache never worked before.
> > > And for order == 0, it's very likely that a cluster will have more
> > > than 64 slots usable. The test result I posted should be a good
> > > example, and device is very full during the test, and performance is
> > > basically identical to before. My only concern was about the device
> >
> > My worry is the global percpu cluster may impact performance among
> > multiple swap devices. Before, per si percpu cluster will cache the
> > valid offset in one cluster for each order. For multiple swap devices,
> > this consumes a little bit more percpu memory. While the new global
> > percpu cluster could be updated to a different swap device easily only
> > of one order is available, then the whole array is invalid. That looks a
> > little drastic cmpared with before.
> 
> Ah, now I got what you mean. That's seems could be a problem indeed.
> 
> I think I can change the
> 
> +struct percpu_swap_cluster {
> +       struct swap_info_struct *si;
> 
> to
> 
> +struct percpu_swap_cluster {
> +       struct swap_info_struct *si[SWAP_NR_ORDERS];
> 
> Or embed the swp type in the offset, this way each order won't affect
> each other.  How do you think?

Yes, this looks much better. You may need to store both si and offset;
the struct percpu_swap_cluster demonstrated above lacks the offset, which
seems not good.
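
Something like the following, purely as a sketch of the idea (the exact
fields in the next revision may of course differ):

	struct percpu_swap_cluster {
		/* one cached device and cluster offset per allocation order */
		struct swap_info_struct *si[SWAP_NR_ORDERS];
		unsigned long offset[SWAP_NR_ORDERS];
		local_lock_t lock;
	};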

> 
> Previously high order allocation will bypass slot cache so allocation
> could happen on different same priority devices too. So the behaviour
> that each order using different device should be acceptable.
> 
> >
> > Yeah, the example you shown looks good. Wonder how many swap devices are
> > simulated in your example.
> >
> > > rotating, as slot cache never worked for order >= 1, so the device
> > > rotates was very frequently. But still seems no one really cared about
> > > it, mthp swapout is a new thing and the previous rotation rule seems
> > > even more confusing than this new idea.
> >
> > I never contact a real product environment with multiple tier and
> > many swap devices. In reality, with my shallow knowledge, usually only
> > one swap device is deployed. If that's true in most of time, the old
> > code or new code is fine, otherwise, seems we may need consider the
> > impact.
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] mm, swap: remove swap slot cache
  2025-02-14 17:57 ` [PATCH 6/7] mm, swap: remove swap slot cache Kairui Song
  2025-02-15 16:23   ` kernel test robot
@ 2025-02-20  7:55   ` Baoquan He
  2025-02-24  3:16     ` Kairui Song
  1 sibling, 1 reply; 30+ messages in thread
From: Baoquan He @ 2025-02-20  7:55 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

Hi Kairui,

On 02/15/25 at 01:57am, Kairui Song wrote:
......snip....  
> -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> +swp_entry_t folio_alloc_swap(struct folio *folio)
>  {
> -	int order = swap_entry_order(entry_order);
> -	unsigned long size = 1 << order;
> +	unsigned int order = folio_order(folio);
> +	unsigned int size = 1 << order;
>  	struct swap_info_struct *si, *next;
> -	int n_ret = 0;
> +	swp_entry_t entry = {};
> +	unsigned long offset;
>  	int node;
>  
> +	if (order) {
> +		/*
> +		 * Should not even be attempting large allocations when huge
> +		 * page swap is disabled. Warn and fail the allocation.
> +		 */
> +		if (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER) {
> +			VM_WARN_ON_ONCE(1);
> +			return entry;
> +		}
> +	}
> +
>  	/* Fast path using percpu cluster */
>  	local_lock(&percpu_swap_cluster.lock);
> -	n_ret = swap_alloc_fast(swp_entries,
> -				SWAP_HAS_CACHE,
> -				order, n_goal);
> -	if (n_ret == n_goal)
> -		goto out;
> +	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
> +		goto out_alloced;
>  
> -	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
>  	/* Rotate the device and switch to a new cluster */
>  	spin_lock(&swap_avail_lock);
>  start_over:
> @@ -1268,11 +1236,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  		spin_unlock(&swap_avail_lock);
>  		if (get_swap_device_info(si)) {
> -			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
> -					swp_entries + n_ret, order);
> +			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
>  			put_swap_device(si);
> -			if (n_ret || size > 1)
> -				goto out;
> +			if (offset) {
> +				entry = swp_entry(si->type, offset);
> +				goto out_alloced;
> +			}
> +			if (order)
> +				goto out_failed;

This is not related to this patch, but do you know why the non-order-0
case can't start over on different devices?

>  		}
>  
>  		spin_lock(&swap_avail_lock);
> @@ -1291,10 +1262,20 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  			goto start_over;
>  	}
>  	spin_unlock(&swap_avail_lock);
> -out:
> +out_failed:
> +	local_unlock(&percpu_swap_cluster.lock);
> +	return entry;
> +
> +out_alloced:
>  	local_unlock(&percpu_swap_cluster.lock);
> -	atomic_long_sub(n_ret * size, &nr_swap_pages);
> -	return n_ret;
> +	if (mem_cgroup_try_charge_swap(folio, entry)) {
> +		put_swap_folio(folio, entry);
> +		entry.val = 0;
> +	} else {
> +		atomic_long_sub(size, &nr_swap_pages);
> +	}
> +
> +	return entry;
>  }
>  
>  static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
......snip....
> @@ -2623,16 +2591,6 @@ static bool __has_usable_swap(void)
>  	return !plist_head_empty(&swap_active_head);
>  }

It seems the __has_usable_swap() function needs to be moved into the
ifdeffery scope where __folio_throttle_swaprate() is located to fix the
lkp warning.

>  
> -bool has_usable_swap(void)
> -{
> -	bool ret;
> -
> -	spin_lock(&swap_lock);
> -	ret = __has_usable_swap();
> -	spin_unlock(&swap_lock);
> -	return ret;
> -}
> -
>  /*
>   * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
>   * see the updated flags, so there will be no more allocations.

Other than the warning reported by the test robot, this patch looks good to me.
Thanks.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] mm, swap: simplify folio swap allocation
  2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
                     ` (2 preceding siblings ...)
  2025-02-15 16:54   ` kernel test robot
@ 2025-02-20 10:41   ` Baoquan He
  3 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-02-20 10:41 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On 02/15/25 at 01:57am, Kairui Song wrote:
......snip..  
> -swp_entry_t folio_alloc_swap(struct folio *folio)
> +/* Rotate the device and switch to a new cluster */
> +static bool swap_alloc_rotate(swp_entry_t *entry,
> +			      unsigned char usage,
> +			      int order)

The function name is misleading and may make people think it's an HDD
swap allocation. I would call it swap_alloc_slow(), as the counterpart to
swap_alloc_fast().
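
With that name, the caller would read as a fast/slow pair, e.g. (only a
sketch of the shape, not the exact code):

	local_lock(&percpu_swap_cluster.lock);
	if (!swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
		swap_alloc_slow(&entry, SWAP_HAS_CACHE, order);
	local_unlock(&percpu_swap_cluster.lock);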

>  {
> -	unsigned int order = folio_order(folio);
> -	unsigned int size = 1 << order;
> -	struct swap_info_struct *si, *next;
> -	swp_entry_t entry = {};
> -	unsigned long offset;
>  	int node;
> +	unsigned long offset;
> +	struct swap_info_struct *si, *next;
>  
> -	if (order) {
> -		/*
> -		 * Should not even be attempting large allocations when huge
> -		 * page swap is disabled. Warn and fail the allocation.
> -		 */
> -		if (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER) {
> -			VM_WARN_ON_ONCE(1);
> -			return entry;
> -		}
> -	}
> -
> -	/* Fast path using percpu cluster */
> -	local_lock(&percpu_swap_cluster.lock);
> -	if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
> -		goto out_alloced;
> -
> -	/* Rotate the device and switch to a new cluster */
> +	node = numa_node_id();
>  	spin_lock(&swap_avail_lock);
>  start_over:
> -	node = numa_node_id();
>  	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> +		/* Rotate the device and switch to a new cluster */
>  		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  		spin_unlock(&swap_avail_lock);
>  		if (get_swap_device_info(si)) {
>  			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
>  			put_swap_device(si);
>  			if (offset) {
> -				entry = swp_entry(si->type, offset);
> -				goto out_alloced;
> +				*entry = swp_entry(si->type, offset);
> +				return true;
>  			}
>  			if (order)
> -				goto out_failed;
> +				return false;
>  		}
>  
>  		spin_lock(&swap_avail_lock);



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] mm, swap: remove swap slot cache
  2025-02-20  7:55   ` Baoquan He
@ 2025-02-24  3:16     ` Kairui Song
  0 siblings, 0 replies; 30+ messages in thread
From: Kairui Song @ 2025-02-24  3:16 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel

On Thu, Feb 20, 2025 at 3:56 PM Baoquan He <bhe@redhat.com> wrote:
>
> Hi Kairui,
>
> On 02/15/25 at 01:57am, Kairui Song wrote:
> ......snip....
> > -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > +swp_entry_t folio_alloc_swap(struct folio *folio)
> >  {
> > -     int order = swap_entry_order(entry_order);
> > -     unsigned long size = 1 << order;
> > +     unsigned int order = folio_order(folio);
> > +     unsigned int size = 1 << order;
> >       struct swap_info_struct *si, *next;
> > -     int n_ret = 0;
> > +     swp_entry_t entry = {};
> > +     unsigned long offset;
> >       int node;
> >
> > +     if (order) {
> > +             /*
> > +              * Should not even be attempting large allocations when huge
> > +              * page swap is disabled. Warn and fail the allocation.
> > +              */
> > +             if (!IS_ENABLED(CONFIG_THP_SWAP) || size > SWAPFILE_CLUSTER) {
> > +                     VM_WARN_ON_ONCE(1);
> > +                     return entry;
> > +             }
> > +     }
> > +
> >       /* Fast path using percpu cluster */
> >       local_lock(&percpu_swap_cluster.lock);
> > -     n_ret = swap_alloc_fast(swp_entries,
> > -                             SWAP_HAS_CACHE,
> > -                             order, n_goal);
> > -     if (n_ret == n_goal)
> > -             goto out;
> > +     if (swap_alloc_fast(&entry, SWAP_HAS_CACHE, order))
> > +             goto out_alloced;
> >
> > -     n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> >       /* Rotate the device and switch to a new cluster */
> >       spin_lock(&swap_avail_lock);
> >  start_over:
> > @@ -1268,11 +1236,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> >               plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> >               spin_unlock(&swap_avail_lock);
> >               if (get_swap_device_info(si)) {
> > -                     n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
> > -                                     swp_entries + n_ret, order);
> > +                     offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> >                       put_swap_device(si);
> > -                     if (n_ret || size > 1)
> > -                             goto out;
> > +                     if (offset) {
> > +                             entry = swp_entry(si->type, offset);
> > +                             goto out_alloced;
> > +                     }
> > +                     if (order)
> > +                             goto out_failed;
>
> This is not related to this patch, do you know why non order-0 case
> can't start over on different devices?

I think that might be an existing bug... I just didn't change it as it's
kind of trivial. The comment "Swapfile is not block device so unable to
allocate large entries.", which I didn't change either, also looks
strange, but I prefer to fix both later as the background seems a bit
complex to explain.

>
> >               }
> >
> >               spin_lock(&swap_avail_lock);
> > @@ -1291,10 +1262,20 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> >                       goto start_over;
> >       }
> >       spin_unlock(&swap_avail_lock);
> > -out:
> > +out_failed:
> > +     local_unlock(&percpu_swap_cluster.lock);
> > +     return entry;
> > +
> > +out_alloced:
> >       local_unlock(&percpu_swap_cluster.lock);
> > -     atomic_long_sub(n_ret * size, &nr_swap_pages);
> > -     return n_ret;
> > +     if (mem_cgroup_try_charge_swap(folio, entry)) {
> > +             put_swap_folio(folio, entry);
> > +             entry.val = 0;
> > +     } else {
> > +             atomic_long_sub(size, &nr_swap_pages);
> > +     }
> > +
> > +     return entry;
> >  }
> >
> >  static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
> ......snip....
> > @@ -2623,16 +2591,6 @@ static bool __has_usable_swap(void)
> >       return !plist_head_empty(&swap_active_head);
> >  }
>
> seems the __has_usable_swap() function need be moved into the ifdeffery
> scope where __folio_throttle_swaprate() is located to fix the lkp
> warning.

Yes, will fix the bot warning.


>
> >
> > -bool has_usable_swap(void)
> > -{
> > -     bool ret;
> > -
> > -     spin_lock(&swap_lock);
> > -     ret = __has_usable_swap();
> > -     spin_unlock(&swap_lock);
> > -     return ret;
> > -}
> > -
> >  /*
> >   * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
> >   * see the updated flags, so there will be no more allocations.
>
> Other than the test robot reported warning, this patch looks good to me.
> Thanks.
>
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-02-24  3:17 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-14 17:57 [PATCH 0/7] mm, swap: remove swap slot cache Kairui Song
2025-02-14 17:57 ` [PATCH 1/7] mm, swap: avoid reclaiming irrelevant swap cache Kairui Song
2025-02-19  2:11   ` Baoquan He
2025-02-14 17:57 ` [PATCH 2/7] mm, swap: drop the flag TTRS_DIRECT Kairui Song
2025-02-19  2:42   ` Baoquan He
2025-02-14 17:57 ` [PATCH 3/7] mm, swap: avoid redundant swap device pinning Kairui Song
2025-02-19  3:35   ` Baoquan He
2025-02-14 17:57 ` [PATCH 4/7] mm, swap: don't update the counter up-front Kairui Song
2025-02-14 17:57 ` [PATCH 5/7] mm, swap: use percpu cluster as allocation fast path Kairui Song
2025-02-19  7:53   ` Baoquan He
2025-02-19  8:34     ` Kairui Song
2025-02-19  9:26       ` Baoquan He
2025-02-19 10:55       ` Baoquan He
2025-02-19 11:12         ` Kairui Song
2025-02-20  2:35           ` Baoquan He
2025-02-20  2:48             ` Kairui Song
2025-02-20  3:24               ` Baoquan He
2025-02-14 17:57 ` [PATCH 6/7] mm, swap: remove swap slot cache Kairui Song
2025-02-15 16:23   ` kernel test robot
2025-02-20  7:55   ` Baoquan He
2025-02-24  3:16     ` Kairui Song
2025-02-14 17:57 ` [PATCH 7/7] mm, swap: simplify folio swap allocation Kairui Song
2025-02-14 20:13   ` Matthew Wilcox
2025-02-15  6:40     ` Kairui Song
2025-02-15 16:43   ` kernel test robot
2025-02-15 16:54   ` kernel test robot
2025-02-20 10:41   ` Baoquan He
2025-02-15 10:27 ` [PATCH 0/7] mm, swap: remove swap slot cache Baoquan He
2025-02-15 13:34   ` Kairui Song
2025-02-15 15:07     ` Baoquan He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox