* [RFC PATCH v1 1/5] mm: swap: Simplify end-of-cluster calculation
From: Ryan Roberts @ 2024-06-18 23:26 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand
Cc: Ryan Roberts, linux-kernel, linux-mm
It's possible that a swap file will have a partial cluster at the end, if
the swap size is not a multiple of the cluster size. But this partial
cluster will never be marked free and so scan_swap_map_try_ssd_cluster()
will never see it. Therefore it can always consider that a cluster ends
at the next cluster boundary.
This leads to a simplification of the endpoint calculation and removal
of an unnecessary conditional.
This change has the useful side effect of making lock_cluster()
unconditional, which will be used in a later commit.
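For illustration only (none of this is in the patch), the boundary
arithmetic reduces to the sketch below; SWAPFILE_CLUSTER is given an
example value and ALIGN is redefined locally to keep it self-contained:
  #define SWAPFILE_CLUSTER 256UL /* example value; the real one is config dependent */
  #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))
  /* For any offset tmp, the scan limit is the start of the next cluster. */
  static unsigned long scan_limit(unsigned long tmp)
  {
          return ALIGN(tmp + 1, SWAPFILE_CLUSTER); /* e.g. tmp == 300 -> 512 */
  }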
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/swapfile.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b3e5e384e330..30e79739dfdc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -677,16 +677,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
* check if there is still free entry in the cluster, maintaining
* natural alignment.
*/
- max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
- if (tmp < max) {
- ci = lock_cluster(si, tmp);
- while (tmp < max) {
- if (swap_range_empty(si->swap_map, tmp, nr_pages))
- break;
- tmp += nr_pages;
- }
- unlock_cluster(ci);
+ max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
+ ci = lock_cluster(si, tmp);
+ while (tmp < max) {
+ if (swap_range_empty(si->swap_map, tmp, nr_pages))
+ break;
+ tmp += nr_pages;
}
+ unlock_cluster(ci);
if (tmp >= max) {
cluster->next[order] = SWAP_NEXT_INVALID;
goto new_cluster;
--
2.43.0
* [RFC PATCH v1 2/5] mm: swap: Change SWAP_NEXT_INVALID to highest value
From: Ryan Roberts @ 2024-06-18 23:26 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand
Cc: Ryan Roberts, linux-kernel, linux-mm
We are about to introduce a scanning mechanism that can present 0 as a
valid cluster offset to scan_swap_map_try_ssd_cluster(), so let's change
SWAP_NEXT_INVALID to UINT_MAX, which is always invalid as an offset in
practice.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/swap.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bd450023b9a4..66566251ba31 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,12 +261,12 @@ struct swap_cluster_info {
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
/*
- * The first page in the swap file is the swap header, which is always marked
- * bad to prevent it from being allocated as an entry. This also prevents the
- * cluster to which it belongs being marked free. Therefore 0 is safe to use as
- * a sentinel to indicate next is not valid in percpu_cluster.
+ * swap_info_struct::max is an unsigned int, so the maximum number of pages in
+ * the swap file is UINT_MAX. Therefore the highest legitimate index is
+ * UINT_MAX-1. Therefore UINT_MAX is safe to use as a sentinel to indicate next
+ * is not valid in percpu_cluster.
*/
-#define SWAP_NEXT_INVALID 0
+#define SWAP_NEXT_INVALID UINT_MAX
#ifdef CONFIG_THP_SWAP
#define SWAP_NR_ORDERS (PMD_ORDER + 1)
--
2.43.0
* [RFC PATCH v1 3/5] mm: swap: Track allocation order for clusters
From: Ryan Roberts @ 2024-06-18 23:26 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand
Cc: Ryan Roberts, linux-kernel, linux-mm
Add an `order` field to `struct swap_cluster_info`, which applies to
allocated clusters (i.e. those not on the free list) and tracks the swap
entry order that the cluster should be used to allocate. A future commit
will use this information to scan partially filled clusters to find
appropriate free swap entries for allocation. Note that it is still
possible that order-0 swap entries will be allocated in clusters that
indicate a higher order due to the order-0 scanning mechanism.
The maximum order we ever expect to see is 13 - PMD-size on arm64 with
64K base pages. 13 fits into 4 bits, so let's steal 4 unused flags bits
for this purpose to avoid making `struct swap_cluster_info` any bigger.
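As a quick sanity check (an illustrative sketch, not the diff below), the
packed bitfields still fit in a single 32-bit word:
  /* 24 + 4 + 4 == 32 bits. */
  struct swap_cluster_info_sketch {
          unsigned int data:24; /* entry count, or free-list link when free */
          unsigned int flags:4; /* CLUSTER_FLAG_* values (1 and 2) fit easily */
          unsigned int order:4; /* 0..15, enough for the maximum order of 13 */
  };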
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/swap.h | 3 ++-
mm/swapfile.c | 24 +++++++++++++++---------
2 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 66566251ba31..2a40fe02d281 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,7 +255,8 @@ struct swap_cluster_info {
* cluster
*/
unsigned int data:24;
- unsigned int flags:8;
+ unsigned int flags:4;
+ unsigned int order:4;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 30e79739dfdc..7b13f02a7ac2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -307,11 +307,13 @@ static inline void cluster_set_count(struct swap_cluster_info *info,
info->data = c;
}
-static inline void cluster_set_count_flag(struct swap_cluster_info *info,
- unsigned int c, unsigned int f)
+static inline void cluster_set_count_flag_order(struct swap_cluster_info *info,
+ unsigned int c, unsigned int f,
+ unsigned int o)
{
info->flags = f;
info->data = c;
+ info->order = o;
}
static inline unsigned int cluster_next(struct swap_cluster_info *info)
@@ -330,6 +332,7 @@ static inline void cluster_set_next_flag(struct swap_cluster_info *info,
{
info->flags = f;
info->data = n;
+ info->order = 0;
}
static inline bool cluster_is_free(struct swap_cluster_info *info)
@@ -346,6 +349,7 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
{
info->flags = CLUSTER_FLAG_NEXT_NULL;
info->data = 0;
+ info->order = 0;
}
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -521,13 +525,14 @@ static void swap_users_ref_free(struct percpu_ref *ref)
complete(&si->comp);
}
-static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx,
+ int order)
{
struct swap_cluster_info *ci = si->cluster_info;
VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
cluster_list_del_first(&si->free_clusters, ci);
- cluster_set_count_flag(ci + idx, 0, 0);
+ cluster_set_count_flag_order(ci + idx, 0, 0, order);
}
static void free_cluster(struct swap_info_struct *si, unsigned long idx)
@@ -556,14 +561,15 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
static void add_cluster_info_page(struct swap_info_struct *p,
struct swap_cluster_info *cluster_info, unsigned long page_nr,
- unsigned long count)
+ int order)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+ unsigned long count = 1 << order;
if (!cluster_info)
return;
if (cluster_is_free(&cluster_info[idx]))
- alloc_cluster(p, idx);
+ alloc_cluster(p, idx, order);
VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
cluster_set_count(&cluster_info[idx],
@@ -577,7 +583,7 @@ static void add_cluster_info_page(struct swap_info_struct *p,
static void inc_cluster_info_page(struct swap_info_struct *p,
struct swap_cluster_info *cluster_info, unsigned long page_nr)
{
- add_cluster_info_page(p, cluster_info, page_nr, 1);
+ add_cluster_info_page(p, cluster_info, page_nr, 0);
}
/*
@@ -964,7 +970,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
goto done;
}
memset(si->swap_map + offset, usage, nr_pages);
- add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
+ add_cluster_info_page(si, si->cluster_info, offset, order);
unlock_cluster(ci);
swap_range_alloc(si, offset, nr_pages);
@@ -1060,7 +1066,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
- cluster_set_count_flag(ci, 0, 0);
+ cluster_set_count_flag_order(ci, 0, 0, 0);
free_cluster(si, idx);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
--
2.43.0
* [RFC PATCH v1 4/5] mm: swap: Scan for free swap entries in allocated clusters
From: Ryan Roberts @ 2024-06-18 23:26 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand
Cc: Ryan Roberts, linux-kernel, linux-mm
Previously mTHP would only be swapped out if a CPU could allocate itself
a free cluster from which to allocate mTHP-sized contiguous swap entry
blocks. But on a system making heavy use of swap, fragmentation
eventually ensures there are no free clusters available, so the swap
entry allocation fails and the mTHP is forced to be split to base pages,
which then get swap entries allocated by scanning the swap file for free
individual pages.
But when swap entries are freed, this makes holes in the clusters, and
often it would be possible to allocate new mTHP swap entries in those
holes.
So if we fail to allocate a free cluster, scan through the clusters
until we find one that is in use and contains swap entries of the order
we require. Then scan it until we find a suitably sized and aligned
hole. We keep a per-order "next cluster to scan" pointer so that future
scanning can be picked up from where we last left off. And if we scan
through all clusters without finding a suitable hole, we give up to
prevent livelock.
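Roughly, the scan behaves like the following sketch (simplified; locking
is omitted and cluster_has_aligned_hole() is a made-up stand-in for the
per-cluster search in the real code):
  /* Simplified model of the per-order scan; not the exact code below. */
  static int scan_used_clusters(struct swap_info_struct *si, int order)
  {
          unsigned int stop = SWAP_NEXT_INVALID; /* filled in on first call */
          unsigned int offset;
          /*
           * Walk allocated clusters of this order circularly from the
           * per-order resume point, stopping after one full lap.
           */
          while ((offset = next_cluster_for_scan(si, order, &stop)) !=
                 SWAP_NEXT_INVALID) {
                  if (cluster_has_aligned_hole(si, offset, order))
                          return offset; /* 1 << order free, aligned slots */
          }
          return -1; /* scanned a full lap; give up to avoid livelock */
  }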
Running the test case provided by Barry Song at the link below, I can
see the swpout fallback rate, which was previously 100% after a few
iterations, fall to 0% and stay there for all 100 iterations. This is
also the case when sprinkling in some non-mTHP allocations ("-s") too.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
---
include/linux/swap.h | 2 +
mm/swapfile.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2a40fe02d281..34ec4668a5c9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,6 +310,8 @@ struct swap_info_struct {
unsigned int cluster_nr; /* countdown to next cluster search */
unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+ struct swap_cluster_info *next_order_scan[SWAP_NR_ORDERS];
+ /* Start cluster for next order-based scan */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev; /* swap device or bdev of swap file */
struct file *swap_file; /* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7b13f02a7ac2..24db03db8830 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -644,6 +644,84 @@ static inline bool swap_range_empty(char *swap_map, unsigned int start,
return true;
}
+static inline
+struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
+ unsigned int offset)
+{
+ VM_WARN_ON(!si->cluster_info);
+ return si->cluster_info + (offset / SWAPFILE_CLUSTER);
+}
+
+static inline
+unsigned int cluster_to_offset(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
+{
+ VM_WARN_ON(!si->cluster_info);
+ return (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+}
+
+static inline
+struct swap_cluster_info *next_cluster_circular(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
+{
+ struct swap_cluster_info *last;
+
+ /*
+ * Wrap after the last whole cluster; never return the final partial
+ * cluster because users assume an entire cluster is accessible.
+ */
+ last = offset_to_cluster(si, si->max) - 1;
+ return ci == last ? si->cluster_info : ++ci;
+}
+
+static inline
+struct swap_cluster_info *prev_cluster_circular(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
+{
+ struct swap_cluster_info *last;
+
+ /*
+ * Wrap to the last whole cluster; never return the final partial
+ * cluster because users assume an entire cluster is accessible.
+ */
+ last = offset_to_cluster(si, si->max) - 1;
+ return ci == si->cluster_info ? last : --ci;
+}
+
+/*
+ * Returns the offset of the next cluster, allocated to contain swap entries of
+ * `order`, that is eligible to scan for free space. On first call, *stop should
+ * be set to SWAP_NEXT_INVALID to indicate the clusters should be scanned all
+ * the way back around to the returned cluster. The function updates *stop upon
+ * first call and consumes it in subsequent calls. Returns SWAP_NEXT_INVALID if
+ * no such clusters are available. Must be called with si lock held.
+ */
+static unsigned int next_cluster_for_scan(struct swap_info_struct *si,
+ int order, unsigned int *stop)
+{
+ struct swap_cluster_info *ci;
+ struct swap_cluster_info *end;
+
+ ci = si->next_order_scan[order];
+ if (*stop == SWAP_NEXT_INVALID)
+ *stop = cluster_to_offset(si, prev_cluster_circular(si, ci));
+ end = offset_to_cluster(si, *stop);
+
+ while (ci != end) {
+ if ((ci->flags & CLUSTER_FLAG_FREE) == 0 && ci->order == order)
+ break;
+ ci = next_cluster_circular(si, ci);
+ }
+
+ if (ci == end) {
+ si->next_order_scan[order] = ci;
+ return SWAP_NEXT_INVALID;
+ }
+
+ si->next_order_scan[order] = next_cluster_circular(si, ci);
+ return cluster_to_offset(si, ci);
+}
+
/*
* Try to get swap entries with specified order from current cpu's swap entry
* pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -656,6 +734,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
struct percpu_cluster *cluster;
struct swap_cluster_info *ci;
unsigned int tmp, max;
+ unsigned int stop = SWAP_NEXT_INVALID;
new_cluster:
cluster = this_cpu_ptr(si->percpu_cluster);
@@ -674,6 +753,15 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
*scan_base = this_cpu_read(*si->cluster_next_cpu);
*offset = *scan_base;
goto new_cluster;
+ } else if (nr_pages < SWAPFILE_CLUSTER) {
+ /*
+ * There is no point in scanning for free areas the same
+ * size as the cluster, since the cluster would have
+ * already been freed in that case.
+ */
+ tmp = next_cluster_for_scan(si, order, &stop);
+ if (tmp == SWAP_NEXT_INVALID)
+ return false;
} else
return false;
}
@@ -2392,6 +2480,8 @@ static void setup_swap_info(struct swap_info_struct *p, int prio,
}
p->swap_map = swap_map;
p->cluster_info = cluster_info;
+ for (i = 0; i < SWAP_NR_ORDERS; i++)
+ p->next_order_scan[i] = cluster_info;
}
static void _enable_swap_info(struct swap_info_struct *p)
--
2.43.0
* [RFC PATCH v1 5/5] mm: swap: Optimize per-order cluster scanning
From: Ryan Roberts @ 2024-06-18 23:26 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand
Cc: Ryan Roberts, linux-kernel, linux-mm
Add a CLUSTER_FLAG_SKIP_SCAN cluster flag, which is applied to a cluster
under one of two conditions. When present, the cluster will be skipped
during a per-order scan.
- When the number of free entries is less than the number of entries
that would be required for a new allocation of the order that the
cluster serves.
- When scanning completes for the cluster, no further scanners are
active for the cluster, and no swap entries have been freed from the
cluster since the last scan began. In this case, it has been proven
that there are no contiguous free entries of sufficient size to
allocate the order that the cluster serves. The cluster is made
eligible for scanning again when the next entry is freed.
The latter is implemented to permit multiple CPUs to scan the same
cluster, which in turn guarantees that if there is a free block
available in a cluster allocated for the desired order then it will be
allocated on a first come, first served basis.
As a result, the number of active scanners for a cluster must be
tracked, costing 4 bytes per cluster.
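For orientation, the intended bookkeeping is roughly the following
sketch; it mirrors the cluster_inc_scanners()/cluster_dec_scanners()
helpers added below (function names here are placeholders):
  /* Sketch only; si->lock is held around both helpers. */
  static void scan_begin(struct swap_cluster_info *ci)
  {
          ci->nr_scanners++;
          ci->flags &= ~CLUSTER_FLAG_DECREMENT; /* forget earlier frees */
  }
  static void scan_end(struct swap_cluster_info *ci)
  {
          /*
           * Last scanner out and nothing was freed while any scan was in
           * flight: the cluster provably has no suitable hole, so skip it
           * until the next free makes it worth scanning again.
           */
          if (--ci->nr_scanners == 0 && !(ci->flags & CLUSTER_FLAG_DECREMENT))
                  ci->flags |= CLUSTER_FLAG_SKIP_SCAN;
  }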
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/swap.h | 3 +++
mm/swapfile.c | 36 ++++++++++++++++++++++++++++++++++--
2 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 34ec4668a5c9..40c308749e79 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -257,9 +257,12 @@ struct swap_cluster_info {
unsigned int data:24;
unsigned int flags:4;
unsigned int order:4;
+ unsigned int nr_scanners;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+#define CLUSTER_FLAG_SKIP_SCAN 4 /* Skip cluster for per-order scan */
+#define CLUSTER_FLAG_DECREMENT 8 /* A swap entry was freed from cluster */
/*
* swap_info_struct::max is an unsigned int, so the maximum number of pages in
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 24db03db8830..caf382b4ecd3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -574,6 +574,9 @@ static void add_cluster_info_page(struct swap_info_struct *p,
VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
cluster_set_count(&cluster_info[idx],
cluster_count(&cluster_info[idx]) + count);
+
+ if (SWAPFILE_CLUSTER - cluster_count(&cluster_info[idx]) < count)
+ cluster_info[idx].flags |= CLUSTER_FLAG_SKIP_SCAN;
}
/*
@@ -595,6 +598,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
struct swap_cluster_info *cluster_info, unsigned long page_nr)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+ unsigned long count = 1 << cluster_info[idx].order;
if (!cluster_info)
return;
@@ -603,6 +607,10 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
cluster_set_count(&cluster_info[idx],
cluster_count(&cluster_info[idx]) - 1);
+ cluster_info[idx].flags |= CLUSTER_FLAG_DECREMENT;
+ if (SWAPFILE_CLUSTER - cluster_count(&cluster_info[idx]) >= count)
+ cluster_info[idx].flags &= ~CLUSTER_FLAG_SKIP_SCAN;
+
if (cluster_count(&cluster_info[idx]) == 0)
free_cluster(p, idx);
}
@@ -708,7 +716,8 @@ static unsigned int next_cluster_for_scan(struct swap_info_struct *si,
end = offset_to_cluster(si, *stop);
while (ci != end) {
- if ((ci->flags & CLUSTER_FLAG_FREE) == 0 && ci->order == order)
+ if ((ci->flags & (CLUSTER_FLAG_SKIP_SCAN | CLUSTER_FLAG_FREE)) == 0
+ && ci->order == order)
break;
ci = next_cluster_circular(si, ci);
}
@@ -722,6 +731,21 @@ static unsigned int next_cluster_for_scan(struct swap_info_struct *si,
return cluster_to_offset(si, ci);
}
+static inline void cluster_inc_scanners(struct swap_cluster_info *ci)
+{
+ /* Protected by si lock. */
+ ci->nr_scanners++;
+ ci->flags &= ~CLUSTER_FLAG_DECREMENT;
+}
+
+static inline void cluster_dec_scanners(struct swap_cluster_info *ci)
+{
+ /* Protected by si lock. */
+ ci->nr_scanners--;
+ if (ci->nr_scanners == 0 && (ci->flags & CLUSTER_FLAG_DECREMENT) == 0)
+ ci->flags |= CLUSTER_FLAG_SKIP_SCAN;
+}
+
/*
* Try to get swap entries with specified order from current cpu's swap entry
* pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -764,6 +788,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
return false;
} else
return false;
+
+ cluster_inc_scanners(offset_to_cluster(si, tmp));
}
/*
@@ -780,13 +806,19 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
}
unlock_cluster(ci);
if (tmp >= max) {
+ cluster_dec_scanners(ci);
cluster->next[order] = SWAP_NEXT_INVALID;
goto new_cluster;
}
*offset = tmp;
*scan_base = tmp;
tmp += nr_pages;
- cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
+ if (tmp >= max) {
+ cluster_dec_scanners(ci);
+ cluster->next[order] = SWAP_NEXT_INVALID;
+ } else {
+ cluster->next[order] = tmp;
+ }
return true;
}
--
2.43.0
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Huang, Ying @ 2024-06-19 7:19 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton
Cc: Chris Li, Kairui Song, Kalesh Singh, Barry Song, Hugh Dickins,
David Hildenbrand, linux-kernel, linux-mm
Hi, Ryan,
Ryan Roberts <ryan.roberts@arm.com> writes:
> Hi All,
>
> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
> entry allocator.
I don't think the original behavior is something like a mess. It's just
the first step in the correct direction. It's straightforward and
obviously correct. Then, we can optimize it step by step with data to
justify the increased complexity.
> But Barry posted a test program and results at [2] showing that
> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
> some cases). I was interested in why that might be and ended up putting this PoC
> patch set together to try to get a better understanding. This series ends up
> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
> much testing beyond that (yet) but thought it was worth posting on the strength
> of that result alone.
>
> At a high level this works in a similar way to Chris's series; it marks a
> cluster as being for a particular order and if a new cluster cannot be allocated
> then it scans through the existing non-full clusters. But it does it by scanning
> through the clusters rather than assembling them into a list. Cluster flags are
> used to mark clusters that have been scanned and are known not to have enough
> contiguous space, so the efficiency should be similar in practice.
>
> Because its not based around a linked list, there is less churn and I'm
> wondering if this is perhaps easier to review and potentially even get into
> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
> for Chris's series? I know Chris has a larger roadmap of improvements, so at
> best I see this as a tactical fix that will ultimately be superseeded by Chris's
> work.
I don't think we need any mTHP swap entry allocation optimization to go
into v6.10-rcX. There's no functionality or performance regression.
Per my understanding, we merge optimization when it's ready.
Hi, Andrew,
Please correct me if you don't agree.
[snip]
--
Best Regards,
Huang, Ying
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Ryan Roberts @ 2024-06-19 9:17 UTC (permalink / raw)
To: Huang, Ying, Andrew Morton
Cc: Chris Li, Kairui Song, Kalesh Singh, Barry Song, Hugh Dickins,
David Hildenbrand, linux-kernel, linux-mm
On 19/06/2024 08:19, Huang, Ying wrote:
> Hi, Ryan,
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> Hi All,
>>
>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
>> entry allocator.
>
> I don't think the original behavior is something like a mess. It's just
> the first step in the correct direction. It's straightforward and
> obviously correct. Then, we can optimize it step by step with data to
> justify the increased complexity.
OK, perhaps I was over-egging it by calling it a "mess". What you're describing
was my initial opinion too, but I saw Andrew complaining that we shouldn't be
merging a feature if it doesn't work. This series fixes the problem in a minimal
way - if you ignore the last patch, which really is just a performance
optimization and could be dropped.
If we can ultimately get Chris's series to 0% fallback like this one, and
everyone is happy with the current state for v6.10, then agreed - let's
concentrate on Chris's series for v6.11.
Thanks,
Ryan
>
>> But Barry posted a test program and results at [2] showing that
>> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
>> some cases). I was interested in why that might be and ended up putting this PoC
>> patch set together to try to get a better understanding. This series ends up
>> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
>> much testing beyond that (yet) but thought it was worth posting on the strength
>> of that result alone.
>>
>> At a high level this works in a similar way to Chris's series; it marks a
>> cluster as being for a particular order and if a new cluster cannot be allocated
>> then it scans through the existing non-full clusters. But it does it by scanning
>> through the clusters rather than assembling them into a list. Cluster flags are
>> used to mark clusters that have been scanned and are known not to have enough
>> contiguous space, so the efficiency should be similar in practice.
>>
>> Because its not based around a linked list, there is less churn and I'm
>> wondering if this is perhaps easier to review and potentially even get into
>> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
>> for Chris's series? I know Chris has a larger roadmap of improvements, so at
>> best I see this as a tactical fix that will ultimately be superseeded by Chris's
>> work.
>
> I don't think we need any mTHP swap entry allocation optimization to go
> into v6.10-rcX. There's no functionality or performance regression.
> Per my understanding, we merge optimization when it's ready.
>
> Hi, Andrew,
>
> Please correct me if you don't agree.
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Huang, Ying @ 2024-06-20 1:34 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Chris Li, Kairui Song, Kalesh Singh, Barry Song,
Hugh Dickins, David Hildenbrand, linux-kernel, linux-mm
Ryan Roberts <ryan.roberts@arm.com> writes:
> On 19/06/2024 08:19, Huang, Ying wrote:
>> Hi, Ryan,
>>
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> Hi All,
>>>
>>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
>>> entry allocator.
>>
>> I don't think the original behavior is something like a mess. It's just
>> the first step in the correct direction. It's straightforward and
>> obviously correct. Then, we can optimize it step by step with data to
>> justify the increased complexity.
>
> OK, perhaps I was over-egging it by calling it a "mess". What you're describing
> was my initial opinion too, but I saw Andrew complaining that we shouldn't be
> merging a feature if it doesn't work.
I don't think it doesn't work. It just works well for some workloads,
but doesn't work well for some other workloads. We can always improve
the implementation to make it work better for more workloads.
From another point of view, even with your and Chris's improvements, mTHP
swap entry allocation will still fall back for quite a few workloads.
Take the page allocator as an example. Even in the current kernel, THP
allocation may fall back for quite a few workloads because of page
fragmentation. But I don't think that means THP doesn't work now.
> This series fixes the problem in a minimal
> way - if you ignore the last patch, which really is just a performance
> optimization and could be dropped.
>
> If we can ultimately get Chris's series to 0% fallback like this one, and
> everyone is happy with the current state for v6.10, then agreed - let's
> concentrate on Chris's series for v6.11.
[snip]
--
Best Regards,
Huang, Ying
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Barry Song @ 2024-06-19 9:11 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Hugh Dickins, David Hildenbrand, linux-kernel, linux-mm,
Shuai Yuan
On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
> entry allocator. But Barry posted a test program and results at [2] showing that
> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
> some cases). I was interested in why that might be and ended up putting this PoC
> patch set together to try to get a better understanding. This series ends up
> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
> much testing beyond that (yet) but thought it was worth posting on the strength
> of that result alone.
>
> At a high level this works in a similar way to Chris's series; it marks a
> cluster as being for a particular order and if a new cluster cannot be allocated
> then it scans through the existing non-full clusters. But it does it by scanning
> through the clusters rather than assembling them into a list. Cluster flags are
> used to mark clusters that have been scanned and are known not to have enough
> contiguous space, so the efficiency should be similar in practice.
>
> Because its not based around a linked list, there is less churn and I'm
> wondering if this is perhaps easier to review and potentially even get into
> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
> for Chris's series? I know Chris has a larger roadmap of improvements, so at
> best I see this as a tactical fix that will ultimately be superseeded by Chris's
> work.
>
> There are a few differences to note vs Chris's series:
>
> - order-0 fallback scanning is still allowed in any cluster; the argument in the
> past was that swap should always use all the swap space, so I've left this
> mechanism in. It is only a fallback though; first the the new per-order
> scanner is invoked, even for order-0, so if there are free slots in clusters
> already assigned for order-0, then the allocation will go there.
>
> - CPUs can steal slots from other CPU's current clusters; those clusters remain
> scannable while they are current for a CPU and are only made unscannable when
> no more CPUs are scanning that particular cluster.
>
> - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
> as I understand it, the original intent of a per-cpu current cluster was to
> get pages for an application adjacent in the swap to speed up IO.
>
> I'd be keen to hear if you think we could get something like this into v6.10 to
> fix the mess - I'm willing to work quickly to address comments and do more
> testing. If not, then this is probably just a distraction and we should
> concentrate on Chris's series.
Ryan, thank you very much for accomplishing this.
I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of
add_to_swap() for both your approach and Chris's. I will update you with
the results ASAP.
I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has
caused a couple of crashes.
>
> This applies on top of v6.10-rc4.
>
> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/
> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (5):
> mm: swap: Simplify end-of-cluster calculation
> mm: swap: Change SWAP_NEXT_INVALID to highest value
> mm: swap: Track allocation order for clusters
> mm: swap: Scan for free swap entries in allocated clusters
> mm: swap: Optimize per-order cluster scanning
>
> include/linux/swap.h | 18 +++--
> mm/swapfile.c | 164 ++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 157 insertions(+), 25 deletions(-)
>
> --
> 2.43.0
>
Thanks
Barry
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Ryan Roberts @ 2024-06-19 9:17 UTC (permalink / raw)
To: Barry Song
Cc: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Hugh Dickins, David Hildenbrand, linux-kernel, linux-mm,
Shuai Yuan
On 19/06/2024 10:11, Barry Song wrote:
> On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
>> entry allocator. But Barry posted a test program and results at [2] showing that
>> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
>> some cases). I was interested in why that might be and ended up putting this PoC
>> patch set together to try to get a better understanding. This series ends up
>> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
>> much testing beyond that (yet) but thought it was worth posting on the strength
>> of that result alone.
>>
>> At a high level this works in a similar way to Chris's series; it marks a
>> cluster as being for a particular order and if a new cluster cannot be allocated
>> then it scans through the existing non-full clusters. But it does it by scanning
>> through the clusters rather than assembling them into a list. Cluster flags are
>> used to mark clusters that have been scanned and are known not to have enough
>> contiguous space, so the efficiency should be similar in practice.
>>
>> Because its not based around a linked list, there is less churn and I'm
>> wondering if this is perhaps easier to review and potentially even get into
>> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
>> for Chris's series? I know Chris has a larger roadmap of improvements, so at
>> best I see this as a tactical fix that will ultimately be superseeded by Chris's
>> work.
>>
>> There are a few differences to note vs Chris's series:
>>
>> - order-0 fallback scanning is still allowed in any cluster; the argument in the
>> past was that swap should always use all the swap space, so I've left this
>> mechanism in. It is only a fallback though; first the the new per-order
>> scanner is invoked, even for order-0, so if there are free slots in clusters
>> already assigned for order-0, then the allocation will go there.
>>
>> - CPUs can steal slots from other CPU's current clusters; those clusters remain
>> scannable while they are current for a CPU and are only made unscannable when
>> no more CPUs are scanning that particular cluster.
>>
>> - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
>> as I understand it, the original intent of a per-cpu current cluster was to
>> get pages for an application adjacent in the swap to speed up IO.
>>
>> I'd be keen to hear if you think we could get something like this into v6.10 to
>> fix the mess - I'm willing to work quickly to address comments and do more
>> testing. If not, then this is probably just a distraction and we should
>> concentrate on Chris's series.
>
> Ryan, thank you very much for accomplishing this.
>
> I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of
> add_to_swap() for both your approach and Chris's. I will update you with
> the results ASAP.
Ahh great - look forward to the results!
>
> I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has
> caused a couple of crashes.
>
>>
>> This applies on top of v6.10-rc4.
>>
>> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/
>> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
>>
>> Thanks,
>> Ryan
>>
>> Ryan Roberts (5):
>> mm: swap: Simplify end-of-cluster calculation
>> mm: swap: Change SWAP_NEXT_INVALID to highest value
>> mm: swap: Track allocation order for clusters
>> mm: swap: Scan for free swap entries in allocated clusters
>> mm: swap: Optimize per-order cluster scanning
>>
>> include/linux/swap.h | 18 +++--
>> mm/swapfile.c | 164 ++++++++++++++++++++++++++++++++++++++-----
>> 2 files changed, 157 insertions(+), 25 deletions(-)
>>
>> --
>> 2.43.0
>>
>
> Thanks
> Barry
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Barry Song @ 2024-06-21 8:48 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Hugh Dickins, David Hildenbrand, linux-kernel, linux-mm,
Shuai Yuan
On Wed, Jun 19, 2024 at 9:18 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/06/2024 10:11, Barry Song wrote:
> > On Wed, Jun 19, 2024 at 11:27 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi All,
> >>
> >> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
> >> entry allocator. But Barry posted a test program and results at [2] showing that
> >> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
> >> some cases). I was interested in why that might be and ended up putting this PoC
> >> patch set together to try to get a better understanding. This series ends up
> >> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
> >> much testing beyond that (yet) but thought it was worth posting on the strength
> >> of that result alone.
> >>
> >> At a high level this works in a similar way to Chris's series; it marks a
> >> cluster as being for a particular order and if a new cluster cannot be allocated
> >> then it scans through the existing non-full clusters. But it does it by scanning
> >> through the clusters rather than assembling them into a list. Cluster flags are
> >> used to mark clusters that have been scanned and are known not to have enough
> >> contiguous space, so the efficiency should be similar in practice.
> >>
> >> Because its not based around a linked list, there is less churn and I'm
> >> wondering if this is perhaps easier to review and potentially even get into
> >> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
> >> for Chris's series? I know Chris has a larger roadmap of improvements, so at
> >> best I see this as a tactical fix that will ultimately be superseeded by Chris's
> >> work.
> >>
> >> There are a few differences to note vs Chris's series:
> >>
> >> - order-0 fallback scanning is still allowed in any cluster; the argument in the
> >> past was that swap should always use all the swap space, so I've left this
> >> mechanism in. It is only a fallback though; first the the new per-order
> >> scanner is invoked, even for order-0, so if there are free slots in clusters
> >> already assigned for order-0, then the allocation will go there.
> >>
> >> - CPUs can steal slots from other CPU's current clusters; those clusters remain
> >> scannable while they are current for a CPU and are only made unscannable when
> >> no more CPUs are scanning that particular cluster.
> >>
> >> - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
> >> as I understand it, the original intent of a per-cpu current cluster was to
> >> get pages for an application adjacent in the swap to speed up IO.
> >>
> >> I'd be keen to hear if you think we could get something like this into v6.10 to
> >> fix the mess - I'm willing to work quickly to address comments and do more
> >> testing. If not, then this is probably just a distraction and we should
> >> concentrate on Chris's series.
> >
> > Ryan, thank you very much for accomplishing this.
> >
> > I am getting Shuai Yuan's (CC'd) help to collect the latency histogram of
> > add_to_swap() for both your approach and Chris's. I will update you with
> > the results ASAP.
>
> Ahh great - look forward to the results!
Essentially, we are measuring two types of latency:
* Small folio swap allocation
* Large folio swap allocation
The concept code looks like this:
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 994723cef821..a608b916ed2f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,10 +185,18 @@ bool add_to_swap(struct folio *folio)
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
+ start_time = ktime_get();
+
entry = folio_alloc_swap(folio);
if (!entry.val)
return false;
+ end_time = ktime_get();
+ if (folio_test_large(folio))
+ trace_large_swap_allocation_latency(ktime_sub(end_time, start_time));
+ else
+ trace_small_swap_allocation_latency(ktime_sub(end_time, start_time));
+
/*
* XArray node allocations from PF_MEMALLOC contexts could
* completely exhaust the page allocator. __GFP_NOMEMALLOC
Then, we'll generate histograms for both large and small allocation
latency. We're currently encountering some setup issues. Once we have
the data, I'll provide updates to you and Chris.
Additionally, I noticed some comments suggesting that Chris's patch
might negatively impact the swap allocation latency of small folios.
Perhaps the data can help clarify this.
>
> >
> > I am also anticipating Chris's V3, as V1 seems quite stable, but V2 has
> > caused a couple of crashes.
> >
> >>
> >> This applies on top of v6.10-rc4.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org/
> >> [2] https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
> >>
> >> Thanks,
> >> Ryan
> >>
> >> Ryan Roberts (5):
> >> mm: swap: Simplify end-of-cluster calculation
> >> mm: swap: Change SWAP_NEXT_INVALID to highest value
> >> mm: swap: Track allocation order for clusters
> >> mm: swap: Scan for free swap entries in allocated clusters
> >> mm: swap: Optimize per-order cluster scanning
> >>
> >> include/linux/swap.h | 18 +++--
> >> mm/swapfile.c | 164 ++++++++++++++++++++++++++++++++++++++-----
> >> 2 files changed, 157 insertions(+), 25 deletions(-)
> >>
> >> --
> >> 2.43.0
> >>
>
Thanks
Barry
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Ryan Roberts @ 2024-06-25 8:02 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand
Cc: linux-kernel, linux-mm
On 19/06/2024 00:26, Ryan Roberts wrote:
> Hi All,
>
> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
> entry allocator. But Barry posted a test program and results at [2] showing that
> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
> some cases). I was interested in why that might be and ended up putting this PoC
> patch set together to try to get a better understanding. This series ends up
> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
> much testing beyond that (yet) but thought it was worth posting on the strength
> of that result alone.
>
> At a high level this works in a similar way to Chris's series; it marks a
> cluster as being for a particular order and if a new cluster cannot be allocated
> then it scans through the existing non-full clusters. But it does it by scanning
> through the clusters rather than assembling them into a list. Cluster flags are
> used to mark clusters that have been scanned and are known not to have enough
> contiguous space, so the efficiency should be similar in practice.
>
> Because its not based around a linked list, there is less churn and I'm
> wondering if this is perhaps easier to review and potentially even get into
> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
> for Chris's series? I know Chris has a larger roadmap of improvements, so at
> best I see this as a tactical fix that will ultimately be superseeded by Chris's
> work.
>
> There are a few differences to note vs Chris's series:
>
> - order-0 fallback scanning is still allowed in any cluster; the argument in the
> past was that swap should always use all the swap space, so I've left this
> mechanism in. It is only a fallback though; first the the new per-order
> scanner is invoked, even for order-0, so if there are free slots in clusters
> already assigned for order-0, then the allocation will go there.
>
> - CPUs can steal slots from other CPU's current clusters; those clusters remain
> scannable while they are current for a CPU and are only made unscannable when
> no more CPUs are scanning that particular cluster.
>
> - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
> as I understand it, the original intent of a per-cpu current cluster was to
> get pages for an application adjacent in the swap to speed up IO.
[...]
I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allocation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense).
However, I'm running into an unexpected issue: it was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing order-0 scanner uses.
But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ?
Here is my code, for reference. Note that "evict" is always set true to be very aggressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed).
With logging added, every single call to __try_to_reclaim_swap() returns 0. (well it also live locks due to that with code as shown, but I hacked around that in experiments). Any ideas, before I dig deeper?
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f6d78723f0fd..928d61e1d9d5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
}
static inline bool swap_range_empty(char *swap_map, unsigned int start,
- unsigned int nr_pages)
+ unsigned int nr_pages, bool *inc_cached)
{
unsigned int i;
-
- for (i = 0; i < nr_pages; i++) {
- if (swap_map[start + i])
- return false;
+ bool hit_cache = false;
+
+ if (*inc_cached) {
+ for (i = 0; i < nr_pages; i++) {
+ if (swap_map[start + i]) {
+ if (swap_map[start + i] != SWAP_HAS_CACHE)
+ return false;
+ hit_cache = true;
+ }
+ }
+ *inc_cached = hit_cache;
+ } else {
+ for (i = 0; i < nr_pages; i++) {
+ if (swap_map[start + i])
+ return false;
+ }
}
return true;
@@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci;
unsigned int tmp, max;
unsigned int stop = SWAP_NEXT_INVALID;
+ bool evict = true;
new_cluster:
cluster = this_cpu_ptr(si->percpu_cluster);
@@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
* natural alignment.
*/
max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
- ci = lock_cluster(si, tmp);
- while (tmp < max) {
- if (swap_range_empty(si->swap_map, tmp, nr_pages))
- break;
- tmp += nr_pages;
+scan_cluster:
+ if (tmp < max) {
+ ci = lock_cluster(si, tmp);
+ while (tmp < max) {
+ if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict))
+ break;
+ tmp += nr_pages;
+ }
+ unlock_cluster(ci);
}
- unlock_cluster(ci);
if (tmp >= max) {
cluster_dec_scanners(ci);
cluster->next[order] = SWAP_NEXT_INVALID;
goto new_cluster;
}
+ if (evict) {
+ unsigned int off;
+ int nr;
+
+ spin_unlock(&si->lock);
+
+ for (off = tmp; off < tmp + nr_pages; off += nr) {
+ nr = 1;
+ if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) {
+ nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY);
+ if (nr < 0)
+ break;
+ else if (nr == 0)
+ nr = 1;
+ nr = ALIGN(off + 1, nr) - off;
+ }
+ }
+
+ spin_lock(&si->lock);
+ *scan_base = this_cpu_read(*si->cluster_next_cpu);
+ *offset = *scan_base;
+
+ if (nr < 0)
+ tmp += nr_pages;
+
+ goto scan_cluster;
+ }
*offset = tmp;
*scan_base = tmp;
tmp += nr_pages;
* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
From: Chris Li @ 2024-06-25 16:06 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand, linux-kernel,
linux-mm
On Tue, Jun 25, 2024 at 1:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/06/2024 00:26, Ryan Roberts wrote:
> > Hi All,
> >
> > Chris has been doing great work at [1] to clean up my mess in the mTHP swap
> > entry allocator. But Barry posted a test program and results at [2] showing that
> > even with Chris's changes, there are still some fallbacks (around 5% - 25% in
> > some cases). I was interested in why that might be and ended up putting this PoC
> > patch set together to try to get a better understanding. This series ends up
> > achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
> > much testing beyond that (yet) but thought it was worth posting on the strength
> > of that result alone.
> >
> > At a high level this works in a similar way to Chris's series; it marks a
> > cluster as being for a particular order and if a new cluster cannot be allocated
> > then it scans through the existing non-full clusters. But it does it by scanning
> > through the clusters rather than assembling them into a list. Cluster flags are
> > used to mark clusters that have been scanned and are known not to have enough
> > contiguous space, so the efficiency should be similar in practice.
> >
> > Because its not based around a linked list, there is less churn and I'm
> > wondering if this is perhaps easier to review and potentially even get into
> > v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
> > for Chris's series? I know Chris has a larger roadmap of improvements, so at
> > best I see this as a tactical fix that will ultimately be superseeded by Chris's
> > work.
> >
> > There are a few differences to note vs Chris's series:
> >
> > - order-0 fallback scanning is still allowed in any cluster; the argument in the
> > past was that swap should always use all the swap space, so I've left this
> > mechanism in. It is only a fallback though; first the the new per-order
> > scanner is invoked, even for order-0, so if there are free slots in clusters
> > already assigned for order-0, then the allocation will go there.
> >
> > - CPUs can steal slots from other CPU's current clusters; those clusters remain
> > scannable while they are current for a CPU and are only made unscannable when
> > no more CPUs are scanning that particular cluster.
> >
> > - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
> > as I understand it, the original intent of a per-cpu current cluster was to
> > get pages for an application adjacent in the swap to speed up IO.
>
> [...]
>
> I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allocation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense).
>
> However, I'm running into an unexpected issue: it was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing order-0 scanner uses.
>
> But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ?
I saw that in my test too. Some swap entries remain as SWAP_HAS_CACHE
which contribute to the swap allocation failure rate.
One possibility is that zram is hitting the synchronous IO case for
swap-in, which does not put the folio in the swap cache when serving the
swap-in. In commit 13ddaf26be324a7f951891ecd9ccd04466d27458 from
Kairui, the SWAP_HAS_CACHE flag was added for synchronous IO, in order
to address a race during swap-in. But SWAP_HAS_CACHE should be
cleaned up after the swap-in fault completes. I did not have a chance to dig
deeper into the root cause of those SWAP_HAS_CACHE entries yet.
Chris
>
> Here is my code, for reference. Note that "evict" is always set true to be very aggressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed).
>
> With logging added, every single call to __try_to_reclaim_swap() returns 0. (well it also live locks due to that with code as shown, but I hacked around that in experiments). Any ideas, before I dig deeper?
>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6d78723f0fd..928d61e1d9d5 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> }
>
> static inline bool swap_range_empty(char *swap_map, unsigned int start,
> - unsigned int nr_pages)
> + unsigned int nr_pages, bool *inc_cached)
> {
> unsigned int i;
> -
> - for (i = 0; i < nr_pages; i++) {
> - if (swap_map[start + i])
> - return false;
> + bool hit_cache = false;
> +
> + if (*inc_cached) {
> + for (i = 0; i < nr_pages; i++) {
> + if (swap_map[start + i]) {
> + if (swap_map[start + i] != SWAP_HAS_CACHE)
> + return false;
> + hit_cache = true;
> + }
> + }
> + *inc_cached = hit_cache;
> + } else {
> + for (i = 0; i < nr_pages; i++) {
> + if (swap_map[start + i])
> + return false;
> + }
> }
>
> return true;
> @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> struct swap_cluster_info *ci;
> unsigned int tmp, max;
> unsigned int stop = SWAP_NEXT_INVALID;
> + bool evict = true;
>
> new_cluster:
> cluster = this_cpu_ptr(si->percpu_cluster);
> @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> * natural alignment.
> */
> max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
> - ci = lock_cluster(si, tmp);
> - while (tmp < max) {
> - if (swap_range_empty(si->swap_map, tmp, nr_pages))
> - break;
> - tmp += nr_pages;
> +scan_cluster:
> + if (tmp < max) {
> + ci = lock_cluster(si, tmp);
> + while (tmp < max) {
> + if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict))
> + break;
> + tmp += nr_pages;
> + }
> + unlock_cluster(ci);
> }
> - unlock_cluster(ci);
> if (tmp >= max) {
> cluster_dec_scanners(ci);
> cluster->next[order] = SWAP_NEXT_INVALID;
> goto new_cluster;
> }
> + if (evict) {
> + unsigned int off;
> + int nr;
> +
> + spin_unlock(&si->lock);
> +
> + for (off = tmp; off < tmp + nr_pages; off += nr) {
> + nr = 1;
> + if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) {
> + nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY);
> + if (nr < 0)
> + break;
> + else if (nr == 0)
> + nr = 1;
> + nr = ALIGN(off + 1, nr) - off;
> + }
> + }
> +
> + spin_lock(&si->lock);
> + *scan_base = this_cpu_read(*si->cluster_next_cpu);
> + *offset = *scan_base;
> +
> + if (nr < 0)
> + tmp += nr_pages;
> +
> + goto scan_cluster;
> + }
> *offset = tmp;
> *scan_base = tmp;
> tmp += nr_pages;
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
2024-06-25 16:06 ` Chris Li
@ 2024-06-25 16:36 ` Ryan Roberts
0 siblings, 0 replies; 15+ messages in thread
From: Ryan Roberts @ 2024-06-25 16:36 UTC (permalink / raw)
To: Chris Li
Cc: Andrew Morton, Kairui Song, Huang, Ying, Kalesh Singh,
Barry Song, Hugh Dickins, David Hildenbrand, linux-kernel,
linux-mm
On 25/06/2024 17:06, Chris Li wrote:
> On Tue, Jun 25, 2024 at 1:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 19/06/2024 00:26, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Chris has been doing great work at [1] to clean up my mess in the mTHP swap
>>> entry allocator. But Barry posted a test program and results at [2] showing that
>>> even with Chris's changes, there are still some fallbacks (around 5% - 25% in
>>> some cases). I was interested in why that might be and ended up putting this PoC
>>> patch set together to try to get a better understanding. This series ends up
>>> achieving 0% fallback, even with small folios ("-s") enabled. I haven't done
>>> much testing beyond that (yet) but thought it was worth posting on the strength
>>> of that result alone.
>>>
>>> At a high level this works in a similar way to Chris's series; it marks a
>>> cluster as being for a particular order and if a new cluster cannot be allocated
>>> then it scans through the existing non-full clusters. But it does it by scanning
>>> through the clusters rather than assembling them into a list. Cluster flags are
>>> used to mark clusters that have been scanned and are known not to have enough
>>> contiguous space, so the efficiency should be similar in practice.
>>>
>>> Because it's not based around a linked list, there is less churn and I'm
>>> wondering if this is perhaps easier to review and potentially even get into
>>> v6.10-rcX to fix up what's already there, rather than having to wait until v6.11
>>> for Chris's series? I know Chris has a larger roadmap of improvements, so at
>>> best I see this as a tactical fix that will ultimately be superseded by Chris's
>>> work.
>>>
>>> There are a few differences to note vs Chris's series:
>>>
>>> - order-0 fallback scanning is still allowed in any cluster; the argument in the
>>> past was that swap should always use all the swap space, so I've left this
>>> mechanism in. It is only a fallback though; first the new per-order
>>> scanner is invoked, even for order-0, so if there are free slots in clusters
>>> already assigned for order-0, then the allocation will go there.
>>>
>>> - CPUs can steal slots from other CPUs' current clusters; those clusters remain
>>> scannable while they are current for a CPU and are only made unscannable when
>>> no more CPUs are scanning that particular cluster.
>>>
>>> - I'm preferring to allocate a free cluster ahead of per-order scanning, since,
>>> as I understand it, the original intent of a per-cpu current cluster was to
>>> keep an application's pages adjacent in swap to speed up IO.
>>
>> [...]
>>
>> I've been having a play, trying to extend this to actively free swap entries to make sufficient space for a new allocation if the entries are already in swap cache. (To be clear, I'm not pursuing this series for inclusion, but was just trying to put some numbers to the idea, which could later be added to Chris's series if it makes sense).
>>
>> However, I'm running into an unexpected issue: it was my previous understanding that if the swap map for an entry is SWAP_HAS_CACHE, then there is a folio in the swap cache and nothing is referencing the swap entry, so the entry can be freed with __try_to_reclaim_swap() - that's the pattern that the existing order-0 scanner uses.
>>
>> But I'm finding that __try_to_reclaim_swap() always returns 0, indicating that no folio was associated with the swap entry. How can that be, if swap_map[offset] == SWAP_HAS_CACHE ?
>
> I saw that in my test too. Some swap entries remain as SWAP_HAS_CACHE,
> which contributes to the swap allocation failure rate.
>
> One possibility is that zram is hitting the synchronous IO case for
> swap-in, which does not put the folio in the swap cache when serving
> the fault. In commit 13ddaf26be324a7f951891ecd9ccd04466d27458 from
> Kairui, the SWAP_HAS_CACHE flag was added for synchronous IO in order
> to address a race in the swap-in path. But SWAP_HAS_CACHE should be
> cleaned up once the swap-in fault completes. I have not had a chance
> to dig deeper into the root cause of those SWAP_HAS_CACHE entries yet.
Oh wait, I think they are probably sitting in the swap slot cache waiting to be
freed. Once the count goes to 0 (including SWAP_HAS_CACHE),
__swap_entry_free_locked() temporarily adds SWAP_HAS_CACHE back in and
free_swap_slot() is called. The entry is buffered in that layer until the
buffer is full and all the buffered entries are flushed together. So we should
be able to trigger that flush to free some slots.
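
To illustrate what I mean, here is a simplified, single-threaded model of
that buffering layer (all names here are stand-ins, not the real kernel
code; if I'm reading mm/swap_slots.c right, the real path is
free_swap_slot() filling a per-cpu buffer and flushing it through
swapcache_free_entries() once it is full):

#include <stdio.h>

#define SLOTS_RET_SIZE 64	/* stand-in for the per-cpu buffer size */

static unsigned long slots_ret[SLOTS_RET_SIZE];
static int n_ret;

/* Stand-in for the flush: only at this point do the swap_map counts
 * (including SWAP_HAS_CACHE) actually drop and the slots become reusable. */
static void flush_slots(void)
{
	printf("flushing %d buffered slots back to the swap map\n", n_ret);
	n_ret = 0;
}

/* Stand-in for free_swap_slot(): freed entries are only buffered here, so
 * the swap map keeps showing SWAP_HAS_CACHE until the buffer fills up. */
static void free_slot(unsigned long entry)
{
	slots_ret[n_ret++] = entry;
	if (n_ret >= SLOTS_RET_SIZE)
		flush_slots();
}

int main(void)
{
	/* Free 10 entries: nothing is flushed, so a scanner looking at the
	 * swap map would still see all 10 as SWAP_HAS_CACHE. */
	for (unsigned long e = 0; e < 10; e++)
		free_slot(e);

	/* An "evict" path would want to force this flush before re-scanning. */
	flush_slots();
	return 0;
}

So rather than reclaiming via __try_to_reclaim_swap(), the evict path
probably just needs to trigger that flush first and then re-scan.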
>
> Chris
>
>>
>> Here is my code, for reference. Note that "evict" is always set to true to be very aggressive for now, but the eventual idea is that it would only be set to true when certain criteria are met (e.g. last attempt to allocate for order either failed or had to evict to succeed).
>>
>> With logging added, every single call to __try_to_reclaim_swap() returns 0 (well, it also live-locks due to that with the code as shown, but I hacked around that in my experiments). Any ideas before I dig deeper?
>>
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f6d78723f0fd..928d61e1d9d5 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> }
>>
>> static inline bool swap_range_empty(char *swap_map, unsigned int start,
>> - unsigned int nr_pages)
>> + unsigned int nr_pages, bool *inc_cached)
>> {
>> unsigned int i;
>> -
>> - for (i = 0; i < nr_pages; i++) {
>> - if (swap_map[start + i])
>> - return false;
>> + bool hit_cache = false;
>> +
>> + if (*inc_cached) {
>> + for (i = 0; i < nr_pages; i++) {
>> + if (swap_map[start + i]) {
>> + if (swap_map[start + i] != SWAP_HAS_CACHE)
>> + return false;
>> + hit_cache = true;
>> + }
>> + }
>> + *inc_cached = hit_cache;
>> + } else {
>> + for (i = 0; i < nr_pages; i++) {
>> + if (swap_map[start + i])
>> + return false;
>> + }
>> }
>>
>> return true;
>> @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> struct swap_cluster_info *ci;
>> unsigned int tmp, max;
>> unsigned int stop = SWAP_NEXT_INVALID;
>> + bool evict = true;
>>
>> new_cluster:
>> cluster = this_cpu_ptr(si->percpu_cluster);
>> @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> * natural alignment.
>> */
>> max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
>> - ci = lock_cluster(si, tmp);
>> - while (tmp < max) {
>> - if (swap_range_empty(si->swap_map, tmp, nr_pages))
>> - break;
>> - tmp += nr_pages;
>> +scan_cluster:
>> + if (tmp < max) {
>> + ci = lock_cluster(si, tmp);
>> + while (tmp < max) {
>> + if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict))
>> + break;
>> + tmp += nr_pages;
>> + }
>> + unlock_cluster(ci);
>> }
>> - unlock_cluster(ci);
>> if (tmp >= max) {
>> cluster_dec_scanners(ci);
>> cluster->next[order] = SWAP_NEXT_INVALID;
>> goto new_cluster;
>> }
>> + if (evict) {
>> + unsigned int off;
>> + int nr;
>> +
>> + spin_unlock(&si->lock);
>> +
>> + for (off = tmp; off < tmp + nr_pages; off += nr) {
>> + nr = 1;
>> + if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) {
>> + nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY);
>> + if (nr < 0)
>> + break;
>> + else if (nr == 0)
>> + nr = 1;
>> + nr = ALIGN(off + 1, nr) - off;
>> + }
>> + }
>> +
>> + spin_lock(&si->lock);
>> + *scan_base = this_cpu_read(*si->cluster_next_cpu);
>> + *offset = *scan_base;
>> +
>> + if (nr < 0)
>> + tmp += nr_pages;
>> +
>> + goto scan_cluster;
>> + }
>> *offset = tmp;
>> *scan_base = tmp;
>> tmp += nr_pages;
>>
>>
^ permalink raw reply [flat|nested] 15+ messages in thread