[PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order

linux-mm.kvack.org archive mirror

* [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
@ 2024-06-14 23:48 Chris Li
  2024-06-14 23:48 ` [PATCH v2 1/2] mm: swap: swap cluster switch to double link list Chris Li
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Chris Li @ 2024-06-14 23:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kairui Song, Ryan Roberts, Huang, Ying, Kairui Song,
	Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song

This is the short-term solution, "swap cluster order", listed
in my "Swap Abstraction" discussion, slide 8, at the recent
LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" was introduced, it only allocated mTHP swap entries
from the new empty cluster list.  This has a fragmentation issue
reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The mTHP allocation failure rate rises to almost 100% after a few
hours in Barry's test run.

The reason is that all the empty clusters have been exhausted while
there are plenty of free swap entries in clusters that are
not 100% free.

Remember the swap allocation order in the cluster.
Keep track of a per-order nonfull cluster list for later allocation.

This greatly improves the success rate of mTHP swap allocation.
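
A minimal userspace sketch of the fragmentation problem described above,
not the kernel implementation. All toy_* names are hypothetical, and the
sketch scans a small array instead of walking the real cluster lists. It
compares an "empty clusters only" policy with a policy that first reuses a
partly used cluster of the same order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NR_CLUSTERS   4
#define TOY_CLUSTER_SLOTS 512	/* stands in for SWAPFILE_CLUSTER */

/* Toy swap area; every name here is hypothetical, not kernel API. */
struct toy_swap {
	bool used[TOY_NR_CLUSTERS][TOY_CLUSTER_SLOTS];
	int order[TOY_NR_CLUSTERS];	/* order the cluster was first used for */
};

/* Number of slots in use in cluster c. */
static int toy_count(const struct toy_swap *sw, int c)
{
	int n = 0;

	for (int s = 0; s < TOY_CLUSTER_SLOTS; s++)
		n += sw->used[c][s];
	return n;
}

/*
 * Claim the first aligned free run of (1 << order) slots in cluster c.
 * Returns the swap offset, or -1 if the cluster has no such run.
 */
static int toy_take_run(struct toy_swap *sw, int c, int order)
{
	int nr = 1 << order;

	for (int s = 0; s + nr <= TOY_CLUSTER_SLOTS; s += nr) {
		int i;

		for (i = 0; i < nr && !sw->used[c][s + i]; i++)
			;
		if (i < nr)
			continue;	/* run is partly in use */
		for (i = 0; i < nr; i++)
			sw->used[c][s + i] = true;
		sw->order[c] = order;
		return c * TOY_CLUSTER_SLOTS + s;
	}
	return -1;
}

/* Policy before this series: only completely empty clusters are eligible. */
static int toy_alloc_empty_only(struct toy_swap *sw, int order)
{
	for (int c = 0; c < TOY_NR_CLUSTERS; c++)
		if (toy_count(sw, c) == 0)
			return toy_take_run(sw, c, order);
	return -1;
}

/* Policy of this series: prefer a partly used cluster of the same order. */
static int toy_alloc_nonfull_first(struct toy_swap *sw, int order)
{
	for (int c = 0; c < TOY_NR_CLUSTERS; c++) {
		int n = toy_count(sw, c);

		if (n > 0 && n < TOY_CLUSTER_SLOTS && sw->order[c] == order) {
			int off = toy_take_run(sw, c, order);

			if (off >= 0)
				return off;
		}
	}
	return toy_alloc_empty_only(sw, order);
}

/* Leave one entry of the given order in use in every cluster. */
static void toy_fragment(struct toy_swap *sw, int order)
{
	memset(sw, 0, sizeof(*sw));
	for (int c = 0; c < TOY_NR_CLUSTERS; c++)
		toy_take_run(sw, c, order);
}
```

After toy_fragment(&sw, 4) no cluster is empty, so toy_alloc_empty_only()
fails for order 4 even though most slots are free, while
toy_alloc_nonfull_first() still finds an aligned run in the first cluster.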

There are some test numbers in the v1 thread of this series:
https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org

Reported-by: Barry Song <21cnbao@gmail.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v2:
- Add the cluster state field to track the different phases of
  cluster allocations.
- Rename "next" to "list" for the list field, suggested by Ying.
- Update comment for the locking rules for cluster fields and list,
  suggested by Ying.
- The nonfull list excludes clusters that are a per-CPU active cluster.
- Allocate from the nonfull list before attempting the free list, suggested
  by Kairui.
- Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org

---
Chris Li (2):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list

 include/linux/swap.h |  31 +++---
 mm/swapfile.c        | 270 ++++++++++++++++++---------------------------------
 2 files changed, 107 insertions(+), 194 deletions(-)
---
base-commit: 19b8422c5bd56fb5e7085995801c6543a98bda1f
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
-- 
Chris Li <chrisl@kernel.org>




* [PATCH v2 1/2] mm: swap: swap cluster switch to double link list
  2024-06-14 23:48 [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
@ 2024-06-14 23:48 ` Chris Li
  2024-06-17  6:19   ` Huang, Ying
  2024-06-14 23:48 ` [PATCH v2 2/2] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Chris Li @ 2024-06-14 23:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kairui Song, Ryan Roberts, Huang, Ying, Kairui Song,
	Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song

Previously, the swap cluster used a cluster index as a pointer
to construct a custom singly linked list type "swap_cluster_list".
The next cluster pointer is shared with cluster->count, which
prevents putting a non-free cluster on a list.
Change the cluster to use the standard doubly linked list instead.
This allows tracking the nonfull clusters in the follow-up patch.

Remove the cluster getter/setter for accessing the cluster
struct member.

The list operation is protected by the swap_info_struct->lock.

Change the cluster code to use "struct swap_cluster_info *" to
reference the cluster rather than an index. That is more
consistent with the list manipulation. It avoids repeatedly
adding the index to cluster_info. The code is easier to understand.
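
A minimal userspace sketch of the pointer-based approach described above,
assuming a hand-rolled list in place of the kernel's <linux/list.h>. The
toy_* and list_* helper names below are hypothetical. It shows an entry
being popped from an intrusive doubly linked list and its index recovered
by pointer subtraction, in the same way the patch computes
"ci - si->cluster_info".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_push_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_unlink(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;
}

/* Hypothetical cluster descriptor with an embedded list node. */
struct toy_cluster {
	unsigned int count;
	struct list_head list;
};

/* Map a list node back to the cluster that embeds it. */
#define toy_entry(ptr) \
	((struct toy_cluster *)((char *)(ptr) - offsetof(struct toy_cluster, list)))

/* Remove and return the first cluster on the list, or NULL if it is empty. */
static struct toy_cluster *toy_pop_first(struct list_head *h)
{
	struct toy_cluster *ci;

	if (list_is_empty(h))
		return NULL;
	ci = toy_entry(h->next);
	list_unlink(&ci->list);
	return ci;
}

/* Recover the cluster index from the pointer when an offset is needed. */
static size_t toy_index(const struct toy_cluster *base,
			const struct toy_cluster *ci)
{
	return (size_t)(ci - base);
}
```

An empty list needs no special flag in this form: the head simply points
to itself, which is what list_is_empty() checks.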

Remove the "cluster next pointer is NULL" flag; the doubly linked
list handles the empty list well.

The "swap_cluster_info" struct is two pointers bigger. Because
512 swap entries share one cluster struct, this has very little impact
on the average memory usage per swap entry. For a 1TB swapfile, the
swap cluster data structure increases from 8MB to 24MB.
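
A small worked calculation for the paragraph above, as a sketch only. It
assumes 4 KiB pages and takes the per-cluster struct size as a parameter,
because sizeof(struct swap_cluster_info) depends on architecture and
kernel configuration. The toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE     4096ULL	/* assumed 4 KiB pages */
#define TOY_CLUSTER_PAGES 512ULL	/* swap entries per cluster */

/* Number of clusters needed to cover a swapfile of swap_bytes bytes. */
static uint64_t toy_nr_clusters(uint64_t swap_bytes)
{
	return swap_bytes / TOY_PAGE_SIZE / TOY_CLUSTER_PAGES;
}

/* Total size of the cluster array for a given per-cluster struct size. */
static uint64_t toy_metadata_bytes(uint64_t swap_bytes,
				   uint64_t per_cluster_bytes)
{
	return toy_nr_clusters(swap_bytes) * per_cluster_bytes;
}
```

Under these assumptions a 1 TiB swapfile has 524288 clusters. As an
illustration, hypothetical per-cluster sizes of 16 and 48 bytes give
8 MiB and 24 MiB; these inputs are chosen for the arithmetic and are not
measured struct sizes.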

Other than the list conversion, there is no real functional change
in this patch.

Signed-off-by: Chris Li <chrisl@kernel.org>
---
 include/linux/swap.h |  28 +++----
 mm/swapfile.c        | 227 +++++++++++++--------------------------------------
 2 files changed, 70 insertions(+), 185 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3df75d62a835..cd9154a3e934 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -242,23 +242,22 @@ enum {
  * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
  * free clusters are organized into a list. We fetch an entry from the list to
  * get a free cluster.
- *
- * The data field stores next cluster if the cluster is free or cluster usage
- * counter otherwise. The flags field determines if a cluster is free. This is
- * protected by swap_info_struct.lock.
  */
 struct swap_cluster_info {
 	spinlock_t lock;	/*
-				 * Protect swap_cluster_info fields
-				 * and swap_info_struct->swap_map
+				 * Protect swap_cluster_info count and state
+				 * field and swap_info_struct->swap_map
 				 * elements correspond to the swap
 				 * cluster
 				 */
-	unsigned int data:24;
-	unsigned int flags:8;
+	unsigned int count:12;
+	unsigned int state:3;
+	struct list_head list;	/* Protected by swap_info_struct->lock */
 };
-#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+
+#define CLUSTER_STATE_FREE	1 /* This cluster is free */
+#define CLUSTER_STATE_PER_CPU	2 /* This cluster on per_cpu_cluster  */
+
 
 /*
  * The first page in the swap file is the swap header, which is always marked
@@ -283,11 +282,6 @@ struct percpu_cluster {
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
-struct swap_cluster_list {
-	struct swap_cluster_info head;
-	struct swap_cluster_info tail;
-};
-
 /*
  * The in-memory structure used to track swap areas.
  */
@@ -300,7 +294,7 @@ struct swap_info_struct {
 	unsigned int	max;		/* extent of the swap_map */
 	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
-	struct swap_cluster_list free_clusters; /* free clusters list */
+	struct list_head free_clusters; /* free clusters list */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
@@ -331,7 +325,7 @@ struct swap_info_struct {
 					 * list.
 					 */
 	struct work_struct discard_work; /* discard worker */
-	struct swap_cluster_list discard_clusters; /* discard clusters list */
+	struct list_head discard_clusters; /* discard clusters list */
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
 					   * entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9c6d8e557c0f..2f878b374349 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -290,62 +290,9 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 #endif
 #define LATENCY_LIMIT		256
 
-static inline void cluster_set_flag(struct swap_cluster_info *info,
-	unsigned int flag)
-{
-	info->flags = flag;
-}
-
-static inline unsigned int cluster_count(struct swap_cluster_info *info)
-{
-	return info->data;
-}
-
-static inline void cluster_set_count(struct swap_cluster_info *info,
-				     unsigned int c)
-{
-	info->data = c;
-}
-
-static inline void cluster_set_count_flag(struct swap_cluster_info *info,
-					 unsigned int c, unsigned int f)
-{
-	info->flags = f;
-	info->data = c;
-}
-
-static inline unsigned int cluster_next(struct swap_cluster_info *info)
-{
-	return info->data;
-}
-
-static inline void cluster_set_next(struct swap_cluster_info *info,
-				    unsigned int n)
-{
-	info->data = n;
-}
-
-static inline void cluster_set_next_flag(struct swap_cluster_info *info,
-					 unsigned int n, unsigned int f)
-{
-	info->flags = f;
-	info->data = n;
-}
-
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_FREE;
-}
-
-static inline bool cluster_is_null(struct swap_cluster_info *info)
-{
-	return info->flags & CLUSTER_FLAG_NEXT_NULL;
-}
-
-static inline void cluster_set_null(struct swap_cluster_info *info)
-{
-	info->flags = CLUSTER_FLAG_NEXT_NULL;
-	info->data = 0;
+	return info->state == CLUSTER_STATE_FREE;
 }
 
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
 		spin_unlock(&si->lock);
 }
 
-static inline bool cluster_list_empty(struct swap_cluster_list *list)
-{
-	return cluster_is_null(&list->head);
-}
-
-static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
-{
-	return cluster_next(&list->head);
-}
-
-static void cluster_list_init(struct swap_cluster_list *list)
-{
-	cluster_set_null(&list->head);
-	cluster_set_null(&list->tail);
-}
-
-static void cluster_list_add_tail(struct swap_cluster_list *list,
-				  struct swap_cluster_info *ci,
-				  unsigned int idx)
-{
-	if (cluster_list_empty(list)) {
-		cluster_set_next_flag(&list->head, idx, 0);
-		cluster_set_next_flag(&list->tail, idx, 0);
-	} else {
-		struct swap_cluster_info *ci_tail;
-		unsigned int tail = cluster_next(&list->tail);
-
-		/*
-		 * Nested cluster lock, but both cluster locks are
-		 * only acquired when we held swap_info_struct->lock
-		 */
-		ci_tail = ci + tail;
-		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
-		cluster_set_next(ci_tail, idx);
-		spin_unlock(&ci_tail->lock);
-		cluster_set_next_flag(&list->tail, idx, 0);
-	}
-}
-
-static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
-					   struct swap_cluster_info *ci)
-{
-	unsigned int idx;
-
-	idx = cluster_next(&list->head);
-	if (cluster_next(&list->tail) == idx) {
-		cluster_set_null(&list->head);
-		cluster_set_null(&list->tail);
-	} else
-		cluster_set_next_flag(&list->head,
-				      cluster_next(&ci[idx]), 0);
-
-	return idx;
-}
-
 /* Add a cluster to discard list and schedule it to do discard */
 static void swap_cluster_schedule_discard(struct swap_info_struct *si,
-		unsigned int idx)
+		struct swap_cluster_info *ci)
 {
+	unsigned int idx = ci - si->cluster_info;
 	/*
 	 * If scan_swap_map_slots() can't find a free cluster, it will check
 	 * si->swap_map directly. To make sure the discarding cluster isn't
@@ -462,17 +355,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
 
-	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
-
+	list_add_tail(&ci->list, &si->discard_clusters);
 	schedule_work(&si->discard_work);
 }
 
-static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	struct swap_cluster_info *ci = si->cluster_info;
-
-	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
-	cluster_list_add_tail(&si->free_clusters, ci, idx);
+	ci->state = CLUSTER_STATE_FREE;
+	list_add_tail(&ci->list, &si->free_clusters);
 }
 
 /*
@@ -481,21 +371,22 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
 */
 static void swap_do_scheduled_discard(struct swap_info_struct *si)
 {
-	struct swap_cluster_info *info, *ci;
+	struct swap_cluster_info *ci;
 	unsigned int idx;
 
-	info = si->cluster_info;
-
-	while (!cluster_list_empty(&si->discard_clusters)) {
-		idx = cluster_list_del_first(&si->discard_clusters, info);
+	while (!list_empty(&si->discard_clusters)) {
+		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+		list_del(&ci->list);
+		idx = ci - si->cluster_info;
 		spin_unlock(&si->lock);
 
 		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
-		__free_cluster(si, idx);
+
+		spin_lock(&ci->lock);
+		__free_cluster(si, ci);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 		unlock_cluster(ci);
@@ -521,20 +412,19 @@ static void swap_users_ref_free(struct percpu_ref *ref)
 	complete(&si->comp);
 }
 
-static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
 {
-	struct swap_cluster_info *ci = si->cluster_info;
+	struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
 
-	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
-	cluster_list_del_first(&si->free_clusters, ci);
-	cluster_set_count_flag(ci + idx, 0, 0);
+	VM_BUG_ON(ci - si->cluster_info != idx);
+	list_del(&ci->list);
+	ci->count = 0;
+	return ci;
 }
 
-static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	struct swap_cluster_info *ci = si->cluster_info + idx;
-
-	VM_BUG_ON(cluster_count(ci) != 0);
+	VM_BUG_ON(ci->count != 0);
 	/*
 	 * If the swap is discardable, prepare discard the cluster
 	 * instead of free it immediately. The cluster will be freed
@@ -542,11 +432,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 	 */
 	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
 	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-		swap_cluster_schedule_discard(si, idx);
+		swap_cluster_schedule_discard(si, ci);
 		return;
 	}
 
-	__free_cluster(si, idx);
+	__free_cluster(si, ci);
 }
 
 /*
@@ -559,15 +449,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
 	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci = cluster_info + idx;
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx]))
+	if (cluster_is_free(ci))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
-	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + count);
+	VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
+	ci->count += count;
 }
 
 /*
@@ -581,24 +471,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 }
 
 /*
- * The cluster corresponding to page_nr decreases one usage. If the usage
- * counter becomes 0, which means no page in the cluster is in using, we can
- * optionally discard the cluster and add it to free cluster list.
+ * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * which means no page in the cluster is in using, we can optionally discard
+ * the cluster and add it to free cluster list.
  */
-static void dec_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
 {
-	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
-
-	if (!cluster_info)
+	if (!p->cluster_info)
 		return;
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
-	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) - 1);
+	VM_BUG_ON(ci->count == 0);
+	ci->count--;
 
-	if (cluster_count(&cluster_info[idx]) == 0)
-		free_cluster(p, idx);
+	if (!ci->count)
+		free_cluster(p, ci);
 }
 
 /*
@@ -611,10 +497,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 {
 	struct percpu_cluster *percpu_cluster;
 	bool conflict;
-
+	struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
 	offset /= SWAPFILE_CLUSTER;
-	conflict = !cluster_list_empty(&si->free_clusters) &&
-		offset != cluster_list_first(&si->free_clusters) &&
+	conflict = !list_empty(&si->free_clusters) &&
+		offset !=  first - si->cluster_info &&
 		cluster_is_free(&si->cluster_info[offset]);
 
 	if (!conflict)
@@ -655,10 +541,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	cluster = this_cpu_ptr(si->percpu_cluster);
 	tmp = cluster->next[order];
 	if (tmp == SWAP_NEXT_INVALID) {
-		if (!cluster_list_empty(&si->free_clusters)) {
-			tmp = cluster_next(&si->free_clusters.head) *
-					SWAPFILE_CLUSTER;
-		} else if (!cluster_list_empty(&si->discard_clusters)) {
+		if (!list_empty(&si->free_clusters)) {
+			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+			list_del(&ci->list);
+			spin_lock(&ci->lock);
+			ci->state = CLUSTER_STATE_PER_CPU;
+			spin_unlock(&ci->lock);
+			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+		} else if (!list_empty(&si->discard_clusters)) {
 			/*
 			 * we don't have free cluster but have some clusters in
 			 * discarding, do discard now and reclaim them, then
@@ -1062,8 +952,8 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 	ci = lock_cluster(si, offset);
 	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
-	cluster_set_count_flag(ci, 0, 0);
-	free_cluster(si, idx);
+	ci->count = 0;
+	free_cluster(si, ci);
 	unlock_cluster(ci);
 	swap_range_free(si, offset, SWAPFILE_CLUSTER);
 }
@@ -1336,7 +1226,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	count = p->swap_map[offset];
 	VM_BUG_ON(count != SWAP_HAS_CACHE);
 	p->swap_map[offset] = 0;
-	dec_cluster_info_page(p, p->cluster_info, offset);
+	dec_cluster_info_page(p, ci);
 	unlock_cluster(ci);
 
 	mem_cgroup_uncharge_swap(entry, 1);
@@ -3003,8 +2893,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
-	cluster_list_init(&p->free_clusters);
-	cluster_list_init(&p->discard_clusters);
+	INIT_LIST_HEAD(&p->free_clusters);
+	INIT_LIST_HEAD(&p->discard_clusters);
 
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
@@ -3055,14 +2945,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
 		j = (k + col) % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+			struct swap_cluster_info *ci;
 			idx = i * SWAP_CLUSTER_COLS + j;
+			ci = cluster_info + idx;
 			if (idx >= nr_clusters)
 				continue;
-			if (cluster_count(&cluster_info[idx]))
+			if (ci->count)
 				continue;
-			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-			cluster_list_add_tail(&p->free_clusters, cluster_info,
-					      idx);
+			ci->state = CLUSTER_STATE_FREE;
+			list_add_tail(&ci->list, &p->free_clusters);
 		}
 	}
 	return nr_extents;

-- 
2.45.2.627.g7a2c4fd464-goog




* [PATCH v2 2/2] mm: swap: mTHP allocate swap entries from nonfull list
  2024-06-14 23:48 [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
  2024-06-14 23:48 ` [PATCH v2 1/2] mm: swap: swap cluster switch to double link list Chris Li
@ 2024-06-14 23:48 ` Chris Li
  2024-06-15  1:06 ` [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Andrew Morton
  2024-06-18 13:08 ` David Hildenbrand
  3 siblings, 0 replies; 22+ messages in thread
From: Chris Li @ 2024-06-14 23:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kairui Song, Ryan Roberts, Huang, Ying, Kairui Song,
	Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song

Track nonfull clusters as well as empty clusters
on lists. Each order has one nonfull cluster list.

A cluster remembers which order it was used for when it was
allocated as a new cluster.

When the cluster has a free entry, add it to the nonfull[order]
list. Allocation uses the nonfull cluster list before using
the free cluster list.

This improves the mTHP swap allocation success rate.
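
A minimal sketch of the cluster life cycle described above, assuming the
state values from the CLUSTER_STATE_* defines in this patch. The toy_*
names and functions are hypothetical, and list handling, locking and the
discard path are left out.

```c
#include <assert.h>

/* Values mirror the CLUSTER_STATE_* defines; the names are hypothetical. */
enum toy_state {
	TOY_STATE_FREE = 1,	/* on the free list */
	TOY_STATE_PER_CPU = 2,	/* active cluster of some CPU */
	TOY_STATE_SCANNED = 3,	/* scanned to the end, on no list */
	TOY_STATE_NONFULL = 4,	/* on nonfull[order] */
};

struct toy_ci {
	unsigned int count;	/* slots in use */
	unsigned int order;	/* order this cluster serves */
	enum toy_state state;
};

/* A CPU picks the cluster up from the nonfull list or the free list. */
static void toy_take(struct toy_ci *ci, unsigned int order)
{
	assert(ci->state == TOY_STATE_FREE || ci->state == TOY_STATE_NONFULL);
	if (ci->state == TOY_STATE_FREE)
		ci->order = order;	/* order is fixed on first use */
	else
		assert(ci->order == order);
	ci->state = TOY_STATE_PER_CPU;
}

/* The allocation scan reached the end of the cluster. */
static void toy_scan_done(struct toy_ci *ci)
{
	assert(ci->state == TOY_STATE_PER_CPU);
	ci->state = TOY_STATE_SCANNED;
}

/* One swap entry in the cluster is freed. */
static void toy_put(struct toy_ci *ci)
{
	assert(ci->count > 0);
	if (--ci->count == 0) {
		/* Completely empty: back to the free list, order forgotten. */
		ci->state = TOY_STATE_FREE;
		ci->order = 0;
	} else if (ci->state == TOY_STATE_SCANNED) {
		/* First free after a full scan: eligible for reuse. */
		ci->state = TOY_STATE_NONFULL;
	}
}
```

In this sketch a cluster only changes its order after passing through
TOY_STATE_FREE, which is the reason a partly used cluster cannot be
reused by a different order.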

There are limitations if the distribution of the number of
mTHPs of different orders changes a lot, e.g. there are a lot
of nonfull clusters assigned to order A while later on there
are a lot of order B allocations and very few allocations
of order A. Currently a cluster used by order A will not be
reused by order B unless the cluster is 100% empty.

Signed-off-by: Chris Li <chrisl@kernel.org>
---
 include/linux/swap.h |  9 +++++++--
 mm/swapfile.c        | 49 ++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index cd9154a3e934..fcb21f9883a5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -245,18 +245,21 @@ enum {
  */
 struct swap_cluster_info {
 	spinlock_t lock;	/*
-				 * Protect swap_cluster_info count and state
-				 * field and swap_info_struct->swap_map
+				 * Protect swap_cluster_info bitfields
+				 * and swap_info_struct->swap_map
 				 * elements correspond to the swap
 				 * cluster
 				 */
 	unsigned int count:12;
 	unsigned int state:3;
+	unsigned int order:4;
 	struct list_head list;	/* Protected by swap_info_struct->lock */
 };
 
 #define CLUSTER_STATE_FREE	1 /* This cluster is free */
 #define CLUSTER_STATE_PER_CPU	2 /* This cluster on per_cpu_cluster  */
+#define CLUSTER_STATE_SCANNED	3 /* This cluster off per_cpu_cluster */
+#define CLUSTER_STATE_NONFULL	4 /* This cluster is on nonfull list */
 
 
 /*
@@ -295,6 +298,8 @@ struct swap_info_struct {
 	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
 	struct list_head free_clusters; /* free clusters list */
+	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+					/* list of cluster that contains at least one free slot */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2f878b374349..85a96178fd27 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -361,8 +361,12 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
+	if (ci->state == CLUSTER_STATE_NONFULL)
+		list_move_tail(&ci->list, &si->free_clusters);
+	else
+		list_add_tail(&ci->list, &si->free_clusters);
 	ci->state = CLUSTER_STATE_FREE;
-	list_add_tail(&ci->list, &si->free_clusters);
+	ci->order = 0;
 }
 
 /*
@@ -484,7 +488,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
 	ci->count--;
 
 	if (!ci->count)
-		free_cluster(p, ci);
+		return free_cluster(p, ci);
+
+	if (ci->state == CLUSTER_STATE_SCANNED) {
+		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+		ci->state = CLUSTER_STATE_NONFULL;
+	}
 }
 
 /*
@@ -535,17 +544,25 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned int nr_pages = 1 << order;
 	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
-	unsigned int tmp, max;
+	unsigned int tmp, max, found = 0;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
 	tmp = cluster->next[order];
 	if (tmp == SWAP_NEXT_INVALID) {
-		if (!list_empty(&si->free_clusters)) {
+		if (!list_empty(&si->nonfull_clusters[order])) {
+			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
+			list_del(&ci->list);
+			spin_lock(&ci->lock);
+			ci->state = CLUSTER_STATE_PER_CPU;
+			spin_unlock(&ci->lock);
+			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+		} else if (!list_empty(&si->free_clusters)) {
 			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
 			list_del(&ci->list);
 			spin_lock(&ci->lock);
 			ci->state = CLUSTER_STATE_PER_CPU;
+			ci->order = order;
 			spin_unlock(&ci->lock);
 			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
 		} else if (!list_empty(&si->discard_clusters)) {
@@ -570,21 +587,24 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
-		while (tmp < max) {
+		while (!found && tmp < max) {
 			if (swap_range_empty(si->swap_map, tmp, nr_pages))
-				break;
+				found = tmp;
 			tmp += nr_pages;
 		}
+		if (tmp >= max) {
+			ci->state = CLUSTER_STATE_SCANNED;
+			cluster->next[order] = SWAP_NEXT_INVALID;
+		} else
+			cluster->next[order] = tmp;
+		WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
 		unlock_cluster(ci);
 	}
-	if (tmp >= max) {
-		cluster->next[order] = SWAP_NEXT_INVALID;
+	if (!found)
 		goto new_cluster;
-	}
-	*offset = tmp;
-	*scan_base = tmp;
-	tmp += nr_pages;
-	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
+
+	*offset = found;
+	*scan_base = found;
 	return true;
 }
 
@@ -2896,6 +2916,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	INIT_LIST_HEAD(&p->free_clusters);
 	INIT_LIST_HEAD(&p->discard_clusters);
 
+	for (i = 0; i < SWAP_NR_ORDERS; i++)
+		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 		if (page_nr == 0 || page_nr > swap_header->info.last_page)

-- 
2.45.2.627.g7a2c4fd464-goog




* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-14 23:48 [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
  2024-06-14 23:48 ` [PATCH v2 1/2] mm: swap: swap cluster switch to double link list Chris Li
  2024-06-14 23:48 ` [PATCH v2 2/2] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
@ 2024-06-15  1:06 ` Andrew Morton
  2024-06-15  2:51   ` Chris Li
  2024-06-18 13:08 ` David Hildenbrand
  3 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2024-06-15  1:06 UTC (permalink / raw)
  To: Chris Li
  Cc: Kairui Song, Ryan Roberts, Huang, Ying, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

On Fri, 14 Jun 2024 16:48:06 -0700 Chris Li <chrisl@kernel.org> wrote:

> This is the short term solutiolns "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
> 
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from new empty cluster list.  It has a fragmentation issue
> reported by Barry.
> 
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> 
> The mTHP allocation failure rate raises to almost 100% after a few
> hours in Barry's test run.
> 
> The reason is that all the empty cluster has been exhausted while
> there are planty of free swap entries to in the cluster that is
> not 100% free.
> 
> Remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.
> 
> This greatly improve the sucess rate of the mTHP swap allocation.
> 

I'm having trouble understanding the overall impact of this on users. 
We fail the mTHP swap allocation and fall back, but things continue to
operate OK?

> There is some test number in the V1 thread of this series:
> https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org

Well, please let's get the latest numbers into the latest patchset. 
Along with a higher-level (and quantitative) description of the user impact.

I'll add this into mm-unstable now for some exposure, but at this point
I'm not able to determine whether it should go in as a hotfix for
6.10-rcX.




* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-15  1:06 ` [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Andrew Morton
@ 2024-06-15  2:51   ` Chris Li
  2024-06-15  2:59     ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Li @ 2024-06-15  2:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kairui Song, Ryan Roberts, Huang, Ying, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

On Fri, Jun 14, 2024 at 6:06 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 14 Jun 2024 16:48:06 -0700 Chris Li <chrisl@kernel.org> wrote:
>
> > This is the short term solutiolns "swap cluster order" listed
> > in my "Swap Abstraction" discussion slice 8 in the recent
> > LSF/MM conference.
> >
> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > orders" is introduced, it only allocates the mTHP swap entries
> > from new empty cluster list.  It has a fragmentation issue
> > reported by Barry.
> >
> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >
> > The mTHP allocation failure rate raises to almost 100% after a few
> > hours in Barry's test run.
> >
> > The reason is that all the empty cluster has been exhausted while
> > there are planty of free swap entries to in the cluster that is
> > not 100% free.
> >
> > Remember the swap allocation order in the cluster.
> > Keep track of the per order non full cluster list for later allocation.
> >
> > This greatly improve the sucess rate of the mTHP swap allocation.
> >
>
> I'm having trouble understanding the overall impact of this on users.
> We fail the mTHP swap allocation and fall back, but things continue to
> operate OK?

Things continue to operate OK in the sense that the mTHP has to be split
into 4K pages before swap out, i.e. the fallback. Swap out and
swap in continue to work with 4K pages rather than with the mTHP. Because
of the fallback, the mTHP-based zsmalloc compression with a 64K buffer will
not happen. That is the effect of the fallback. But mTHP swap out and swap
in are relatively new, so it is not really a regression.
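
A minimal sketch of the fallback described above, not the kernel code
path. struct toy_allocator, toy_alloc() and toy_swap_out() are
hypothetical stand-ins. The sketch first tries one swap entry of the
requested order and, if that fails, splits into 1 << order order-0
entries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the swap allocator. */
struct toy_allocator {
	bool large_ok;		/* can a contiguous large entry be found? */
	int small_left;		/* order-0 entries still available */
};

/* Returns true when an entry of the given order was allocated. */
static bool toy_alloc(struct toy_allocator *a, int order)
{
	if (order > 0)
		return a->large_ok;
	if (a->small_left == 0)
		return false;
	a->small_left--;
	return true;
}

/*
 * Returns the number of units swapped out separately: 1 when the folio
 * stays whole, 1 << order after a split, or -1 when no space is left.
 * Entries taken before a failure are not rolled back in this sketch.
 */
static int toy_swap_out(struct toy_allocator *a, int order)
{
	int nr = 1 << order;

	if (toy_alloc(a, order))
		return 1;
	if (order == 0)
		return -1;
	for (int i = 0; i < nr; i++)
		if (!toy_alloc(a, 0))
			return -1;
	return nr;
}
```

With 4K pages a 64K mTHP is order 4, so a failed large allocation in this
sketch turns one 64K unit into 16 separate 4K units.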

>
> > There is some test number in the V1 thread of this series:
> > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>
> Well, please let's get the latest numbers into the latest patchset.
> Along with a higher-level (and quantitative) description of the user impact.

I will need Barry's help to collect the numbers. I don't have the
setup to reproduce his test results.
Maybe a follow-up commit message amendment with the test numbers when I get them?

>
> I'll add this into mm-unstable now for some exposure, but at this point
> I'm not able to determine whether it should go in as a hotfix for
> 6.10-rcX.

It may not need to be a hotfix. Not all of Barry's mTHP swap-out and
swap-in patches have been merged yet.

Chris



* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-15  2:51   ` Chris Li
@ 2024-06-15  2:59     ` Andrew Morton
  2024-06-15  8:47       ` Barry Song
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2024-06-15  2:59 UTC (permalink / raw)
  To: Chris Li
  Cc: Kairui Song, Ryan Roberts, Huang, Ying, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:

> > I'm having trouble understanding the overall impact of this on users.
> > We fail the mTHP swap allocation and fall back, but things continue to
> > operate OK?
> 
> Continue to operate OK in the sense that the mTHP will have to split
> into 4K pages before the swap out, aka the fall back. The swap out and
> swap in can continue to work as 4K pages, not as the mTHP. Due to the
> fallback, the mTHP based zsmalloc compression with 64K buffer will not
> happen. That is the effect of the fallback. But mTHP swap out and swap
> in is relatively new, it is not really a regression.

Sure, but it's pretty bad to merge a new feature only to have it
ineffective after a few hours use.

> >
> > > There is some test number in the V1 thread of this series:
> > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> >
> > Well, please let's get the latest numbers into the latest patchset.
> > Along with a higher-level (and quantitative) description of the user impact.
> 
> I will need Barry's help to collect the number. I don't have the
> setup to reproduce his test result.
> Maybe a follow up commit message amendment for the test number when I get it?

Yep, I alter changelogs all the time.

> >
> > I'll add this into mm-unstable now for some exposure, but at this point
> > I'm not able to determine whether it should go in as a hotfix for
> > 6.10-rcX.
> 
> Maybe not need to be a hotfix. Not all Barry's mTHP swap out and swap
> in patch got merged yet.

OK, well please let's give appropriate consideration to what we should
add to 6.10-rcX in order to have this feature working well.



* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-15  2:59     ` Andrew Morton
@ 2024-06-15  8:47       ` Barry Song
  2024-06-17  3:00         ` Huang, Ying
                           ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Barry Song @ 2024-06-15  8:47 UTC (permalink / raw)
  To: akpm, chrisl
  Cc: baohua, kaleshsingh, kasong, linux-kernel, linux-mm,
	ryan.roberts, ying.huang

On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
>
> > > I'm having trouble understanding the overall impact of this on users.
> > > We fail the mTHP swap allocation and fall back, but things continue to
> > > operate OK?
> >
> > Continue to operate OK in the sense that the mTHP will have to split
> > into 4K pages before the swap out, aka the fall back. The swap out and
> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > happen. That is the effect of the fallback. But mTHP swap out and swap
> > in is relatively new, it is not really a regression.
>
> Sure, but it's pretty bad to merge a new feature only to have it
> ineffective after a few hours use.
>
> > >
> > > > There is some test number in the V1 thread of this series:
> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> > >
> > > Well, please let's get the latest numbers into the latest patchset.
> > > Along with a higher-level (and quantitative) description of the user impact.
> >
> > I will need Barry's help to collect the number. I don't have the
> > setup to reproduce his test result.
> > Maybe a follow up commit message amendment for the test number when I get it?

Although the issue may seem complex at a systemic level, even a small program can
demonstrate the problem and highlight how Chris's patch has improved the
situation.

To demonstrate this, I wrote a basic test program that allocates up to
two memory blocks:

 *   A 60MB memory block, marked with MADV_HUGEPAGE
 *   A 1MB memory block, marked with MADV_NOHUGEPAGE (only allocated with -s)

In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
enough space for both the 60MB and 1MB allocations in the worst case. This setup
allows us to assess two effects:

1.  When we don't enable mem2 (small folios), we consistently allocate and free
    swap slots aligned with 64KB. Is there still a risk of failing to obtain
    swap slots even though the zRAM has sufficient free space?
2.  When we enable mem2 (small folios), the presence of small folios may lead
    to fragmentation of clusters, potentially impacting the swapout process for
    large folios negatively.

(2) can be enabled with "-s"; without -s, small folios are disabled.

The script to configure zRAM and mTHP:

echo lzo > /sys/block/zram0/comp_algorithm
echo 64M > /sys/block/zram0/disksize
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
mkswap /dev/zram0
swapon /dev/zram0

The test program I made today after receiving Chris' patchset v2 follows.

(Andrew, please let me know if you want this small test program to
be committed into kernel/tools/ folder. If so, I will clean it up
and prepare a patch):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <errno.h>
#include <time.h>

#define MEMSIZE_MTHP (60 * 1024 * 1024)
#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
#define ALIGNMENT_MTHP (64 * 1024)
#define ALIGNMENT_SMALLFOLIO (4 * 1024)
#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
#define TOTAL_DONTNEED_SMALLFOLIO (256 * 1024)
#define MTHP_FOLIO_SIZE (64 * 1024)

#define SWPOUT_PATH \
    "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
#define SWPOUT_FALLBACK_PATH \
    "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"

static void *aligned_alloc_mem(size_t size, size_t alignment)
{
    void *mem = NULL;
    if (posix_memalign(&mem, alignment, size) != 0) {
        perror("posix_memalign");
        return NULL;
    }
    return mem;
}

static void random_madvise_dontneed(void *mem, size_t mem_size,
                                     size_t align_size, size_t total_dontneed_size)
{
    size_t num_pages = total_dontneed_size / align_size;
    size_t i;
    size_t offset;
    void *addr;

    for (i = 0; i < num_pages; ++i) {
        offset = (rand() % (mem_size / align_size)) * align_size;
        addr = (char *)mem + offset;
        if (madvise(addr, align_size, MADV_DONTNEED) != 0) {
            perror("madvise dontneed");
        }
        memset(addr, 0x11, align_size);
    }
}

static unsigned long read_stat(const char *path)
{
    FILE *file;
    unsigned long value;

    file = fopen(path, "r");
    if (!file) {
        perror("fopen");
        return 0;
    }

    if (fscanf(file, "%lu", &value) != 1) {
        perror("fscanf");
        fclose(file);
        return 0;
    }

    fclose(file);
    return value;
}

int main(int argc, char *argv[])
{
    int use_small_folio = 0;
    int i;
    void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
    if (mem1 == NULL) {
        fprintf(stderr, "Failed to allocate 60MB memory\n");
        return EXIT_FAILURE;
    }

    if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
        perror("madvise hugepage for mem1");
        free(mem1);
        return EXIT_FAILURE;
    }

    for (i = 1; i < argc; ++i) {
        if (strcmp(argv[i], "-s") == 0) {
            use_small_folio = 1;
        }
    }

    void *mem2 = NULL;
    if (use_small_folio) {
        mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
        if (mem2 == NULL) {
            fprintf(stderr, "Failed to allocate 1MB memory\n");
            free(mem1);
            return EXIT_FAILURE;
        }

        if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
            perror("madvise nohugepage for mem2");
            free(mem1);
            free(mem2);
            return EXIT_FAILURE;
        }
    }

    for (i = 0; i < 100; ++i) {
        unsigned long initial_swpout;
        unsigned long initial_swpout_fallback;
        unsigned long final_swpout;
        unsigned long final_swpout_fallback;
        unsigned long swpout_inc;
        unsigned long swpout_fallback_inc;
        double fallback_percentage;

        initial_swpout = read_stat(SWPOUT_PATH);
        initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);

        random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
                                 TOTAL_DONTNEED_MTHP);

        if (use_small_folio) {
            random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
                                     ALIGNMENT_SMALLFOLIO,
                                     TOTAL_DONTNEED_SMALLFOLIO);
        }

        if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
            perror("madvise pageout for mem1");
            free(mem1);
            if (mem2 != NULL) {
                free(mem2);
            }
            return EXIT_FAILURE;
        }

        if (use_small_folio) {
            if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
                perror("madvise pageout for mem2");
                free(mem1);
                free(mem2);
                return EXIT_FAILURE;
            }
        }

        final_swpout = read_stat(SWPOUT_PATH);
        final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);

        swpout_inc = final_swpout - initial_swpout;
        swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;

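        /* Avoid 0/0 (NaN) when neither counter moved in this iteration. */
        fallback_percentage = (swpout_fallback_inc + swpout_inc) ?
                              (double)swpout_fallback_inc /
                              (swpout_fallback_inc + swpout_inc) * 100 : 0.0;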

        printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
               i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
    }

    free(mem1);
    if (mem2 != NULL) {
        free(mem2);
    }

    return EXIT_SUCCESS;
}

w/o Chris' patchset:

Test A. without small folios

$ /home/barry/develop/linux/mthp_swpout_test

Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 189, swpout fallback inc: 42, Fallback percentage: 18.18%
Iteration 5: swpout inc: 6, swpout fallback inc: 212, Fallback percentage: 97.25%
Iteration 6: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
Iteration 7: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 8: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
Iteration 9: swpout inc: 0, swpout fallback inc: 217, Fallback percentage: 100.00%
Iteration 10: swpout inc: 1, swpout fallback inc: 226, Fallback percentage: 99.56%
Iteration 11: swpout inc: 0, swpout fallback inc: 226, Fallback percentage: 100.00%
Iteration 12: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
Iteration 13: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
...

The mTHP swpout fallback ratio goes up to 100% within a few iterations!!!

Test B. with small folios

$ /home/barry/develop/linux/mthp_swpout_test -s

Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 20, swpout fallback inc: 206, Fallback percentage: 91.15%
Iteration 5: swpout inc: 26, swpout fallback inc: 201, Fallback percentage: 88.55%
Iteration 6: swpout inc: 2, swpout fallback inc: 216, Fallback percentage: 99.08%
Iteration 7: swpout inc: 16, swpout fallback inc: 209, Fallback percentage: 92.89%
Iteration 8: swpout inc: 5, swpout fallback inc: 222, Fallback percentage: 97.80%
Iteration 9: swpout inc: 0, swpout fallback inc: 226, Fallback percentage: 100.00%
Iteration 10: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 11: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 12: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
Iteration 13: swpout inc: 0, swpout fallback inc: 226, Fallback percentage: 100.00%
Iteration 14: swpout inc: 0, swpout fallback inc: 234, Fallback percentage: 100.00%
...

The mTHP swpout fallback ratio goes up to 100% within a few iterations!!!


w/ Chris' patchset:

Test C. without small folios
$ /home/barry/develop/linux/mthp_swpout_test

Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 7: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 8: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 9: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 10: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 11: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 12: swpout inc: 210, swpout fallback inc: 17, Fallback percentage: 7.49%
Iteration 13: swpout inc: 230, swpout fallback inc: 1, Fallback percentage: 0.43%
Iteration 14: swpout inc: 209, swpout fallback inc: 13, Fallback percentage: 5.86%
Iteration 15: swpout inc: 214, swpout fallback inc: 16, Fallback percentage: 6.96%
Iteration 16: swpout inc: 214, swpout fallback inc: 12, Fallback percentage: 5.31%
Iteration 17: swpout inc: 227, swpout fallback inc: 6, Fallback percentage: 2.58%
Iteration 18: swpout inc: 203, swpout fallback inc: 24, Fallback percentage: 10.57%
Iteration 19: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
Iteration 20: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 21: swpout inc: 217, swpout fallback inc: 13, Fallback percentage: 5.65%
Iteration 22: swpout inc: 205, swpout fallback inc: 17, Fallback percentage: 7.66%
Iteration 23: swpout inc: 213, swpout fallback inc: 15, Fallback percentage: 6.58%
Iteration 24: swpout inc: 234, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 25: swpout inc: 205, swpout fallback inc: 18, Fallback percentage: 8.07%
Iteration 26: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 27: swpout inc: 219, swpout fallback inc: 6, Fallback percentage: 2.67%
Iteration 28: swpout inc: 215, swpout fallback inc: 14, Fallback percentage: 6.11%
Iteration 29: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 30: swpout inc: 208, swpout fallback inc: 13, Fallback percentage: 5.88%
Iteration 31: swpout inc: 219, swpout fallback inc: 6, Fallback percentage: 2.67%
Iteration 32: swpout inc: 216, swpout fallback inc: 7, Fallback percentage: 3.14%
Iteration 33: swpout inc: 201, swpout fallback inc: 28, Fallback percentage: 12.23%
Iteration 34: swpout inc: 232, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 35: swpout inc: 215, swpout fallback inc: 17, Fallback percentage: 7.33%
Iteration 36: swpout inc: 209, swpout fallback inc: 16, Fallback percentage: 7.11%
Iteration 37: swpout inc: 202, swpout fallback inc: 29, Fallback percentage: 12.55%
Iteration 38: swpout inc: 200, swpout fallback inc: 18, Fallback percentage: 8.26%
Iteration 39: swpout inc: 219, swpout fallback inc: 12, Fallback percentage: 5.19%
Iteration 40: swpout inc: 218, swpout fallback inc: 9, Fallback percentage: 3.96%
Iteration 41: swpout inc: 212, swpout fallback inc: 14, Fallback percentage: 6.19%
Iteration 42: swpout inc: 204, swpout fallback inc: 15, Fallback percentage: 6.85%
Iteration 43: swpout inc: 222, swpout fallback inc: 5, Fallback percentage: 2.20%
Iteration 44: swpout inc: 205, swpout fallback inc: 20, Fallback percentage: 8.89%
Iteration 45: swpout inc: 217, swpout fallback inc: 6, Fallback percentage: 2.69%
Iteration 46: swpout inc: 209, swpout fallback inc: 19, Fallback percentage: 8.33%
Iteration 47: swpout inc: 205, swpout fallback inc: 13, Fallback percentage: 5.96%
Iteration 48: swpout inc: 223, swpout fallback inc: 4, Fallback percentage: 1.76%
Iteration 49: swpout inc: 203, swpout fallback inc: 21, Fallback percentage: 9.38%
Iteration 50: swpout inc: 193, swpout fallback inc: 19, Fallback percentage: 8.96%
Iteration 51: swpout inc: 197, swpout fallback inc: 29, Fallback percentage: 12.83%
Iteration 52: swpout inc: 195, swpout fallback inc: 29, Fallback percentage: 12.95%
Iteration 53: swpout inc: 217, swpout fallback inc: 17, Fallback percentage: 7.26%
Iteration 54: swpout inc: 207, swpout fallback inc: 11, Fallback percentage: 5.05%
Iteration 55: swpout inc: 213, swpout fallback inc: 10, Fallback percentage: 4.48%
Iteration 56: swpout inc: 203, swpout fallback inc: 23, Fallback percentage: 10.18%
Iteration 57: swpout inc: 197, swpout fallback inc: 34, Fallback percentage: 14.72%
Iteration 58: swpout inc: 209, swpout fallback inc: 13, Fallback percentage: 5.86%
Iteration 59: swpout inc: 212, swpout fallback inc: 19, Fallback percentage: 8.23%
Iteration 60: swpout inc: 196, swpout fallback inc: 24, Fallback percentage: 10.91%
Iteration 61: swpout inc: 203, swpout fallback inc: 13, Fallback percentage: 6.02%
Iteration 62: swpout inc: 221, swpout fallback inc: 7, Fallback percentage: 3.07%
Iteration 63: swpout inc: 207, swpout fallback inc: 17, Fallback percentage: 7.59%
Iteration 64: swpout inc: 205, swpout fallback inc: 15, Fallback percentage: 6.82%
Iteration 65: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 66: swpout inc: 215, swpout fallback inc: 13, Fallback percentage: 5.70%
Iteration 67: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 68: swpout inc: 215, swpout fallback inc: 8, Fallback percentage: 3.59%
Iteration 69: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 70: swpout inc: 204, swpout fallback inc: 17, Fallback percentage: 7.69%
Iteration 71: swpout inc: 227, swpout fallback inc: 6, Fallback percentage: 2.58%
Iteration 72: swpout inc: 207, swpout fallback inc: 16, Fallback percentage: 7.17%
Iteration 73: swpout inc: 217, swpout fallback inc: 9, Fallback percentage: 3.98%
Iteration 74: swpout inc: 206, swpout fallback inc: 9, Fallback percentage: 4.19%
Iteration 75: swpout inc: 193, swpout fallback inc: 26, Fallback percentage: 11.87%
Iteration 76: swpout inc: 225, swpout fallback inc: 3, Fallback percentage: 1.32%
Iteration 77: swpout inc: 205, swpout fallback inc: 25, Fallback percentage: 10.87%
Iteration 78: swpout inc: 213, swpout fallback inc: 12, Fallback percentage: 5.33%
Iteration 79: swpout inc: 212, swpout fallback inc: 10, Fallback percentage: 4.50%
Iteration 80: swpout inc: 210, swpout fallback inc: 9, Fallback percentage: 4.11%
Iteration 81: swpout inc: 225, swpout fallback inc: 4, Fallback percentage: 1.75%
Iteration 82: swpout inc: 211, swpout fallback inc: 3, Fallback percentage: 1.40%
Iteration 83: swpout inc: 216, swpout fallback inc: 10, Fallback percentage: 4.42%
Iteration 84: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 85: swpout inc: 213, swpout fallback inc: 13, Fallback percentage: 5.75%
Iteration 86: swpout inc: 225, swpout fallback inc: 3, Fallback percentage: 1.32%
Iteration 87: swpout inc: 204, swpout fallback inc: 22, Fallback percentage: 9.73%
Iteration 88: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 89: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 90: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
Iteration 91: swpout inc: 212, swpout fallback inc: 13, Fallback percentage: 5.78%
Iteration 92: swpout inc: 207, swpout fallback inc: 18, Fallback percentage: 8.00%
Iteration 93: swpout inc: 209, swpout fallback inc: 25, Fallback percentage: 10.68%
Iteration 94: swpout inc: 213, swpout fallback inc: 13, Fallback percentage: 5.75%
Iteration 95: swpout inc: 206, swpout fallback inc: 18, Fallback percentage: 8.04%
Iteration 96: swpout inc: 206, swpout fallback inc: 17, Fallback percentage: 7.62%
Iteration 97: swpout inc: 216, swpout fallback inc: 11, Fallback percentage: 4.85%
Iteration 98: swpout inc: 210, swpout fallback inc: 13, Fallback percentage: 5.83%
Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 100: swpout inc: 205, swpout fallback inc: 21, Fallback percentage: 9.29%
...

The mTHP swpout fallback ratio stays low and stable over 100 iterations!!!
Though the numbers are very good, I wonder why it is not 0%, since 64MB is
larger than 60MB. Chris, do you have any idea?

Test D. with small folios
$ /home/barry/develop/linux/mthp_swpout_test -s

[ 1013.535798] ------------[ cut here ]------------
[ 1013.538886] expecting order 4 got 0
[ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
[ 1013.544460] Modules linked in:
[ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
[ 1013.545990] Hardware name: linux,dummy-virt (DT)
[ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
[ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
[ 1013.548263] sp : ffff8000863e32e0
[ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
[ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
[ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
[ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
[ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
[ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
[ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
[ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
[ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
[ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
[ 1013.557627] Call trace:
[ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
[ 1013.558498]  get_swap_pages+0x23c/0xc20
[ 1013.558899]  folio_alloc_swap+0x5c/0x248
[ 1013.559544]  add_to_swap+0x40/0xf0
[ 1013.559904]  shrink_folio_list+0x6dc/0xf20
[ 1013.560289]  reclaim_folio_list+0x8c/0x168
[ 1013.560710]  reclaim_pages+0xfc/0x178
[ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
[ 1013.561524]  walk_pgd_range+0x390/0x808
[ 1013.561920]  __walk_page_range+0x1e0/0x1f0
[ 1013.562370]  walk_page_range+0x1f0/0x2c8
[ 1013.562888]  madvise_pageout+0xf8/0x280
[ 1013.563388]  madvise_vma_behavior+0x314/0xa20
[ 1013.563982]  madvise_walk_vmas+0xc0/0x128
[ 1013.564386]  do_madvise.part.0+0x110/0x558
[ 1013.564792]  __arm64_sys_madvise+0x68/0x88
[ 1013.565333]  invoke_syscall+0x50/0x128
[ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
[ 1013.566285]  do_el0_svc+0x28/0x40
[ 1013.566667]  el0_svc+0x50/0x150
[ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
[ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
[ 1013.568058] irq event stamp: 0
[ 1013.568661] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
[ 1013.570167] softirqs last  enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
[ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 1013.571330] ---[ end trace 0000000000000000 ]---
Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 7: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 8: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 9: swpout inc: 224, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 10: swpout inc: 213, swpout fallback inc: 11, Fallback percentage: 4.91%
Iteration 11: swpout inc: 219, swpout fallback inc: 9, Fallback percentage: 3.95%
Iteration 12: swpout inc: 207, swpout fallback inc: 20, Fallback percentage: 8.81%
Iteration 13: swpout inc: 193, swpout fallback inc: 33, Fallback percentage: 14.60%
Iteration 14: swpout inc: 215, swpout fallback inc: 19, Fallback percentage: 8.12%
Iteration 15: swpout inc: 217, swpout fallback inc: 12, Fallback percentage: 5.24%
Iteration 16: swpout inc: 207, swpout fallback inc: 15, Fallback percentage: 6.76%
Iteration 17: swpout inc: 207, swpout fallback inc: 23, Fallback percentage: 10.00%
Iteration 18: swpout inc: 198, swpout fallback inc: 30, Fallback percentage: 13.16%
Iteration 19: swpout inc: 199, swpout fallback inc: 26, Fallback percentage: 11.56%
Iteration 20: swpout inc: 197, swpout fallback inc: 27, Fallback percentage: 12.05%
Iteration 21: swpout inc: 192, swpout fallback inc: 25, Fallback percentage: 11.52%
Iteration 22: swpout inc: 190, swpout fallback inc: 30, Fallback percentage: 13.64%
Iteration 23: swpout inc: 203, swpout fallback inc: 27, Fallback percentage: 11.74%
Iteration 24: swpout inc: 197, swpout fallback inc: 32, Fallback percentage: 13.97%
Iteration 25: swpout inc: 184, swpout fallback inc: 41, Fallback percentage: 18.22%
Iteration 26: swpout inc: 203, swpout fallback inc: 28, Fallback percentage: 12.12%
Iteration 27: swpout inc: 193, swpout fallback inc: 31, Fallback percentage: 13.84%
Iteration 28: swpout inc: 191, swpout fallback inc: 43, Fallback percentage: 18.38%
Iteration 29: swpout inc: 194, swpout fallback inc: 31, Fallback percentage: 13.78%
Iteration 30: swpout inc: 180, swpout fallback inc: 50, Fallback percentage: 21.74%
Iteration 31: swpout inc: 205, swpout fallback inc: 22, Fallback percentage: 9.69%
Iteration 32: swpout inc: 199, swpout fallback inc: 24, Fallback percentage: 10.76%
Iteration 33: swpout inc: 192, swpout fallback inc: 34, Fallback percentage: 15.04%
Iteration 34: swpout inc: 186, swpout fallback inc: 38, Fallback percentage: 16.96%
Iteration 35: swpout inc: 190, swpout fallback inc: 32, Fallback percentage: 14.41%
Iteration 36: swpout inc: 181, swpout fallback inc: 41, Fallback percentage: 18.47%
Iteration 37: swpout inc: 181, swpout fallback inc: 47, Fallback percentage: 20.61%
Iteration 38: swpout inc: 173, swpout fallback inc: 45, Fallback percentage: 20.64%
Iteration 39: swpout inc: 196, swpout fallback inc: 27, Fallback percentage: 12.11%
Iteration 40: swpout inc: 195, swpout fallback inc: 27, Fallback percentage: 12.16%
Iteration 41: swpout inc: 195, swpout fallback inc: 31, Fallback percentage: 13.72%
Iteration 42: swpout inc: 189, swpout fallback inc: 34, Fallback percentage: 15.25%
Iteration 43: swpout inc: 185, swpout fallback inc: 41, Fallback percentage: 18.14%
Iteration 44: swpout inc: 189, swpout fallback inc: 34, Fallback percentage: 15.25%
Iteration 45: swpout inc: 177, swpout fallback inc: 49, Fallback percentage: 21.68%
Iteration 46: swpout inc: 193, swpout fallback inc: 36, Fallback percentage: 15.72%
Iteration 47: swpout inc: 197, swpout fallback inc: 30, Fallback percentage: 13.22%
Iteration 48: swpout inc: 188, swpout fallback inc: 24, Fallback percentage: 11.32%
Iteration 49: swpout inc: 187, swpout fallback inc: 29, Fallback percentage: 13.43%
Iteration 50: swpout inc: 181, swpout fallback inc: 48, Fallback percentage: 20.96%
Iteration 51: swpout inc: 191, swpout fallback inc: 28, Fallback percentage: 12.79%
Iteration 52: swpout inc: 184, swpout fallback inc: 43, Fallback percentage: 18.94%
Iteration 53: swpout inc: 184, swpout fallback inc: 44, Fallback percentage: 19.30%
Iteration 54: swpout inc: 173, swpout fallback inc: 49, Fallback percentage: 22.07%
Iteration 55: swpout inc: 170, swpout fallback inc: 47, Fallback percentage: 21.66%
Iteration 56: swpout inc: 185, swpout fallback inc: 43, Fallback percentage: 18.86%
Iteration 57: swpout inc: 178, swpout fallback inc: 55, Fallback percentage: 23.61%
Iteration 58: swpout inc: 178, swpout fallback inc: 50, Fallback percentage: 21.93%
Iteration 59: swpout inc: 181, swpout fallback inc: 45, Fallback percentage: 19.91%
Iteration 60: swpout inc: 180, swpout fallback inc: 45, Fallback percentage: 20.00%
Iteration 61: swpout inc: 172, swpout fallback inc: 56, Fallback percentage: 24.56%
Iteration 62: swpout inc: 184, swpout fallback inc: 44, Fallback percentage: 19.30%
Iteration 63: swpout inc: 174, swpout fallback inc: 42, Fallback percentage: 19.44%
Iteration 64: swpout inc: 166, swpout fallback inc: 51, Fallback percentage: 23.50%
Iteration 65: swpout inc: 172, swpout fallback inc: 57, Fallback percentage: 24.89%
Iteration 66: swpout inc: 180, swpout fallback inc: 40, Fallback percentage: 18.18%
Iteration 67: swpout inc: 173, swpout fallback inc: 63, Fallback percentage: 26.69%
Iteration 68: swpout inc: 186, swpout fallback inc: 43, Fallback percentage: 18.78%
Iteration 69: swpout inc: 175, swpout fallback inc: 53, Fallback percentage: 23.25%
Iteration 70: swpout inc: 170, swpout fallback inc: 54, Fallback percentage: 24.11%
Iteration 71: swpout inc: 166, swpout fallback inc: 62, Fallback percentage: 27.19%
Iteration 72: swpout inc: 169, swpout fallback inc: 54, Fallback percentage: 24.22%
Iteration 73: swpout inc: 175, swpout fallback inc: 50, Fallback percentage: 22.22%
Iteration 74: swpout inc: 160, swpout fallback inc: 60, Fallback percentage: 27.27%
Iteration 75: swpout inc: 173, swpout fallback inc: 45, Fallback percentage: 20.64%
Iteration 76: swpout inc: 172, swpout fallback inc: 61, Fallback percentage: 26.18%
Iteration 77: swpout inc: 173, swpout fallback inc: 50, Fallback percentage: 22.42%
Iteration 78: swpout inc: 160, swpout fallback inc: 65, Fallback percentage: 28.89%
Iteration 79: swpout inc: 165, swpout fallback inc: 61, Fallback percentage: 26.99%
Iteration 80: swpout inc: 183, swpout fallback inc: 43, Fallback percentage: 19.03%
Iteration 81: swpout inc: 206, swpout fallback inc: 22, Fallback percentage: 9.65%
Iteration 82: swpout inc: 176, swpout fallback inc: 49, Fallback percentage: 21.78%
Iteration 83: swpout inc: 184, swpout fallback inc: 45, Fallback percentage: 19.65%
Iteration 84: swpout inc: 181, swpout fallback inc: 45, Fallback percentage: 19.91%
Iteration 85: swpout inc: 175, swpout fallback inc: 56, Fallback percentage: 24.24%
Iteration 86: swpout inc: 157, swpout fallback inc: 59, Fallback percentage: 27.31%
Iteration 87: swpout inc: 171, swpout fallback inc: 54, Fallback percentage: 24.00%
Iteration 88: swpout inc: 189, swpout fallback inc: 34, Fallback percentage: 15.25%
Iteration 89: swpout inc: 185, swpout fallback inc: 45, Fallback percentage: 19.57%
Iteration 90: swpout inc: 173, swpout fallback inc: 49, Fallback percentage: 22.07%
Iteration 91: swpout inc: 170, swpout fallback inc: 58, Fallback percentage: 25.44%
Iteration 92: swpout inc: 184, swpout fallback inc: 44, Fallback percentage: 19.30%
Iteration 93: swpout inc: 193, swpout fallback inc: 37, Fallback percentage: 16.09%
Iteration 94: swpout inc: 181, swpout fallback inc: 38, Fallback percentage: 17.35%
Iteration 95: swpout inc: 205, swpout fallback inc: 25, Fallback percentage: 10.87%
Iteration 96: swpout inc: 164, swpout fallback inc: 49, Fallback percentage: 23.00%
Iteration 97: swpout inc: 158, swpout fallback inc: 65, Fallback percentage: 29.15%
Iteration 98: swpout inc: 168, swpout fallback inc: 57, Fallback percentage: 25.33%
Iteration 99: swpout inc: 163, swpout fallback inc: 56, Fallback percentage: 25.57%
Iteration 100: swpout inc: 180, swpout fallback inc: 44, Fallback percentage: 19.64%

This is worse than test C but still far better than tests
A and B.

I previously observed a 100% fallback ratio when testing v1 on an actual phone,
especially when order-0 and mTHP were combined within one zRAM. This likely
triggered a scenario where order-0 had to scan swap to locate free slots,
resulting in fragmentation across all clusters.

I am not sure whether this still happens in v2; I will arrange a phone test next
week. If it does, I am still eager to see some approach that prevents order-0
from spreading across all clusters.

BTW, Chris,

Is the warning "expecting order 4 got 0" normal in the above test?

>
> Yep, I alter changelogs all the time.
>
> > >
> > > I'll add this into mm-unstable now for some exposure, but at this point
> > > I'm not able to determine whether it should go in as a hotfix for
> > > 6.10-rcX.
> >
> > Maybe not need to be a hotfix. Not all Barry's mTHP swap out and swap
> > in patch got merged yet.

This could be a hotfix, considering that swapping out an mTHP is slower than
swapping out nr_pages small folios, due to the overhead of splitting the folio
when we have to fall back. Ryan had regression data on this before [1]:

"
| alloc size |                baseline |           + this series |
|            | mm-unstable (~v6.9-rc1) |                         |
|:-----------|------------------------:|------------------------:|
| 4K Page    |                    0.0% |                    1.3% |
| 64K THP    |                  -13.6% |                   46.3% |
| 2M THP     |                   91.4% |                   89.6% |

So with this change, the 64K swap performance goes from a 14% regression to a
46% improvement. While 2M shows a small regression I'm confident that this is
just noise."

Ryan reported a 14% regression when mTHP cannot be swapped out as a whole,
compared to using only small folios.

[1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/

>
> OK, well please let's give appropriate consideration to what we should
> add to 6.10-rcX in order to have this feature working well.

Thanks
Barry



* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-15  8:47       ` Barry Song
@ 2024-06-17  3:00         ` Huang, Ying
  2024-06-17  3:12           ` Barry Song
  2024-06-17  6:48         ` Huang, Ying
  2024-06-17 18:34         ` Chris Li
  2 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2024-06-17  3:00 UTC (permalink / raw)
  To: Barry Song, akpm, chrisl
  Cc: baohua, kaleshsingh, kasong, linux-kernel, linux-mm, ryan.roberts


Barry Song <21cnbao@gmail.com> writes:

> On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
>>
>> > > I'm having trouble understanding the overall impact of this on users.
>> > > We fail the mTHP swap allocation and fall back, but things continue to
>> > > operate OK?
>> >
>> > Continue to operate OK in the sense that the mTHP will have to split
>> > into 4K pages before the swap out, aka the fall back. The swap out and
>> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
>> > fallback, the mTHP based zsmalloc compression with 64K buffer will not
>> > happen. That is the effect of the fallback. But mTHP swap out and swap
>> > in is relatively new, it is not really a regression.
>>
>> Sure, but it's pretty bad to merge a new feature only to have it
>> ineffective after a few hours use.
>>
>> > >
>> > > > There is some test number in the V1 thread of this series:
>> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>> > >
>> > > Well, please let's get the latest numbers into the latest patchset.
>> > > Along with a higher-level (and quantitative) description of the user impact.
>> >
>> > I will need Barray's help to collect the number. I don't have the
>> > setup to reproduce his test result.
>> > Maybe a follow up commit message amendment for the test number when I get it?
>
> Although the issue may seem complex at a systemic level, even a small program can
> demonstrate the problem and highlight how Chris's patch has improved the
> situation.
>
> To demonstrate this, I designed a basic test program that maximally allocates
> two memory blocks:
>
>  *   A memory block of up to 60MB, recommended for HUGEPAGE usage
>  *   A memory block of up to 1MB, recommended for NOHUGEPAGE usage
>
> In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
> enough space for both the 60MB and 1MB allocations in the worst case. This setup
> allows us to assess two effects:
>
> 1.  When we don't enable mem2 (small folios), we consistently allocate and free
>     swap slots aligned with 64KB.  whether there is a risk of failure to obtain
>     swap slots even though the zRAM has sufficient free space?
> 2.  When we enable mem2 (small folios), the presence of small folios may lead
>     to fragmentation of clusters, potentially impacting the swapout process for
>     large folios negatively.
>

IIUC, the test results are based on not-yet-merged patchset [1] (mm:
support large folios swap-in)?

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/

If so, do we have any visible effect without that?  If not, should we
wait for patchset [1] (or something similar) to be merged first?

--
Best Regards,
Huang, Ying



* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-17  3:00         ` Huang, Ying
@ 2024-06-17  3:12           ` Barry Song
  2024-06-17  3:29             ` Barry Song
  0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2024-06-17  3:12 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, chrisl, kaleshsingh, kasong, linux-kernel, linux-mm, ryan.roberts

On Monday, June 17, 2024, Huang, Ying <ying.huang@intel.com> wrote:

>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
> >>
> >> > > I'm having trouble understanding the overall impact of this on users.
> >> > > We fail the mTHP swap allocation and fall back, but things continue to
> >> > > operate OK?
> >> >
> >> > Continue to operate OK in the sense that the mTHP will have to split
> >> > into 4K pages before the swap out, aka the fall back. The swap out and
> >> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> >> > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> >> > happen. That is the effect of the fallback. But mTHP swap out and swap
> >> > in is relatively new, it is not really a regression.
> >>
> >> Sure, but it's pretty bad to merge a new feature only to have it
> >> ineffective after a few hours use.
> >>
> >> > >
> >> > > > There is some test number in the V1 thread of this series:
> >> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> >> > >
> >> > > Well, please let's get the latest numbers into the latest patchset.
> >> > > Along with a higher-level (and quantitative) description of the user impact.
> >> >
> >> > I will need Barray's help to collect the number. I don't have the
> >> > setup to reproduce his test result.
> >> > Maybe a follow up commit message amendment for the test number when I get it?
> >
> > Although the issue may seem complex at a systemic level, even a small program can
> > demonstrate the problem and highlight how Chris's patch has improved the
> > situation.
> >
> > To demonstrate this, I designed a basic test program that maximally allocates
> > two memory blocks:
> >
> >  *   A memory block of up to 60MB, recommended for HUGEPAGE usage
> >  *   A memory block of up to 1MB, recommended for NOHUGEPAGE usage
> >
> > In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
> > enough space for both the 60MB and 1MB allocations in the worst case. This setup
> > allows us to assess two effects:
> >
> > 1.  When we don't enable mem2 (small folios), we consistently allocate and free
> >     swap slots aligned with 64KB.  whether there is a risk of failure to obtain
> >     swap slots even though the zRAM has sufficient free space?
> > 2.  When we enable mem2 (small folios), the presence of small folios may lead
> >     to fragmentation of clusters, potentially impacting the swapout process for
> >     large folios negatively.
> >
>
> IIUC, the test results are based on not-yet-merged patchset [1] (mm:
> support large folios swap-in)?


No, this data is based on mm-unstable.

The visible impact is that swapping out an mTHP that falls back is 14% slower
than swapping out nr_pages small folios, regardless of whether mTHP swap-in
is present.



> [1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
>
> If so, do we have any visible effect without that?  If not, should we
> wait for patchset [1] (or something similar) to be merged firstly?
>
> --
> Best Regards,
> Huang, Ying
>
>

* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-17  3:12           ` Barry Song
@ 2024-06-17  3:29             ` Barry Song
  0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2024-06-17  3:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, chrisl, kaleshsingh, kasong, linux-kernel, linux-mm, ryan.roberts

On Mon, Jun 17, 2024 at 3:12 PM Barry Song <21cnbao@gmail.com> wrote:
>
>
>
> On Monday, June 17, 2024, Huang, Ying <ying.huang@intel.com> wrote:
>>
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >>
>> >> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
>> >>
>> >> > > I'm having trouble understanding the overall impact of this on users.
>> >> > > We fail the mTHP swap allocation and fall back, but things continue to
>> >> > > operate OK?
>> >> >
>> >> > Continue to operate OK in the sense that the mTHP will have to split
>> >> > into 4K pages before the swap out, aka the fall back. The swap out and
>> >> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
>> >> > fallback, the mTHP based zsmalloc compression with 64K buffer will not
>> >> > happen. That is the effect of the fallback. But mTHP swap out and swap
>> >> > in is relatively new, it is not really a regression.
>> >>
>> >> Sure, but it's pretty bad to merge a new feature only to have it
>> >> ineffective after a few hours use.
>> >>
>> >> > >
>> >> > > > There is some test number in the V1 thread of this series:
>> >> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>> >> > >
>> >> > > Well, please let's get the latest numbers into the latest patchset.
>> >> > > Along with a higher-level (and quantitative) description of the user impact.
>> >> >
>> >> > I will need Barray's help to collect the number. I don't have the
>> >> > setup to reproduce his test result.
>> >> > Maybe a follow up commit message amendment for the test number when I get it?
>> >
>> > Although the issue may seem complex at a systemic level, even a small program can
>> > demonstrate the problem and highlight how Chris's patch has improved the
>> > situation.
>> >
>> > To demonstrate this, I designed a basic test program that maximally allocates
>> > two memory blocks:
>> >
>> >  *   A memory block of up to 60MB, recommended for HUGEPAGE usage
>> >  *   A memory block of up to 1MB, recommended for NOHUGEPAGE usage
>> >
>> > In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
>> > enough space for both the 60MB and 1MB allocations in the worst case. This setup
>> > allows us to assess two effects:
>> >
>> > 1.  When we don't enable mem2 (small folios), we consistently allocate and free
>> >     swap slots aligned with 64KB.  whether there is a risk of failure to obtain
>> >     swap slots even though the zRAM has sufficient free space?
>> > 2.  When we enable mem2 (small folios), the presence of small folios may lead
>> >     to fragmentation of clusters, potentially impacting the swapout process for
>> >     large folios negatively.
>> >
>>
>> IIUC, the test results are based on not-yet-merged patchset [1] (mm:
>> support large folios swap-in)?
>
>
> no. this data is based on mm-unstable.
>
> the visible impact is that swapping out mthp will have 14% regression if
> fallback againest swapping out nr_pages small folios regardless if mthp swapin is there.

Ryan initially reported a 14% swap-out regression when mTHP cannot be swapped
out as a whole, and then a 46.3% improvement when mTHP can be swapped out as
a whole [1].

So we drop 60%+ in performance on fallback. The 14% regression against
pure small folios is unacceptable, considering that more than 2/3 of memory can
be swapped out on mobile devices.
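
To spell out the arithmetic behind the "60%+" figure, taking Ryan's 64K numbers at face value (both are relative to the 4K small-folio baseline). This is only a sketch; the macro and function names are made up for illustration:

```c
#include <assert.h>

/* Ryan's 64K THP results, in percent relative to 4K small-folio swap-out. */
#define MTHP_FALLBACK_PCT	(-13.6)	/* mTHP split before swap-out */
#define MTHP_WHOLE_PCT		46.3	/* mTHP swapped out as a whole */

/* Gap between the two cases, in percentage points of the 4K baseline. */
static double fallback_gap_points(void)
{
	return MTHP_WHOLE_PCT - MTHP_FALLBACK_PCT;
}

/* The same gap, expressed relative to the whole-mTHP throughput. */
static double fallback_relative_drop_pct(void)
{
	double whole = 100.0 + MTHP_WHOLE_PCT;
	double fallback = 100.0 + MTHP_FALLBACK_PCT;

	assert(whole > 0.0);
	return 100.0 * (whole - fallback) / whole;
}
```

The gap is 59.9 points of the 4K baseline, which is where the roughly 60% comes from. Measured against the whole-mTHP case instead, the same gap is a drop of about 41%.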

So I am hoping we can find some way to merge Chris' patchset soon, though
the WARN_ONCE still indicates some bug in v2. Hopefully, Chris can fix it
in v3.

[1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/

>
>
>>
>> [1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
>>
>> If so, do we have any visible effect without that?  If not, should we
>> wait for patchset [1] (or something similar) to be merged firstly?
>>
>> --
>> Best Regards,
>> Huang, Ying
>>

Thanks
Barry


* Re: [PATCH v2 1/2] mm: swap: swap cluster switch to double link list
  2024-06-14 23:48 ` [PATCH v2 1/2] mm: swap: swap cluster switch to double link list Chris Li
@ 2024-06-17  6:19   ` Huang, Ying
  2024-06-18  5:06     ` Chris Li
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2024-06-17  6:19 UTC (permalink / raw)
  To: Chris Li
  Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

Hi, Chris,

Chris Li <chrisl@kernel.org> writes:

> Previously, the swap cluster used a cluster index as a pointer
> to construct a custom single link list type "swap_cluster_list".
> The next cluster pointer is shared with the cluster->count.
> It prevents putting the non free cluster into a list.
> Change the cluster to use the standard double link list instead.
> This allows tracking the nonfull cluster in the follow up patch.
>
> Remove the cluster getter/setter for accessing the cluster
> struct member.
>
> The list operation is protected by the swap_info_struct->lock.
>
> Change cluster code to use "struct swap_cluster_info *" to
> reference the cluster rather than by using index. That is more
> consistent with the list manipulation. It avoids repeatedly
> adding the index to cluster_info. The code is easier to understand.
>
> Remove the cluster next pointer is NULL flag, the double link
> list can handle the empty list pretty well.

The above is more about "what" instead of "why".  We can identify "what"
from the patch itself.  I expect more "why".  I guess that we can reduce
swap_map[] scanning if we have lists of non-full/non-free clusters.

> The "swap_cluster_info" struct is two pointer bigger, because
> 512 swap entries share one swap struct, it has very little impact
> on the average memory usage per swap entry. For 1TB swapfile, the
> swap cluster data structure increases from 8MB to 24MB.
>
> Other than the list conversion, there is no real function change
> in this patch.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
>  include/linux/swap.h |  28 +++----
>  mm/swapfile.c        | 227 +++++++++++++--------------------------------------
>  2 files changed, 70 insertions(+), 185 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3df75d62a835..cd9154a3e934 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -242,23 +242,22 @@ enum {
>   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
>   * free clusters are organized into a list. We fetch an entry from the list to
>   * get a free cluster.
> - *
> - * The data field stores next cluster if the cluster is free or cluster usage
> - * counter otherwise. The flags field determines if a cluster is free. This is
> - * protected by swap_info_struct.lock.
>   */
>  struct swap_cluster_info {
>  	spinlock_t lock;	/*
> -				 * Protect swap_cluster_info fields
> -				 * and swap_info_struct->swap_map
> +				 * Protect swap_cluster_info count and state

Protect swap_cluster_info fields except 'list' ?

> +				 * field and swap_info_struct->swap_map
>  				 * elements correspond to the swap
>  				 * cluster
>  				 */
> -	unsigned int data:24;
> -	unsigned int flags:8;
> +	unsigned int count:12;
> +	unsigned int state:3;

I still prefer normal data type over bit fields.  How about

        u16 usage;
        u8  state;

And how about using 'usage' instead of 'count'?  Personally I think that
it is clearer.  But I don't have strong opinions on this.
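
A user-space sketch of that layout, for illustration only: uint16_t/uint8_t stand in for u16/u8, the spinlock and list_head members are left out, and the struct and helper names are placeholders. SWAPFILE_CLUSTER is assumed to be 512 here, following the "512 swap entries" in the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define SWAPFILE_CLUSTER 512	/* assumed; swap entries per cluster */

/* Stand-in for the proposed fields; lock and list are omitted. */
struct cluster_fields_sketch {
	uint16_t usage;	/* entries in use, 0..SWAPFILE_CLUSTER */
	uint8_t  state;	/* a CLUSTER_STATE_* value */
};

/* Hypothetical helper: bump usage with the same bound the patch checks. */
static void cluster_usage_add(struct cluster_fields_sketch *ci, unsigned int n)
{
	assert(ci->usage + n <= SWAPFILE_CLUSTER);
	ci->usage = (uint16_t)(ci->usage + n);
}
```

A 12-bit field can also hold 512, so the difference is readability and plain loads/stores rather than range.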

> +	struct list_head list;	/* Protected by swap_info_struct->lock */
>  };
> -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> +
> +#define CLUSTER_STATE_FREE	1 /* This cluster is free */

Can we use swap_cluster_info->count == 0?

> +#define CLUSTER_STATE_PER_CPU	2 /* This cluster on per_cpu_cluster  */
> +

There are no users of this state in this patch.  IMHO, it's better to
introduce a symbol together with its users; otherwise, it's hard to understand
why we need it and how to use it.  And, IIUC, the state isn't
maintained properly: it should be changed when we move the cluster off
the per-cpu cluster.

>  /*
>   * The first page in the swap file is the swap header, which is always marked
> @@ -283,11 +282,6 @@ struct percpu_cluster {
>  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>  
> -struct swap_cluster_list {
> -	struct swap_cluster_info head;
> -	struct swap_cluster_info tail;
> -};
> -
>  /*
>   * The in-memory structure used to track swap areas.
>   */
> @@ -300,7 +294,7 @@ struct swap_info_struct {
>  	unsigned int	max;		/* extent of the swap_map */
>  	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> -	struct swap_cluster_list free_clusters; /* free clusters list */
> +	struct list_head free_clusters; /* free clusters list */
>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>  	unsigned int highest_bit;	/* index of last free in swap_map */
>  	unsigned int pages;		/* total of usable pages of swap */
> @@ -331,7 +325,7 @@ struct swap_info_struct {
>  					 * list.
>  					 */
>  	struct work_struct discard_work; /* discard worker */
> -	struct swap_cluster_list discard_clusters; /* discard clusters list */
> +	struct list_head discard_clusters; /* discard clusters list */
>  	struct plist_node avail_lists[]; /*
>  					   * entries in swap_avail_heads, one
>  					   * entry per node.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9c6d8e557c0f..2f878b374349 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -290,62 +290,9 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  #endif
>  #define LATENCY_LIMIT		256
>  
> -static inline void cluster_set_flag(struct swap_cluster_info *info,
> -	unsigned int flag)
> -{
> -	info->flags = flag;
> -}
> -
> -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> -{
> -	return info->data;
> -}
> -
> -static inline void cluster_set_count(struct swap_cluster_info *info,
> -				     unsigned int c)
> -{
> -	info->data = c;
> -}
> -
> -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> -					 unsigned int c, unsigned int f)
> -{
> -	info->flags = f;
> -	info->data = c;
> -}
> -
> -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> -{
> -	return info->data;
> -}
> -
> -static inline void cluster_set_next(struct swap_cluster_info *info,
> -				    unsigned int n)
> -{
> -	info->data = n;
> -}
> -
> -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> -					 unsigned int n, unsigned int f)
> -{
> -	info->flags = f;
> -	info->data = n;
> -}
> -
>  static inline bool cluster_is_free(struct swap_cluster_info *info)
>  {
> -	return info->flags & CLUSTER_FLAG_FREE;
> -}
> -
> -static inline bool cluster_is_null(struct swap_cluster_info *info)
> -{
> -	return info->flags & CLUSTER_FLAG_NEXT_NULL;
> -}
> -
> -static inline void cluster_set_null(struct swap_cluster_info *info)
> -{
> -	info->flags = CLUSTER_FLAG_NEXT_NULL;
> -	info->data = 0;
> +	return info->state == CLUSTER_STATE_FREE;
>  }
>  
>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
>  		spin_unlock(&si->lock);
>  }
>  
> -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> -{
> -	return cluster_is_null(&list->head);
> -}
> -
> -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> -{
> -	return cluster_next(&list->head);
> -}
> -
> -static void cluster_list_init(struct swap_cluster_list *list)
> -{
> -	cluster_set_null(&list->head);
> -	cluster_set_null(&list->tail);
> -}
> -
> -static void cluster_list_add_tail(struct swap_cluster_list *list,
> -				  struct swap_cluster_info *ci,
> -				  unsigned int idx)
> -{
> -	if (cluster_list_empty(list)) {
> -		cluster_set_next_flag(&list->head, idx, 0);
> -		cluster_set_next_flag(&list->tail, idx, 0);
> -	} else {
> -		struct swap_cluster_info *ci_tail;
> -		unsigned int tail = cluster_next(&list->tail);
> -
> -		/*
> -		 * Nested cluster lock, but both cluster locks are
> -		 * only acquired when we held swap_info_struct->lock
> -		 */
> -		ci_tail = ci + tail;
> -		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> -		cluster_set_next(ci_tail, idx);
> -		spin_unlock(&ci_tail->lock);
> -		cluster_set_next_flag(&list->tail, idx, 0);
> -	}
> -}
> -
> -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> -					   struct swap_cluster_info *ci)
> -{
> -	unsigned int idx;
> -
> -	idx = cluster_next(&list->head);
> -	if (cluster_next(&list->tail) == idx) {
> -		cluster_set_null(&list->head);
> -		cluster_set_null(&list->tail);
> -	} else
> -		cluster_set_next_flag(&list->head,
> -				      cluster_next(&ci[idx]), 0);
> -
> -	return idx;
> -}
> -
>  /* Add a cluster to discard list and schedule it to do discard */
>  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> -		unsigned int idx)
> +		struct swap_cluster_info *ci)
>  {
> +	unsigned int idx = ci - si->cluster_info;

I see this multiple times in the patch, can we define a helper for this?
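
A sketch of what such a helper could look like. The name cluster_index() is hypothetical, and the two structs below are minimal user-space stand-ins for the kernel ones, containing only what the helper touches:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structs, not the real definitions. */
struct swap_cluster_info {
	unsigned int count;
};

struct swap_info_struct {
	struct swap_cluster_info *cluster_info;	/* array, one entry per cluster */
};

/* Hypothetical helper: index of ci within si->cluster_info. */
static inline unsigned int cluster_index(struct swap_info_struct *si,
					 struct swap_cluster_info *ci)
{
	ptrdiff_t idx = ci - si->cluster_info;

	assert(idx >= 0);	/* ci must point into si->cluster_info */
	return (unsigned int)idx;
}
```

Call sites such as "idx = ci - si->cluster_info" would then become "idx = cluster_index(si, ci)".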

>  	/*
>  	 * If scan_swap_map_slots() can't find a free cluster, it will check
>  	 * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -462,17 +355,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>  
> -	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> -
> +	list_add_tail(&ci->list, &si->discard_clusters);
>  	schedule_work(&si->discard_work);
>  }
>  
> -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> -	struct swap_cluster_info *ci = si->cluster_info;
> -
> -	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> -	cluster_list_add_tail(&si->free_clusters, ci, idx);
> +	ci->state = CLUSTER_STATE_FREE;
> +	list_add_tail(&ci->list, &si->free_clusters);
>  }
>  
>  /*
> @@ -481,21 +371,22 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>  */
>  static void swap_do_scheduled_discard(struct swap_info_struct *si)
>  {
> -	struct swap_cluster_info *info, *ci;
> +	struct swap_cluster_info *ci;
>  	unsigned int idx;
>  
> -	info = si->cluster_info;
> -
> -	while (!cluster_list_empty(&si->discard_clusters)) {
> -		idx = cluster_list_del_first(&si->discard_clusters, info);
> +	while (!list_empty(&si->discard_clusters)) {
> +		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> +		list_del(&ci->list);
> +		idx = ci - si->cluster_info;
>  		spin_unlock(&si->lock);
>  
>  		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
>  				SWAPFILE_CLUSTER);
>  
>  		spin_lock(&si->lock);
> -		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> -		__free_cluster(si, idx);
> +
> +		spin_lock(&ci->lock);

Personally, I still prefer to use lock_cluster(), which is more readable
and matches unlock_cluster() below.

> +		__free_cluster(si, ci);
>  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  				0, SWAPFILE_CLUSTER);
>  		unlock_cluster(ci);
> @@ -521,20 +412,19 @@ static void swap_users_ref_free(struct percpu_ref *ref)
>  	complete(&si->comp);
>  }
>  
> -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>  {
> -	struct swap_cluster_info *ci = si->cluster_info;
> +	struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>  
> -	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> -	cluster_list_del_first(&si->free_clusters, ci);
> -	cluster_set_count_flag(ci + idx, 0, 0);
> +	VM_BUG_ON(ci - si->cluster_info != idx);
> +	list_del(&ci->list);
> +	ci->count = 0;

Do we need this now?  If we keep CLUSTER_STATE_FREE, we need to change
it here.

> +	return ci;
>  }
>  
> -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> -	struct swap_cluster_info *ci = si->cluster_info + idx;
> -
> -	VM_BUG_ON(cluster_count(ci) != 0);
> +	VM_BUG_ON(ci->count != 0);
>  	/*
>  	 * If the swap is discardable, prepare discard the cluster
>  	 * instead of free it immediately. The cluster will be freed
> @@ -542,11 +432,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>  	 */
>  	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
>  	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> -		swap_cluster_schedule_discard(si, idx);
> +		swap_cluster_schedule_discard(si, ci);
>  		return;
>  	}
>  
> -	__free_cluster(si, idx);
> +	__free_cluster(si, ci);
>  }
>  
>  /*
> @@ -559,15 +449,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
>  	unsigned long count)
>  {
>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> +	struct swap_cluster_info *ci = cluster_info + idx;
>  
>  	if (!cluster_info)
>  		return;
> -	if (cluster_is_free(&cluster_info[idx]))
> +	if (cluster_is_free(ci))
>  		alloc_cluster(p, idx);
>  
> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> -	cluster_set_count(&cluster_info[idx],
> -		cluster_count(&cluster_info[idx]) + count);
> +	VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> +	ci->count += count;
>  }
>  
>  /*
> @@ -581,24 +471,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>  }
>  
>  /*
> - * The cluster corresponding to page_nr decreases one usage. If the usage
> - * counter becomes 0, which means no page in the cluster is in using, we can
> - * optionally discard the cluster and add it to free cluster list.
> + * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * which means no page in the cluster is in using, we can optionally discard
> + * the cluster and add it to free cluster list.
>   */
> -static void dec_cluster_info_page(struct swap_info_struct *p,
> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
>  {
> -	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> -
> -	if (!cluster_info)
> +	if (!p->cluster_info)
>  		return;
>  
> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> -	cluster_set_count(&cluster_info[idx],
> -		cluster_count(&cluster_info[idx]) - 1);
> +	VM_BUG_ON(ci->count == 0);
> +	ci->count--;
>  
> -	if (cluster_count(&cluster_info[idx]) == 0)
> -		free_cluster(p, idx);
> +	if (!ci->count)
> +		free_cluster(p, ci);
>  }
>  
>  /*
> @@ -611,10 +497,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>  {
>  	struct percpu_cluster *percpu_cluster;
>  	bool conflict;
> -

Usually we use one blank line after local variable declaration.

> +	struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>  	offset /= SWAPFILE_CLUSTER;
> -	conflict = !cluster_list_empty(&si->free_clusters) &&
> -		offset != cluster_list_first(&si->free_clusters) &&
> +	conflict = !list_empty(&si->free_clusters) &&
> +		offset !=  first - si->cluster_info &&
>  		cluster_is_free(&si->cluster_info[offset]);
>  
>  	if (!conflict)
> @@ -655,10 +541,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	cluster = this_cpu_ptr(si->percpu_cluster);
>  	tmp = cluster->next[order];
>  	if (tmp == SWAP_NEXT_INVALID) {
> -		if (!cluster_list_empty(&si->free_clusters)) {
> -			tmp = cluster_next(&si->free_clusters.head) *
> -					SWAPFILE_CLUSTER;
> -		} else if (!cluster_list_empty(&si->discard_clusters)) {
> +		if (!list_empty(&si->free_clusters)) {
> +			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> +			list_del(&ci->list);

The free cluster is deleted from si->free_clusters now.  But later you
will call scan_swap_map_ssd_cluster_conflict() and may abandon the
cluster.  And in alloc_cluster() later, it may be deleted again.

> +			spin_lock(&ci->lock);
> +			ci->state = CLUSTER_STATE_PER_CPU;

Need to change ci->state when moving a cluster off the percpu_cluster.

> +			spin_unlock(&ci->lock);
> +			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> +		} else if (!list_empty(&si->discard_clusters)) {
>  			/*
>  			 * we don't have free cluster but have some clusters in
>  			 * discarding, do discard now and reclaim them, then
> @@ -1062,8 +952,8 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>  
>  	ci = lock_cluster(si, offset);
>  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> -	cluster_set_count_flag(ci, 0, 0);
> -	free_cluster(si, idx);
> +	ci->count = 0;
> +	free_cluster(si, ci);
>  	unlock_cluster(ci);
>  	swap_range_free(si, offset, SWAPFILE_CLUSTER);
>  }
> @@ -1336,7 +1226,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>  	count = p->swap_map[offset];
>  	VM_BUG_ON(count != SWAP_HAS_CACHE);
>  	p->swap_map[offset] = 0;
> -	dec_cluster_info_page(p, p->cluster_info, offset);
> +	dec_cluster_info_page(p, ci);
>  	unlock_cluster(ci);
>  
>  	mem_cgroup_uncharge_swap(entry, 1);
> @@ -3003,8 +2893,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>  
>  	nr_good_pages = maxpages - 1;	/* omit header page */
>  
> -	cluster_list_init(&p->free_clusters);
> -	cluster_list_init(&p->discard_clusters);
> +	INIT_LIST_HEAD(&p->free_clusters);
> +	INIT_LIST_HEAD(&p->discard_clusters);
>  
>  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
>  		unsigned int page_nr = swap_header->info.badpages[i];
> @@ -3055,14 +2945,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>  	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
>  		j = (k + col) % SWAP_CLUSTER_COLS;
>  		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> +			struct swap_cluster_info *ci;
>  			idx = i * SWAP_CLUSTER_COLS + j;
> +			ci = cluster_info + idx;
>  			if (idx >= nr_clusters)
>  				continue;
> -			if (cluster_count(&cluster_info[idx]))
> +			if (ci->count)
>  				continue;
> -			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> -			cluster_list_add_tail(&p->free_clusters, cluster_info,
> -					      idx);
> +			ci->state = CLUSTER_STATE_FREE;
> +			list_add_tail(&ci->list, &p->free_clusters);
>  		}
>  	}
>  	return nr_extents;

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-15  8:47       ` Barry Song
  2024-06-17  3:00         ` Huang, Ying
@ 2024-06-17  6:48         ` Huang, Ying
  2024-06-17  7:08           ` Barry Song
  2024-06-17 18:34         ` Chris Li
  2 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2024-06-17  6:48 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, chrisl, baohua, kaleshsingh, kasong, linux-kernel,
	linux-mm, ryan.roberts

Hi, Barry,

Barry Song <21cnbao@gmail.com> writes:

> On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
>>
>> > > I'm having trouble understanding the overall impact of this on users.
>> > > We fail the mTHP swap allocation and fall back, but things continue to
>> > > operate OK?
>> >
>> > Continue to operate OK in the sense that the mTHP will have to split
>> > into 4K pages before the swap out, aka the fall back. The swap out and
>> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
>> > fallback, the mTHP based zsmalloc compression with 64K buffer will not
>> > happen. That is the effect of the fallback. But mTHP swap out and swap
>> > in is relatively new, it is not really a regression.
>>
>> Sure, but it's pretty bad to merge a new feature only to have it
>> ineffective after a few hours use.
>>
>> > >
>> > > > There is some test number in the V1 thread of this series:
>> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>> > >
>> > > Well, please let's get the latest numbers into the latest patchset.
>> > > Along with a higher-level (and quantitative) description of the user impact.
>> >
>> > I will need Barray's help to collect the number. I don't have the
>> > setup to reproduce his test result.
>> > Maybe a follow up commit message amendment for the test number when I get it?
>
> Although the issue may seem complex at a systemic level, even a small program can
> demonstrate the problem and highlight how Chris's patch has improved the
> situation.
>
> To demonstrate this, I designed a basic test program that maximally allocates
> two memory blocks:
>
>  *   A memory block of up to 60MB, recommended for HUGEPAGE usage
>  *   A memory block of up to 1MB, recommended for NOHUGEPAGE usage
>
> In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
> enough space for both the 60MB and 1MB allocations in the worst case. This setup
> allows us to assess two effects:
>
> 1.  When we don't enable mem2 (small folios), we consistently allocate and free
>     swap slots aligned with 64KB.  whether there is a risk of failure to obtain
>     swap slots even though the zRAM has sufficient free space?
> 2.  When we enable mem2 (small folios), the presence of small folios may lead
>     to fragmentation of clusters, potentially impacting the swapout process for
>     large folios negatively.
>
> (2) can be enabled by "-s", without -s, small folios are disabled.
>
> The script to configure zRAM and mTHP:
>
> echo lzo > /sys/block/zram0/comp_algorithm
> echo 64M > /sys/block/zram0/disksize
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> mkswap /dev/zram0
> swapon /dev/zram0
>
> The test program I made today after receiving Chris' patchset v2
>
> (Andrew, Please let me know if you want this small test program to
> be committed into kernel/tools/ folder. If yes, please let me know,
> and I will cleanup and prepare a patch):
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <errno.h>
> #include <time.h>
>
> #define MEMSIZE_MTHP (60 * 1024 * 1024)
> #define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> #define ALIGNMENT_MTHP (64 * 1024)
> #define ALIGNMENT_SMALLFOLIO (4 * 1024)
> #define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> #define TOTAL_DONTNEED_SMALLFOLIO (256 * 1024)
> #define MTHP_FOLIO_SIZE (64 * 1024)
>
> #define SWPOUT_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> #define SWPOUT_FALLBACK_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
>
> static void *aligned_alloc_mem(size_t size, size_t alignment)
> {
>     void *mem = NULL;
>     if (posix_memalign(&mem, alignment, size) != 0) {
>         perror("posix_memalign");
>         return NULL;
>     }
>     return mem;
> }
>
> static void random_madvise_dontneed(void *mem, size_t mem_size,
>                                      size_t align_size, size_t total_dontneed_size)
> {
>     size_t num_pages = total_dontneed_size / align_size;
>     size_t i;
>     size_t offset;
>     void *addr;
>
>     for (i = 0; i < num_pages; ++i) {
>         offset = (rand() % (mem_size / align_size)) * align_size;
>         addr = (char *)mem + offset;
>         if (madvise(addr, align_size, MADV_DONTNEED) != 0) {
>             perror("madvise dontneed");
>         }
>         memset(addr, 0x11, align_size);
>     }
> }
>
> static unsigned long read_stat(const char *path)
> {
>     FILE *file;
>     unsigned long value;
>
>     file = fopen(path, "r");
>     if (!file) {
>         perror("fopen");
>         return 0;
>     }
>
>     if (fscanf(file, "%lu", &value) != 1) {
>         perror("fscanf");
>         fclose(file);
>         return 0;
>     }
>
>     fclose(file);
>     return value;
> }
>
> int main(int argc, char *argv[])
> {
>     int use_small_folio = 0;
>     int i;
>     void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
>     if (mem1 == NULL) {
>         fprintf(stderr, "Failed to allocate 60MB memory\n");
>         return EXIT_FAILURE;
>     }
>
>     if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
>         perror("madvise hugepage for mem1");
>         free(mem1);
>         return EXIT_FAILURE;
>     }
>
>     for (i = 1; i < argc; ++i) {
>         if (strcmp(argv[i], "-s") == 0) {
>             use_small_folio = 1;
>         }
>     }
>
>     void *mem2 = NULL;
>     if (use_small_folio) {
>         mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
>         if (mem2 == NULL) {
>             fprintf(stderr, "Failed to allocate 1MB memory\n");
>             free(mem1);
>             return EXIT_FAILURE;
>         }
>
>         if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
>             perror("madvise nohugepage for mem2");
>             free(mem1);
>             free(mem2);
>             return EXIT_FAILURE;
>         }
>     }
>
>     for (i = 0; i < 100; ++i) {
>         unsigned long initial_swpout;
>         unsigned long initial_swpout_fallback;
>         unsigned long final_swpout;
>         unsigned long final_swpout_fallback;
>         unsigned long swpout_inc;
>         unsigned long swpout_fallback_inc;
>         double fallback_percentage;
>
>         initial_swpout = read_stat(SWPOUT_PATH);
>         initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
>                                  TOTAL_DONTNEED_MTHP);
>
>         if (use_small_folio) {
>             random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
>                                      ALIGNMENT_SMALLFOLIO,
>                                      TOTAL_DONTNEED_SMALLFOLIO);
>         }
>
>         if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
>             perror("madvise pageout for mem1");
>             free(mem1);
>             if (mem2 != NULL) {
>                 free(mem2);
>             }
>             return EXIT_FAILURE;
>         }
>
>         if (use_small_folio) {
>             if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
>                 perror("madvise pageout for mem2");
>                 free(mem1);
>                 free(mem2);
>                 return EXIT_FAILURE;
>             }
>         }
>
>         final_swpout = read_stat(SWPOUT_PATH);
>         final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         swpout_inc = final_swpout - initial_swpout;
>         swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
>
>         fallback_percentage = (double)swpout_fallback_inc /
>                               (swpout_fallback_inc + swpout_inc) * 100;
>
>         printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
>                i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
>     }
>
>     free(mem1);
>     if (mem2 != NULL) {
>         free(mem2);
>     }
>
>     return EXIT_SUCCESS;
> }

Thank you very much for your effort in writing this test program.

TBH, personally, I think that this test program isn't practical
enough.  Can we show the performance difference with some normal workloads?

[snip]

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-17  6:48         ` Huang, Ying
@ 2024-06-17  7:08           ` Barry Song
  0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2024-06-17  7:08 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, chrisl, kaleshsingh, kasong, linux-kernel, linux-mm, ryan.roberts

On Mon, Jun 17, 2024 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Barry,
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
> >>
> >> > > I'm having trouble understanding the overall impact of this on users.
> >> > > We fail the mTHP swap allocation and fall back, but things continue to
> >> > > operate OK?
> >> >
> >> > Continue to operate OK in the sense that the mTHP will have to split
> >> > into 4K pages before the swap out, aka the fall back. The swap out and
> >> > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> >> > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> >> > happen. That is the effect of the fallback. But mTHP swap out and swap
> >> > in is relatively new, it is not really a regression.
> >>
> >> Sure, but it's pretty bad to merge a new feature only to have it
> >> ineffective after a few hours use.
> >>
> >> > >
> >> > > > There is some test number in the V1 thread of this series:
> >> > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> >> > >
> >> > > Well, please let's get the latest numbers into the latest patchset.
> >> > > Along with a higher-level (and quantitative) description of the user impact.
> >> >
> >> > I will need Barray's help to collect the number. I don't have the
> >> > setup to reproduce his test result.
> >> > Maybe a follow up commit message amendment for the test number when I get it?
> >
> > Although the issue may seem complex at a systemic level, even a small program can
> > demonstrate the problem and highlight how Chris's patch has improved the
> > situation.
> >
> > To demonstrate this, I designed a basic test program that maximally allocates
> > two memory blocks:
> >
> >  *   A memory block of up to 60MB, recommended for HUGEPAGE usage
> >  *   A memory block of up to 1MB, recommended for NOHUGEPAGE usage
> >
> > In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
> > enough space for both the 60MB and 1MB allocations in the worst case. This setup
> > allows us to assess two effects:
> >
> > 1.  When we don't enable mem2 (small folios), we consistently allocate and free
> >     swap slots aligned with 64KB.  whether there is a risk of failure to obtain
> >     swap slots even though the zRAM has sufficient free space?
> > 2.  When we enable mem2 (small folios), the presence of small folios may lead
> >     to fragmentation of clusters, potentially impacting the swapout process for
> >     large folios negatively.
> >
> > (2) can be enabled by "-s", without -s, small folios are disabled.
> >
> > The script to configure zRAM and mTHP:
> >
> > echo lzo > /sys/block/zram0/comp_algorithm
> > echo 64M > /sys/block/zram0/disksize
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > mkswap /dev/zram0
> > swapon /dev/zram0
> >
> > The test program I made today after receiving Chris' patchset v2
> >
> > (Andrew, Please let me know if you want this small test program to
> > be committed into kernel/tools/ folder. If yes, please let me know,
> > and I will cleanup and prepare a patch):
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #include <errno.h>
> > #include <time.h>
> >
> > #define MEMSIZE_MTHP (60 * 1024 * 1024)
> > #define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> > #define ALIGNMENT_MTHP (64 * 1024)
> > #define ALIGNMENT_SMALLFOLIO (4 * 1024)
> > #define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> > #define TOTAL_DONTNEED_SMALLFOLIO (256 * 1024)
> > #define MTHP_FOLIO_SIZE (64 * 1024)
> >
> > #define SWPOUT_PATH \
> >     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> > #define SWPOUT_FALLBACK_PATH \
> >     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
> >
> > static void *aligned_alloc_mem(size_t size, size_t alignment)
> > {
> >     void *mem = NULL;
> >     if (posix_memalign(&mem, alignment, size) != 0) {
> >         perror("posix_memalign");
> >         return NULL;
> >     }
> >     return mem;
> > }
> >
> > static void random_madvise_dontneed(void *mem, size_t mem_size,
> >                                      size_t align_size, size_t total_dontneed_size)
> > {
> >     size_t num_pages = total_dontneed_size / align_size;
> >     size_t i;
> >     size_t offset;
> >     void *addr;
> >
> >     for (i = 0; i < num_pages; ++i) {
> >         offset = (rand() % (mem_size / align_size)) * align_size;
> >         addr = (char *)mem + offset;
> >         if (madvise(addr, align_size, MADV_DONTNEED) != 0) {
> >             perror("madvise dontneed");
> >         }
> >         memset(addr, 0x11, align_size);
> >     }
> > }
> >
> > static unsigned long read_stat(const char *path)
> > {
> >     FILE *file;
> >     unsigned long value;
> >
> >     file = fopen(path, "r");
> >     if (!file) {
> >         perror("fopen");
> >         return 0;
> >     }
> >
> >     if (fscanf(file, "%lu", &value) != 1) {
> >         perror("fscanf");
> >         fclose(file);
> >         return 0;
> >     }
> >
> >     fclose(file);
> >     return value;
> > }
> >
> > int main(int argc, char *argv[])
> > {
> >     int use_small_folio = 0;
> >     int i;
> >     void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
> >     if (mem1 == NULL) {
> >         fprintf(stderr, "Failed to allocate 60MB memory\n");
> >         return EXIT_FAILURE;
> >     }
> >
> >     if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
> >         perror("madvise hugepage for mem1");
> >         free(mem1);
> >         return EXIT_FAILURE;
> >     }
> >
> >     for (i = 1; i < argc; ++i) {
> >         if (strcmp(argv[i], "-s") == 0) {
> >             use_small_folio = 1;
> >         }
> >     }
> >
> >     void *mem2 = NULL;
> >     if (use_small_folio) {
> >         mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
> >         if (mem2 == NULL) {
> >             fprintf(stderr, "Failed to allocate 1MB memory\n");
> >             free(mem1);
> >             return EXIT_FAILURE;
> >         }
> >
> >         if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
> >             perror("madvise nohugepage for mem2");
> >             free(mem1);
> >             free(mem2);
> >             return EXIT_FAILURE;
> >         }
> >     }
> >
> >     for (i = 0; i < 100; ++i) {
> >         unsigned long initial_swpout;
> >         unsigned long initial_swpout_fallback;
> >         unsigned long final_swpout;
> >         unsigned long final_swpout_fallback;
> >         unsigned long swpout_inc;
> >         unsigned long swpout_fallback_inc;
> >         double fallback_percentage;
> >
> >         initial_swpout = read_stat(SWPOUT_PATH);
> >         initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
> >
> >         random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
> >                                  TOTAL_DONTNEED_MTHP);
> >
> >         if (use_small_folio) {
> >             random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
> >                                      ALIGNMENT_SMALLFOLIO,
> >                                      TOTAL_DONTNEED_SMALLFOLIO);
> >         }
> >
> >         if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
> >             perror("madvise pageout for mem1");
> >             free(mem1);
> >             if (mem2 != NULL) {
> >                 free(mem2);
> >             }
> >             return EXIT_FAILURE;
> >         }
> >
> >         if (use_small_folio) {
> >             if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
> >                 perror("madvise pageout for mem2");
> >                 free(mem1);
> >                 free(mem2);
> >                 return EXIT_FAILURE;
> >             }
> >         }
> >
> >         final_swpout = read_stat(SWPOUT_PATH);
> >         final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
> >
> >         swpout_inc = final_swpout - initial_swpout;
> >         swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
> >
> >         fallback_percentage = (double)swpout_fallback_inc /
> >                               (swpout_fallback_inc + swpout_inc) * 100;
> >
> >         printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
> >                i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
> >     }
> >
> >     free(mem1);
> >     if (mem2 != NULL) {
> >         free(mem2);
> >     }
> >
> >     return EXIT_SUCCESS;
> > }
>
> Thank you very for your effort to write this test program.
>
> TBH, personally, I thought that this test program isn't practical
> enough.  Can we show performance difference with some normal workloads?

Right.

The whole purpose of this small program is to demonstrate the problem
in the current code: even when swap slots are always allocated and released
aligned with mTHP, the current mainline soon reaches a 100% fallback ratio,
although swap space is sufficient and swap slots are not fragmented at all.
As soon as we run out of empty clusters, we lose the chance to do mTHP
swapout.
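
The effect can be sketched with a toy user-space model (not the kernel
allocator). It assumes each cluster holds 32 chunks of one 64KB folio
each, and that an allocation either continues in the active cluster or
takes a completely empty one. All names here (toy_swap, toy_alloc(),
toy_free(), the use_nonfull switch) are hypothetical and exist only for
this illustration; use_nonfull stands in for the idea of also allocating
from partly used clusters.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CLUSTERS		4	/* toy value */
#define CHUNKS_PER_CLUSTER	32	/* assumed: 512 slots / 16 slots per 64KB folio */

struct toy_swap {
	bool used[NR_CLUSTERS][CHUNKS_PER_CLUSTER];
	int cur;	/* active cluster, -1 if none */
	int next;	/* next chunk to try in the active cluster */
};

static void toy_init(struct toy_swap *s)
{
	memset(s, 0, sizeof(*s));
	s->cur = -1;
}

static bool cluster_is_empty(const struct toy_swap *s, int c)
{
	for (int i = 0; i < CHUNKS_PER_CLUSTER; i++)
		if (s->used[c][i])
			return false;
	return true;
}

static void toy_free(struct toy_swap *s, int c, int i)
{
	s->used[c][i] = false;
}

/* Returns true on success; false models a fallback to 4K pages. */
static bool toy_alloc(struct toy_swap *s, bool use_nonfull)
{
	/* 1. Keep scanning forward in the active cluster. */
	if (s->cur >= 0) {
		while (s->next < CHUNKS_PER_CLUSTER) {
			int i = s->next++;

			if (!s->used[s->cur][i]) {
				s->used[s->cur][i] = true;
				return true;
			}
		}
		s->cur = -1;	/* active cluster exhausted */
	}

	/* 2. Otherwise take a completely empty cluster. */
	for (int c = 0; c < NR_CLUSTERS; c++) {
		if (cluster_is_empty(s, c)) {
			s->cur = c;
			s->used[c][0] = true;
			s->next = 1;
			return true;
		}
	}

	/* 3. Optional: reuse any free chunk in a partly used cluster. */
	if (use_nonfull) {
		for (int c = 0; c < NR_CLUSTERS; c++) {
			for (int i = 0; i < CHUNKS_PER_CLUSTER; i++) {
				if (!s->used[c][i]) {
					s->used[c][i] = true;
					return true;
				}
			}
		}
	}
	return false;
}
```

In this model, filling every chunk and then freeing every other chunk
leaves half of the space free but no empty cluster, so toy_alloc(&s, false)
fails while toy_alloc(&s, true) still succeeds.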

We are still running tests using real Android phones with real workloads, and
will update you with the results, hopefully this week. I am a little worried
that the triggered WARN_ONCE will cause the test to fail.

>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-15  8:47       ` Barry Song
  2024-06-17  3:00         ` Huang, Ying
  2024-06-17  6:48         ` Huang, Ying
@ 2024-06-17 18:34         ` Chris Li
  2024-06-17 23:00           ` Hugh Dickins
  2 siblings, 1 reply; 22+ messages in thread
From: Chris Li @ 2024-06-17 18:34 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, baohua, kaleshsingh, kasong, linux-kernel, linux-mm,
	ryan.roberts, ying.huang

On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
> >
> > > > I'm having trouble understanding the overall impact of this on users.
> > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > operate OK?
> > >
> > > Continue to operate OK in the sense that the mTHP will have to split
> > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > in is relatively new, it is not really a regression.
> >
> > Sure, but it's pretty bad to merge a new feature only to have it
> > ineffective after a few hours use.
> >
> > > >
> > > > > There is some test number in the V1 thread of this series:
> > > > > https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> > > >
> > > > Well, please let's get the latest numbers into the latest patchset.
> > > > Along with a higher-level (and quantitative) description of the user impact.
> > >
> > > I will need Barray's help to collect the number. I don't have the
> > > setup to reproduce his test result.
> > > Maybe a follow up commit message amendment for the test number when I get it?
>
> Although the issue may seem complex at a systemic level, even a small program can
> demonstrate the problem and highlight how Chris's patch has improved the
> situation.
>
> To demonstrate this, I designed a basic test program that maximally allocates
> two memory blocks:
>
>  *   A memory block of up to 60MB, recommended for HUGEPAGE usage
>  *   A memory block of up to 1MB, recommended for NOHUGEPAGE usage
>
> In the system configuration, I enabled 64KB mTHP and 64MB zRAM, providing more than
> enough space for both the 60MB and 1MB allocations in the worst case. This setup
> allows us to assess two effects:

Thanks for the test program. I will certainly use it to stress and
debug my patches. Currently I have some tests that exercise the swap
stack, but they do not stress it enough.

>
> 1.  When we don't enable mem2 (small folios), we consistently allocate and free
>     swap slots aligned with 64KB.  whether there is a risk of failure to obtain
>     swap slots even though the zRAM has sufficient free space?
> 2.  When we enable mem2 (small folios), the presence of small folios may lead
>     to fragmentation of clusters, potentially impacting the swapout process for
>     large folios negatively.
>
> (2) can be enabled by "-s", without -s, small folios are disabled.
>
> The script to configure zRAM and mTHP:
>
> echo lzo > /sys/block/zram0/comp_algorithm
> echo 64M > /sys/block/zram0/disksize
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> mkswap /dev/zram0
> swapon /dev/zram0
>
> The test program I made today after receiving Chris' patchset v2
>
> (Andrew, Please let me know if you want this small test program to
> be committed into kernel/tools/ folder. If yes, please let me know,
> and I will cleanup and prepare a patch):
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <errno.h>
> #include <time.h>
>
> #define MEMSIZE_MTHP (60 * 1024 * 1024)
> #define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> #define ALIGNMENT_MTHP (64 * 1024)
> #define ALIGNMENT_SMALLFOLIO (4 * 1024)
> #define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> #define TOTAL_DONTNEED_SMALLFOLIO (256 * 1024)
> #define MTHP_FOLIO_SIZE (64 * 1024)
>
> #define SWPOUT_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> #define SWPOUT_FALLBACK_PATH \
>     "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
>
> static void *aligned_alloc_mem(size_t size, size_t alignment)
> {
>     void *mem = NULL;
>     if (posix_memalign(&mem, alignment, size) != 0) {
>         perror("posix_memalign");
>         return NULL;
>     }
>     return mem;
> }
>
> static void random_madvise_dontneed(void *mem, size_t mem_size,
>                                      size_t align_size, size_t total_dontneed_size)
> {
>     size_t num_pages = total_dontneed_size / align_size;
>     size_t i;
>     size_t offset;
>     void *addr;
>
>     for (i = 0; i < num_pages; ++i) {
>         offset = (rand() % (mem_size / align_size)) * align_size;
>         addr = (char *)mem + offset;
>         if (madvise(addr, align_size, MADV_DONTNEED) != 0) {
>             perror("madvise dontneed");
>         }
>         memset(addr, 0x11, align_size);
>     }
> }
>
> static unsigned long read_stat(const char *path)
> {
>     FILE *file;
>     unsigned long value;
>
>     file = fopen(path, "r");
>     if (!file) {
>         perror("fopen");
>         return 0;
>     }
>
>     if (fscanf(file, "%lu", &value) != 1) {
>         perror("fscanf");
>         fclose(file);
>         return 0;
>     }
>
>     fclose(file);
>     return value;
> }
>
> int main(int argc, char *argv[])
> {
>     int use_small_folio = 0;
>     int i;
>     void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
>     if (mem1 == NULL) {
>         fprintf(stderr, "Failed to allocate 60MB memory\n");
>         return EXIT_FAILURE;
>     }
>
>     if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
>         perror("madvise hugepage for mem1");
>         free(mem1);
>         return EXIT_FAILURE;
>     }
>
>     for (i = 1; i < argc; ++i) {
>         if (strcmp(argv[i], "-s") == 0) {
>             use_small_folio = 1;
>         }
>     }
>
>     void *mem2 = NULL;
>     if (use_small_folio) {
>         mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
>         if (mem2 == NULL) {
>             fprintf(stderr, "Failed to allocate 1MB memory\n");
>             free(mem1);
>             return EXIT_FAILURE;
>         }
>
>         if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
>             perror("madvise nohugepage for mem2");
>             free(mem1);
>             free(mem2);
>             return EXIT_FAILURE;
>         }
>     }
>
>     for (i = 0; i < 100; ++i) {
>         unsigned long initial_swpout;
>         unsigned long initial_swpout_fallback;
>         unsigned long final_swpout;
>         unsigned long final_swpout_fallback;
>         unsigned long swpout_inc;
>         unsigned long swpout_fallback_inc;
>         double fallback_percentage;
>
>         initial_swpout = read_stat(SWPOUT_PATH);
>         initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
>                                  TOTAL_DONTNEED_MTHP);
>
>         if (use_small_folio) {
>             random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
>                                      ALIGNMENT_SMALLFOLIO,
>                                      TOTAL_DONTNEED_SMALLFOLIO);
>         }
>
>         if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
>             perror("madvise pageout for mem1");
>             free(mem1);
>             if (mem2 != NULL) {
>                 free(mem2);
>             }
>             return EXIT_FAILURE;
>         }
>
>         if (use_small_folio) {
>             if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
>                 perror("madvise pageout for mem2");
>                 free(mem1);
>                 free(mem2);
>                 return EXIT_FAILURE;
>             }
>         }
>
>         final_swpout = read_stat(SWPOUT_PATH);
>         final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
>
>         swpout_inc = final_swpout - initial_swpout;
>         swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
>
>         fallback_percentage = (double)swpout_fallback_inc /
>                               (swpout_fallback_inc + swpout_inc) * 100;
>
>         printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
>                i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
>     }
>
>     free(mem1);
>     if (mem2 != NULL) {
>         free(mem2);
>     }
>
>     return EXIT_SUCCESS;
> }
>
> w/o Chris' patchset:
>
> Test A. without small folios
>
> $ /home/barry/develop/linux/mthp_swpout_test
>
> Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 189, swpout fallback inc: 42, Fallback percentage: 18.18%
> Iteration 5: swpout inc: 6, swpout fallback inc: 212, Fallback percentage: 97.25%
> Iteration 6: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
> Iteration 7: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 8: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
> Iteration 9: swpout inc: 0, swpout fallback inc: 217, Fallback percentage: 100.00%
> Iteration 10: swpout inc: 1, swpout fallback inc: 226, Fallback percentage: 99.56%
> Iteration 11: swpout inc: 0, swpout fallback inc: 226, Fallback percentage: 100.00%
> Iteration 12: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
> Iteration 13: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
> ...
>
> mthp swpout fallback ratio immediately goes up to 100%!!!
>
> Test B. with small folios
>
> $ /home/barry/develop/linux/mthp_swpout_test -s
>
> Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 20, swpout fallback inc: 206, Fallback percentage: 91.15%
> Iteration 5: swpout inc: 26, swpout fallback inc: 201, Fallback percentage: 88.55%
> Iteration 6: swpout inc: 2, swpout fallback inc: 216, Fallback percentage: 99.08%
> Iteration 7: swpout inc: 16, swpout fallback inc: 209, Fallback percentage: 92.89%
> Iteration 8: swpout inc: 5, swpout fallback inc: 222, Fallback percentage: 97.80%
> Iteration 9: swpout inc: 0, swpout fallback inc: 226, Fallback percentage: 100.00%
> Iteration 10: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 11: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
> Iteration 12: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
> Iteration 13: swpout inc: 0, swpout fallback inc: 226, Fallback percentage: 100.00%
> Iteration 14: swpout inc: 0, swpout fallback inc: 234, Fallback percentage: 100.00%
> ...
>
> mthp swpout fallback ratio immediately goes up to 100%!!!
>
>
> w/ Chris' patchset:
>
> Test C. without small folios
> $ /home/barry/develop/linux/mthp_swpout_test
>
> Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 7: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 8: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 9: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 10: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 11: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 12: swpout inc: 210, swpout fallback inc: 17, Fallback percentage: 7.49%
> Iteration 13: swpout inc: 230, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 14: swpout inc: 209, swpout fallback inc: 13, Fallback percentage: 5.86%
> Iteration 15: swpout inc: 214, swpout fallback inc: 16, Fallback percentage: 6.96%
> Iteration 16: swpout inc: 214, swpout fallback inc: 12, Fallback percentage: 5.31%
> Iteration 17: swpout inc: 227, swpout fallback inc: 6, Fallback percentage: 2.58%
> Iteration 18: swpout inc: 203, swpout fallback inc: 24, Fallback percentage: 10.57%
> Iteration 19: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 20: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 21: swpout inc: 217, swpout fallback inc: 13, Fallback percentage: 5.65%
> Iteration 22: swpout inc: 205, swpout fallback inc: 17, Fallback percentage: 7.66%
> Iteration 23: swpout inc: 213, swpout fallback inc: 15, Fallback percentage: 6.58%
> Iteration 24: swpout inc: 234, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 25: swpout inc: 205, swpout fallback inc: 18, Fallback percentage: 8.07%
> Iteration 26: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 27: swpout inc: 219, swpout fallback inc: 6, Fallback percentage: 2.67%
> Iteration 28: swpout inc: 215, swpout fallback inc: 14, Fallback percentage: 6.11%
> Iteration 29: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 30: swpout inc: 208, swpout fallback inc: 13, Fallback percentage: 5.88%
> Iteration 31: swpout inc: 219, swpout fallback inc: 6, Fallback percentage: 2.67%
> Iteration 32: swpout inc: 216, swpout fallback inc: 7, Fallback percentage: 3.14%
> Iteration 33: swpout inc: 201, swpout fallback inc: 28, Fallback percentage: 12.23%
> Iteration 34: swpout inc: 232, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 35: swpout inc: 215, swpout fallback inc: 17, Fallback percentage: 7.33%
> Iteration 36: swpout inc: 209, swpout fallback inc: 16, Fallback percentage: 7.11%
> Iteration 37: swpout inc: 202, swpout fallback inc: 29, Fallback percentage: 12.55%
> Iteration 38: swpout inc: 200, swpout fallback inc: 18, Fallback percentage: 8.26%
> Iteration 39: swpout inc: 219, swpout fallback inc: 12, Fallback percentage: 5.19%
> Iteration 40: swpout inc: 218, swpout fallback inc: 9, Fallback percentage: 3.96%
> Iteration 41: swpout inc: 212, swpout fallback inc: 14, Fallback percentage: 6.19%
> Iteration 42: swpout inc: 204, swpout fallback inc: 15, Fallback percentage: 6.85%
> Iteration 43: swpout inc: 222, swpout fallback inc: 5, Fallback percentage: 2.20%
> Iteration 44: swpout inc: 205, swpout fallback inc: 20, Fallback percentage: 8.89%
> Iteration 45: swpout inc: 217, swpout fallback inc: 6, Fallback percentage: 2.69%
> Iteration 46: swpout inc: 209, swpout fallback inc: 19, Fallback percentage: 8.33%
> Iteration 47: swpout inc: 205, swpout fallback inc: 13, Fallback percentage: 5.96%
> Iteration 48: swpout inc: 223, swpout fallback inc: 4, Fallback percentage: 1.76%
> Iteration 49: swpout inc: 203, swpout fallback inc: 21, Fallback percentage: 9.38%
> Iteration 50: swpout inc: 193, swpout fallback inc: 19, Fallback percentage: 8.96%
> Iteration 51: swpout inc: 197, swpout fallback inc: 29, Fallback percentage: 12.83%
> Iteration 52: swpout inc: 195, swpout fallback inc: 29, Fallback percentage: 12.95%
> Iteration 53: swpout inc: 217, swpout fallback inc: 17, Fallback percentage: 7.26%
> Iteration 54: swpout inc: 207, swpout fallback inc: 11, Fallback percentage: 5.05%
> Iteration 55: swpout inc: 213, swpout fallback inc: 10, Fallback percentage: 4.48%
> Iteration 56: swpout inc: 203, swpout fallback inc: 23, Fallback percentage: 10.18%
> Iteration 57: swpout inc: 197, swpout fallback inc: 34, Fallback percentage: 14.72%
> Iteration 58: swpout inc: 209, swpout fallback inc: 13, Fallback percentage: 5.86%
> Iteration 59: swpout inc: 212, swpout fallback inc: 19, Fallback percentage: 8.23%
> Iteration 60: swpout inc: 196, swpout fallback inc: 24, Fallback percentage: 10.91%
> Iteration 61: swpout inc: 203, swpout fallback inc: 13, Fallback percentage: 6.02%
> Iteration 62: swpout inc: 221, swpout fallback inc: 7, Fallback percentage: 3.07%
> Iteration 63: swpout inc: 207, swpout fallback inc: 17, Fallback percentage: 7.59%
> Iteration 64: swpout inc: 205, swpout fallback inc: 15, Fallback percentage: 6.82%
> Iteration 65: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
> Iteration 66: swpout inc: 215, swpout fallback inc: 13, Fallback percentage: 5.70%
> Iteration 67: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 68: swpout inc: 215, swpout fallback inc: 8, Fallback percentage: 3.59%
> Iteration 69: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 70: swpout inc: 204, swpout fallback inc: 17, Fallback percentage: 7.69%
> Iteration 71: swpout inc: 227, swpout fallback inc: 6, Fallback percentage: 2.58%
> Iteration 72: swpout inc: 207, swpout fallback inc: 16, Fallback percentage: 7.17%
> Iteration 73: swpout inc: 217, swpout fallback inc: 9, Fallback percentage: 3.98%
> Iteration 74: swpout inc: 206, swpout fallback inc: 9, Fallback percentage: 4.19%
> Iteration 75: swpout inc: 193, swpout fallback inc: 26, Fallback percentage: 11.87%
> Iteration 76: swpout inc: 225, swpout fallback inc: 3, Fallback percentage: 1.32%
> Iteration 77: swpout inc: 205, swpout fallback inc: 25, Fallback percentage: 10.87%
> Iteration 78: swpout inc: 213, swpout fallback inc: 12, Fallback percentage: 5.33%
> Iteration 79: swpout inc: 212, swpout fallback inc: 10, Fallback percentage: 4.50%
> Iteration 80: swpout inc: 210, swpout fallback inc: 9, Fallback percentage: 4.11%
> Iteration 81: swpout inc: 225, swpout fallback inc: 4, Fallback percentage: 1.75%
> Iteration 82: swpout inc: 211, swpout fallback inc: 3, Fallback percentage: 1.40%
> Iteration 83: swpout inc: 216, swpout fallback inc: 10, Fallback percentage: 4.42%
> Iteration 84: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 85: swpout inc: 213, swpout fallback inc: 13, Fallback percentage: 5.75%
> Iteration 86: swpout inc: 225, swpout fallback inc: 3, Fallback percentage: 1.32%
> Iteration 87: swpout inc: 204, swpout fallback inc: 22, Fallback percentage: 9.73%
> Iteration 88: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 89: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 90: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
> Iteration 91: swpout inc: 212, swpout fallback inc: 13, Fallback percentage: 5.78%
> Iteration 92: swpout inc: 207, swpout fallback inc: 18, Fallback percentage: 8.00%
> Iteration 93: swpout inc: 209, swpout fallback inc: 25, Fallback percentage: 10.68%
> Iteration 94: swpout inc: 213, swpout fallback inc: 13, Fallback percentage: 5.75%
> Iteration 95: swpout inc: 206, swpout fallback inc: 18, Fallback percentage: 8.04%
> Iteration 96: swpout inc: 206, swpout fallback inc: 17, Fallback percentage: 7.62%
> Iteration 97: swpout inc: 216, swpout fallback inc: 11, Fallback percentage: 4.85%
> Iteration 98: swpout inc: 210, swpout fallback inc: 13, Fallback percentage: 5.83%
> Iteration 99: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 100: swpout inc: 205, swpout fallback inc: 21, Fallback percentage: 9.29%
> ...
>
> mthp swpout fallback ratio is stable and low in 100 iterations!!!
> Though the number is very good, I wonder why it is not 0% since 64MB is larger
> than 60MB? Chris, do you have any idea?
>
> Test D. with small folios
> $ /home/barry/develop/linux/mthp_swpout_test -s
>
> [ 1013.535798] ------------[ cut here ]------------
> [ 1013.538886] expecting order 4 got 0

This warning means there is a bug somewhere in this series that I need to
hunt down. V1 has the same warning check, but I haven't heard of it being
triggered there, so this is something new in V2.

Andrew, please consider removing the series from mm-unstable until I
resolve this warning.


> [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> [ 1013.544460] Modules linked in:
> [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> [ 1013.548263] sp : ffff8000863e32e0
> [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> [ 1013.557627] Call trace:
> [ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
> [ 1013.558498]  get_swap_pages+0x23c/0xc20
> [ 1013.558899]  folio_alloc_swap+0x5c/0x248
> [ 1013.559544]  add_to_swap+0x40/0xf0
> [ 1013.559904]  shrink_folio_list+0x6dc/0xf20
> [ 1013.560289]  reclaim_folio_list+0x8c/0x168
> [ 1013.560710]  reclaim_pages+0xfc/0x178
> [ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> [ 1013.561524]  walk_pgd_range+0x390/0x808
> [ 1013.561920]  __walk_page_range+0x1e0/0x1f0
> [ 1013.562370]  walk_page_range+0x1f0/0x2c8
> [ 1013.562888]  madvise_pageout+0xf8/0x280
> [ 1013.563388]  madvise_vma_behavior+0x314/0xa20
> [ 1013.563982]  madvise_walk_vmas+0xc0/0x128
> [ 1013.564386]  do_madvise.part.0+0x110/0x558
> [ 1013.564792]  __arm64_sys_madvise+0x68/0x88
> [ 1013.565333]  invoke_syscall+0x50/0x128
> [ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
> [ 1013.566285]  do_el0_svc+0x28/0x40
> [ 1013.566667]  el0_svc+0x50/0x150
> [ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
> [ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
> [ 1013.568058] irq event stamp: 0
> [ 1013.568661] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> [ 1013.570167] softirqs last  enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [ 1013.571330] ---[ end trace 0000000000000000 ]---
> Iteration 1: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 7: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 8: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 9: swpout inc: 224, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 10: swpout inc: 213, swpout fallback inc: 11, Fallback percentage: 4.91%
> Iteration 11: swpout inc: 219, swpout fallback inc: 9, Fallback percentage: 3.95%
> Iteration 12: swpout inc: 207, swpout fallback inc: 20, Fallback percentage: 8.81%
> Iteration 13: swpout inc: 193, swpout fallback inc: 33, Fallback percentage: 14.60%
> Iteration 14: swpout inc: 215, swpout fallback inc: 19, Fallback percentage: 8.12%
> Iteration 15: swpout inc: 217, swpout fallback inc: 12, Fallback percentage: 5.24%
> Iteration 16: swpout inc: 207, swpout fallback inc: 15, Fallback percentage: 6.76%
> Iteration 17: swpout inc: 207, swpout fallback inc: 23, Fallback percentage: 10.00%
> Iteration 18: swpout inc: 198, swpout fallback inc: 30, Fallback percentage: 13.16%
> Iteration 19: swpout inc: 199, swpout fallback inc: 26, Fallback percentage: 11.56%
> Iteration 20: swpout inc: 197, swpout fallback inc: 27, Fallback percentage: 12.05%
> Iteration 21: swpout inc: 192, swpout fallback inc: 25, Fallback percentage: 11.52%
> Iteration 22: swpout inc: 190, swpout fallback inc: 30, Fallback percentage: 13.64%
> Iteration 23: swpout inc: 203, swpout fallback inc: 27, Fallback percentage: 11.74%
> Iteration 24: swpout inc: 197, swpout fallback inc: 32, Fallback percentage: 13.97%
> Iteration 25: swpout inc: 184, swpout fallback inc: 41, Fallback percentage: 18.22%
> Iteration 26: swpout inc: 203, swpout fallback inc: 28, Fallback percentage: 12.12%
> Iteration 27: swpout inc: 193, swpout fallback inc: 31, Fallback percentage: 13.84%
> Iteration 28: swpout inc: 191, swpout fallback inc: 43, Fallback percentage: 18.38%
> Iteration 29: swpout inc: 194, swpout fallback inc: 31, Fallback percentage: 13.78%
> Iteration 30: swpout inc: 180, swpout fallback inc: 50, Fallback percentage: 21.74%
> Iteration 31: swpout inc: 205, swpout fallback inc: 22, Fallback percentage: 9.69%
> Iteration 32: swpout inc: 199, swpout fallback inc: 24, Fallback percentage: 10.76%
> Iteration 33: swpout inc: 192, swpout fallback inc: 34, Fallback percentage: 15.04%
> Iteration 34: swpout inc: 186, swpout fallback inc: 38, Fallback percentage: 16.96%
> Iteration 35: swpout inc: 190, swpout fallback inc: 32, Fallback percentage: 14.41%
> Iteration 36: swpout inc: 181, swpout fallback inc: 41, Fallback percentage: 18.47%
> Iteration 37: swpout inc: 181, swpout fallback inc: 47, Fallback percentage: 20.61%
> Iteration 38: swpout inc: 173, swpout fallback inc: 45, Fallback percentage: 20.64%
> Iteration 39: swpout inc: 196, swpout fallback inc: 27, Fallback percentage: 12.11%
> Iteration 40: swpout inc: 195, swpout fallback inc: 27, Fallback percentage: 12.16%
> Iteration 41: swpout inc: 195, swpout fallback inc: 31, Fallback percentage: 13.72%
> Iteration 42: swpout inc: 189, swpout fallback inc: 34, Fallback percentage: 15.25%
> Iteration 43: swpout inc: 185, swpout fallback inc: 41, Fallback percentage: 18.14%
> Iteration 44: swpout inc: 189, swpout fallback inc: 34, Fallback percentage: 15.25%
> Iteration 45: swpout inc: 177, swpout fallback inc: 49, Fallback percentage: 21.68%
> Iteration 46: swpout inc: 193, swpout fallback inc: 36, Fallback percentage: 15.72%
> Iteration 47: swpout inc: 197, swpout fallback inc: 30, Fallback percentage: 13.22%
> Iteration 48: swpout inc: 188, swpout fallback inc: 24, Fallback percentage: 11.32%
> Iteration 49: swpout inc: 187, swpout fallback inc: 29, Fallback percentage: 13.43%
> Iteration 50: swpout inc: 181, swpout fallback inc: 48, Fallback percentage: 20.96%
> Iteration 51: swpout inc: 191, swpout fallback inc: 28, Fallback percentage: 12.79%
> Iteration 52: swpout inc: 184, swpout fallback inc: 43, Fallback percentage: 18.94%
> Iteration 53: swpout inc: 184, swpout fallback inc: 44, Fallback percentage: 19.30%
> Iteration 54: swpout inc: 173, swpout fallback inc: 49, Fallback percentage: 22.07%
> Iteration 55: swpout inc: 170, swpout fallback inc: 47, Fallback percentage: 21.66%
> Iteration 56: swpout inc: 185, swpout fallback inc: 43, Fallback percentage: 18.86%
> Iteration 57: swpout inc: 178, swpout fallback inc: 55, Fallback percentage: 23.61%
> Iteration 58: swpout inc: 178, swpout fallback inc: 50, Fallback percentage: 21.93%
> Iteration 59: swpout inc: 181, swpout fallback inc: 45, Fallback percentage: 19.91%
> Iteration 60: swpout inc: 180, swpout fallback inc: 45, Fallback percentage: 20.00%
> Iteration 61: swpout inc: 172, swpout fallback inc: 56, Fallback percentage: 24.56%
> Iteration 62: swpout inc: 184, swpout fallback inc: 44, Fallback percentage: 19.30%
> Iteration 63: swpout inc: 174, swpout fallback inc: 42, Fallback percentage: 19.44%
> Iteration 64: swpout inc: 166, swpout fallback inc: 51, Fallback percentage: 23.50%
> Iteration 65: swpout inc: 172, swpout fallback inc: 57, Fallback percentage: 24.89%
> Iteration 66: swpout inc: 180, swpout fallback inc: 40, Fallback percentage: 18.18%
> Iteration 67: swpout inc: 173, swpout fallback inc: 63, Fallback percentage: 26.69%
> Iteration 68: swpout inc: 186, swpout fallback inc: 43, Fallback percentage: 18.78%
> Iteration 69: swpout inc: 175, swpout fallback inc: 53, Fallback percentage: 23.25%
> Iteration 70: swpout inc: 170, swpout fallback inc: 54, Fallback percentage: 24.11%
> Iteration 71: swpout inc: 166, swpout fallback inc: 62, Fallback percentage: 27.19%
> Iteration 72: swpout inc: 169, swpout fallback inc: 54, Fallback percentage: 24.22%
> Iteration 73: swpout inc: 175, swpout fallback inc: 50, Fallback percentage: 22.22%
> Iteration 74: swpout inc: 160, swpout fallback inc: 60, Fallback percentage: 27.27%
> Iteration 75: swpout inc: 173, swpout fallback inc: 45, Fallback percentage: 20.64%
> Iteration 76: swpout inc: 172, swpout fallback inc: 61, Fallback percentage: 26.18%
> Iteration 77: swpout inc: 173, swpout fallback inc: 50, Fallback percentage: 22.42%
> Iteration 78: swpout inc: 160, swpout fallback inc: 65, Fallback percentage: 28.89%
> Iteration 79: swpout inc: 165, swpout fallback inc: 61, Fallback percentage: 26.99%
> Iteration 80: swpout inc: 183, swpout fallback inc: 43, Fallback percentage: 19.03%
> Iteration 81: swpout inc: 206, swpout fallback inc: 22, Fallback percentage: 9.65%
> Iteration 82: swpout inc: 176, swpout fallback inc: 49, Fallback percentage: 21.78%
> Iteration 83: swpout inc: 184, swpout fallback inc: 45, Fallback percentage: 19.65%
> Iteration 84: swpout inc: 181, swpout fallback inc: 45, Fallback percentage: 19.91%
> Iteration 85: swpout inc: 175, swpout fallback inc: 56, Fallback percentage: 24.24%
> Iteration 86: swpout inc: 157, swpout fallback inc: 59, Fallback percentage: 27.31%
> Iteration 87: swpout inc: 171, swpout fallback inc: 54, Fallback percentage: 24.00%
> Iteration 88: swpout inc: 189, swpout fallback inc: 34, Fallback percentage: 15.25%
> Iteration 89: swpout inc: 185, swpout fallback inc: 45, Fallback percentage: 19.57%
> Iteration 90: swpout inc: 173, swpout fallback inc: 49, Fallback percentage: 22.07%
> Iteration 91: swpout inc: 170, swpout fallback inc: 58, Fallback percentage: 25.44%
> Iteration 92: swpout inc: 184, swpout fallback inc: 44, Fallback percentage: 19.30%
> Iteration 93: swpout inc: 193, swpout fallback inc: 37, Fallback percentage: 16.09%
> Iteration 94: swpout inc: 181, swpout fallback inc: 38, Fallback percentage: 17.35%
> Iteration 95: swpout inc: 205, swpout fallback inc: 25, Fallback percentage: 10.87%
> Iteration 96: swpout inc: 164, swpout fallback inc: 49, Fallback percentage: 23.00%
> Iteration 97: swpout inc: 158, swpout fallback inc: 65, Fallback percentage: 29.15%
> Iteration 98: swpout inc: 168, swpout fallback inc: 57, Fallback percentage: 25.33%
> Iteration 99: swpout inc: 163, swpout fallback inc: 56, Fallback percentage: 25.57%
> Iteration 100: swpout inc: 180, swpout fallback inc: 44, Fallback percentage: 19.64%
>
> It is getting worse than test C but still way better than test
> A and B.
>
> I previously observed a 100% fallback ratio in test v1 on an actual phone, especially
> when order-0 and mthp were combined within one zRAM. This likely triggered a scenario
> where order-0 had to scan swap to locate free slots, resulting in fragmentation
> across all clusters.
>
> Not quite sure if this still happens in v2. Will arrange a phone test next week. If
> yes, I am still eager to see some approach to prevent order-0 from spreading across
> all clusters.
>
> BTW, Chris,
>
> Is the warning "expecting order 4 got 0" normal in the above test?

Not normal. That means there is a bug somewhere that I introduced in V2.

Chris

>
> >
> > Yep, I alter changelogs all the time.
> >
> > > >
> > > > I'll add this into mm-unstable now for some exposure, but at this point
> > > > I'm not able to determine whether it should go in as a hotfix for
> > > > 6.10-rcX.
> > >
> > > Maybe not need to be a hotfix. Not all Barry's mTHP swap out and swap
> > > in patch got merged yet.
>
> This could be a hotfix, considering that swapping out an mTHP is slower than
> swapping nr_pages small folios, given the overhead of splitting the folio
> if we have to fall back. Ryan had the regression data before [1]:
>
> "
> | alloc size |                baseline |           + this series |
> |            | mm-unstable (~v6.9-rc1) |                         |
> |:-----------|------------------------:|------------------------:|
> | 4K Page    |                    0.0% |                    1.3% |
> | 64K THP    |                  -13.6% |                   46.3% |
> | 2M THP     |                   91.4% |                   89.6% |
>
> So with this change, the 64K swap performance goes from a 14% regression to a
> 46% improvement. While 2M shows a small regression I'm confident that this is
> just noise."
>
> Ryan reported a 14% regression if an mTHP cannot be swapped out as a whole,
> compared to only using small folios.
>
> [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
>
> >
> > OK, well please let's give appropriate consideration to what we should
> > add to 6.10-rcX in order to have this feature working well.
>
> Thanks
> Barry
>
>



* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-17 18:34         ` Chris Li
@ 2024-06-17 23:00           ` Hugh Dickins
  2024-06-17 23:47             ` Chris Li
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2024-06-17 23:00 UTC (permalink / raw)
  To: Chris Li
  Cc: Barry Song, akpm, baohua, kaleshsingh, kasong, linux-kernel,
	linux-mm, ryan.roberts, ying.huang


On Mon, 17 Jun 2024, Chris Li wrote:
> On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
> > >
> > > > > I'm having trouble understanding the overall impact of this on users.
> > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > operate OK?
> > > >
> > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > > in is relatively new, it is not really a regression.
> > >
> > > Sure, but it's pretty bad to merge a new feature only to have it
> > > ineffective after a few hours use.
....
> > >
> > $ /home/barry/develop/linux/mthp_swpout_test -s
> >
> > [ 1013.535798] ------------[ cut here ]------------
> > [ 1013.538886] expecting order 4 got 0
> 
> This warning means there is a bug in this series somewhere I need to hunt down.
> The V1 has the same warning but I haven't heard it get triggered in
> V1, it is something new in V2.
> 
> Andrew, please consider removing the series from mm-unstable until I
> resolve this warning assert.

Agreed: I was glad to see it go into mm-unstable last week, that made
it easier to include in testing (or harder to avoid!), but my conclusion
is that it's not ready yet (and certainly not suitable for 6.10 hotfix).

I too saw this "expecting order 4 got 0" once-warning every boot (from
ordinary page reclaim rather than from madvise_pageout shown below),
shortly after starting my tmpfs swapping load. But I never saw any bad
effect immediately after it: actual crashes came a few minutes later.

(And I'm not seeing the warning at all now, with the change I made: that
doesn't tell us much, since what I have leaves out 2/2 entirely; but it
does suggest that it's more important to follow up the crashes, and
maybe when they are satisfactorily fixed, the warning will be fixed too.)

Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx)
in alloc_cluster(). And when I poked around, it was usually (always?)
the case that si->free_clusters was empty, so list_first_entry() not
good at all. A few other crashes were GPFs, but I didn't pay much
attention to them, thinking the alloc_cluster() one best to pursue.
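
To illustrate why an empty list matters here: list_first_entry() does not
check for emptiness, so on an empty list it applies container_of() to the
list head itself and returns a pointer that is not a cluster at all. The
following is a userspace sketch only, with cut-down stand-ins for the kernel
list helpers and a hypothetical "struct cluster"; it is not the swapfile.c
code, just the guarded pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel list helpers, for illustration only. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;	/* the kernel poisons these instead */
}

/* Hypothetical, cut-down cluster; the real struct has more fields. */
struct cluster {
	unsigned int count;
	struct list_head list;
};

/*
 * Take the first free cluster, or return NULL when the list is empty.
 * Without the list_empty() check, container_of() would be applied to the
 * list head and the index check below could not hold.
 */
static struct cluster *take_first_free(struct list_head *free_clusters,
				       struct cluster *base, size_t nr)
{
	struct cluster *ci;

	if (list_empty(free_clusters))
		return NULL;

	ci = container_of(free_clusters->next, struct cluster, list);
	/* Userspace analogue of checking that ci lies inside cluster_info[]. */
	assert(ci >= base && ci < base + nr);
	list_del(&ci->list);
	ci->count = 0;
	return ci;
}
```

In kernel code the same guard is usually written with list_empty() before
list_first_entry(), or with list_first_entry_or_null().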

I reverted both patches from mm-everything, and had no problem.
I added back 1/2 expecting it to be harmless ("no real function
change in this patch"), but was surprised to get those same
"expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs:
so have spent most time trying to get 1/2 working.

This patch on top of 1/2, restoring when cluster_is_free(ci) can
be seen to change, appears to have eliminated all those problems:

--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
 
 	VM_BUG_ON(ci - si->cluster_info != idx);
 	list_del(&ci->list);
+	ci->state = CLUSTER_STATE_PER_CPU;
 	ci->count = 0;
 	return ci;
 }
@@ -543,10 +544,6 @@ new_cluster:
 	if (tmp == SWAP_NEXT_INVALID) {
 		if (!list_empty(&si->free_clusters)) {
 			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
-			list_del(&ci->list);
-			spin_lock(&ci->lock);
-			ci->state = CLUSTER_STATE_PER_CPU;
-			spin_unlock(&ci->lock);
 			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
 		} else if (!list_empty(&si->discard_clusters)) {
 			/*

Delighted to have made progress after many attempts, I went to apply 2/2
on top, but found that it builds upon those scan_swap_map_try_ssd_cluster()
changes I've undone. I gave up at that point and hand back to you, Chris,
hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc
much better than I ever shall!
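
One way to read the patch above, sketched in userspace under my own
assumptions: the scan path only peeks at the first free cluster to compute an
offset, and a single helper is the one place where the cluster leaves the
free list and changes state. All names below (struct swap_dev, STATE_FREE,
peek_free_idx, sketch_alloc_cluster) are hypothetical stand-ins, not the
kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

enum cluster_state { STATE_FREE, STATE_PER_CPU };	/* hypothetical names */

struct cluster {
	enum cluster_state state;
	unsigned int count;
	struct list_head list;
};

struct swap_dev {
	struct cluster *cluster_info;		/* array of all clusters */
	struct list_head free_clusters;
};

/* Peek: index of the first free cluster, or -1. Changes nothing. */
static long peek_free_idx(const struct swap_dev *si)
{
	struct cluster *ci;

	if (list_empty(&si->free_clusters))
		return -1;
	ci = container_of(si->free_clusters.next, struct cluster, list);
	return (long)(ci - si->cluster_info);
}

/* Take: the only place where list membership and state change together. */
static struct cluster *sketch_alloc_cluster(struct swap_dev *si, long idx)
{
	struct cluster *ci;

	assert(!list_empty(&si->free_clusters));
	ci = container_of(si->free_clusters.next, struct cluster, list);
	assert(ci - si->cluster_info == idx);	/* peeked cluster is still first */
	list_del(&ci->list);
	ci->state = STATE_PER_CPU;
	ci->count = 0;
	return ci;
}
```

With this split, peeking twice returns the same index, and the index check in
the take step holds because nothing removed the cluster in between.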

Clarifications on my load: all swapping to SSD, but discard not enabled;
/sys/kernel/mm/transparent_hugepage/ enabled always, shmem_enabled force,
hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always;
swapoff between iterations, did not appear relevant to problems; x86_64.

Hugh

> 
> > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.544460] Modules linked in:
> > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> > [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.548263] sp : ffff8000863e32e0
> > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> > [ 1013.557627] Call trace:
> > [ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
> > [ 1013.558498]  get_swap_pages+0x23c/0xc20
> > [ 1013.558899]  folio_alloc_swap+0x5c/0x248
> > [ 1013.559544]  add_to_swap+0x40/0xf0
> > [ 1013.559904]  shrink_folio_list+0x6dc/0xf20
> > [ 1013.560289]  reclaim_folio_list+0x8c/0x168
> > [ 1013.560710]  reclaim_pages+0xfc/0x178
> > [ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> > [ 1013.561524]  walk_pgd_range+0x390/0x808
> > [ 1013.561920]  __walk_page_range+0x1e0/0x1f0
> > [ 1013.562370]  walk_page_range+0x1f0/0x2c8
> > [ 1013.562888]  madvise_pageout+0xf8/0x280
> > [ 1013.563388]  madvise_vma_behavior+0x314/0xa20
> > [ 1013.563982]  madvise_walk_vmas+0xc0/0x128
> > [ 1013.564386]  do_madvise.part.0+0x110/0x558
> > [ 1013.564792]  __arm64_sys_madvise+0x68/0x88
> > [ 1013.565333]  invoke_syscall+0x50/0x128
> > [ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
> > [ 1013.566285]  do_el0_svc+0x28/0x40
> > [ 1013.566667]  el0_svc+0x50/0x150
> > [ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
> > [ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
> > [ 1013.568058] irq event stamp: 0
> > [ 1013.568661] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> > [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > [ 1013.570167] softirqs last  enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [ 1013.571330] ---[ end trace 0000000000000000 ]---


* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-17 23:00           ` Hugh Dickins
@ 2024-06-17 23:47             ` Chris Li
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Li @ 2024-06-17 23:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Barry Song, akpm, baohua, kaleshsingh, kasong, linux-kernel,
	linux-mm, ryan.roberts, ying.huang

On Mon, Jun 17, 2024 at 4:00 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 17 Jun 2024, Chris Li wrote:
> > On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> > > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > > > I'm having trouble understanding the overall impact of this on users.
> > > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > > operate OK?
> > > > >
> > > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > > > in is relatively new, it is not really a regression.
> > > >
> > > > Sure, but it's pretty bad to merge a new feature only to have it
> > > > ineffective after a few hours use.
> ....
> > > >
> > > $ /home/barry/develop/linux/mthp_swpout_test -s
> > >
> > > [ 1013.535798] ------------[ cut here ]------------
> > > [ 1013.538886] expecting order 4 got 0
> >
> > This warning means there is a bug in this series somewhere I need to hunt down.
> > The V1 has the same warning but I haven't heard it get triggered in
> > V1, it is something new in V2.
> >
> > Andrew, please consider removing the series from mm-unstable until I
> > resolve this warning assert.
>
> Agreed: I was glad to see it go into mm-unstable last week, that made
> it easier to include in testing (or harder to avoid!), but my conclusion
> is that it's not ready yet (and certainly not suitable for 6.10 hotfix).
>
> I too saw this "expecting order 4 got 0" once-warning every boot (from
> ordinary page reclaim rather than from madvise_pageout shown below),
> shortly after starting my tmpfs swapping load. But I never saw any bad
> effect immediately after it: actual crashes came a few minutes later.
>
> (And I'm not seeing the warning at all now, with the change I made: that
> doesn't tell us much, since what I have leaves out 2/2 entirely; but it
> does suggest that it's more important to follow up the crashes, and
> maybe when they are satisfactorily fixed, the warning will be fixed too.)
>
> Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx)
> in alloc_cluster(). And when I poked around, it was usually (always?)
> the case that si->free_clusters was empty, so list_first_entry() not
> good at all. A few other crashes were GPFs, but I didn't pay much
> attention to them, thinking the alloc_cluster() one best to pursue.
>
> I reverted both patches from mm-everything, and had no problem.
> I added back 1/2 expecting it to be harmless ("no real function
> change in this patch"), but was surprised to get those same
> "expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs:
> so have spent most time trying to get 1/2 working.
>
> This patch on top of 1/2, restoring when cluster_is_free(ci) can
> be seen to change, appears to have eliminated all those problems:
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
>
>         VM_BUG_ON(ci - si->cluster_info != idx);
>         list_del(&ci->list);
> +       ci->state = CLUSTER_STATE_PER_CPU;
>         ci->count = 0;
>         return ci;
>  }
> @@ -543,10 +544,6 @@ new_cluster:
>         if (tmp == SWAP_NEXT_INVALID) {
>                 if (!list_empty(&si->free_clusters)) {
>                         ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> -                       list_del(&ci->list);
> -                       spin_lock(&ci->lock);
> -                       ci->state = CLUSTER_STATE_PER_CPU;
> -                       spin_unlock(&ci->lock);
>                         tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>                 } else if (!list_empty(&si->discard_clusters)) {
>                         /*
>

Thanks for the nice bug report. That is my bad.

Both you and Ying pointed out the critical bug here: try_ssd() removes
the cluster from the free list, and when conflict() then fails and
alloc_cluster() runs, allocating from that cluster can remove the same
cluster from the list a second time. That is a path I had not
considered well.
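
A minimal userspace sketch of that mismatch, under stated assumptions:
the list helpers below are simplified stand-ins for the kernel's
list_head API, and try_ssd_pick() / alloc_cluster_would_bug() are
hypothetical names that only model the two steps described above, not
the real mm/swapfile.c paths. It shows that once the first step has
detached the head cluster, a later check that expects that same cluster
at the head of the free list no longer holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;	/* the kernel poisons these pointers instead */
}

#define NR_CLUSTERS 3

struct cluster { struct list_head list; };

struct swap_dev {
	struct cluster ci[NR_CLUSTERS];
	struct list_head free_clusters;
};

/* First cluster on the free list; only valid on a non-empty list. */
static struct cluster *first_free(struct swap_dev *d)
{
	assert(d->free_clusters.next != &d->free_clusters);
	return (struct cluster *)((char *)d->free_clusters.next -
				  offsetof(struct cluster, list));
}

static void dev_init(struct swap_dev *d)
{
	list_init(&d->free_clusters);
	for (size_t i = 0; i < NR_CLUSTERS; i++)
		list_add_tail(&d->ci[i].list, &d->free_clusters);
}

/* Models the first step: detach the head cluster immediately. */
static size_t try_ssd_pick(struct swap_dev *d)
{
	struct cluster *ci = first_free(d);

	list_del(&ci->list);
	return (size_t)(ci - d->ci);
}

/* Models the later check: is cluster idx still at the list head? */
static bool alloc_cluster_would_bug(struct swap_dev *d, size_t idx)
{
	return (size_t)(first_free(d) - d->ci) != idx;
}

static bool demo_double_removal(void)
{
	struct swap_dev d;
	size_t idx;

	dev_init(&d);
	idx = try_ssd_pick(&d);	/* cluster 0 leaves the free list */
	/* ...a conflict is detected and the slow path runs for idx... */
	return alloc_cluster_would_bug(&d, idx); /* head is now cluster 1 */
}
```

In this model the second step sees cluster 1 at the head while it was
asked about cluster 0. If the first step had emptied the list, there
would be no valid first entry at all, which is why first_free() asserts
on a non-empty list.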

Attempting the allocation in try_ssd(), allowing for a possible
conflict, and then doing the dance in alloc_cluster() makes things
very complicated. When we hold the cluster lock in try_ssd(), can we
just perform the actual allocation with the lock held? There should be
no conflict under the cluster lock's protection, right?

Chris




> Delighted to have made progress after many attempts, I went to apply 2/2
> on top, but found that it builds upon those scan_swap_map_try_ssd_cluster()
> changes I've undone. I gave up at that point and hand back to you, Chris,
> hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc
> much better than I ever shall!


>
> Clarifications on my load: all swapping to SSD, but discard not enabled;
> /sys/kernel/mm/transparent_hugepage/ enabled always, shmem_enabled force,
> hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always;
> swapoff between iterations, did not appear relevant to problems; x86_64.
>
> Hugh
>
> >
> > > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.544460] Modules linked in:
> > > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #285
> > > [ 1013.545990] Hardware name: linux,dummy-virt (DT)
> > > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> > > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.548263] sp : ffff8000863e32e0
> > > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 0000000000000660
> > > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0000c27c4800
> > > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0000c27c48b0
> > > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: ffffffffffffffff
> > > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800084b13568
> > > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e69746365707865
> > > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff80008014c874
> > > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 0000000000002850
> > > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 0000000000000000
> > > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c24a1bc0
> > > [ 1013.557627] Call trace:
> > > [ 1013.557960]  scan_swap_map_try_ssd_cluster+0x340/0x370
> > > [ 1013.558498]  get_swap_pages+0x23c/0xc20
> > > [ 1013.558899]  folio_alloc_swap+0x5c/0x248
> > > [ 1013.559544]  add_to_swap+0x40/0xf0
> > > [ 1013.559904]  shrink_folio_list+0x6dc/0xf20
> > > [ 1013.560289]  reclaim_folio_list+0x8c/0x168
> > > [ 1013.560710]  reclaim_pages+0xfc/0x178
> > > [ 1013.561079]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> > > [ 1013.561524]  walk_pgd_range+0x390/0x808
> > > [ 1013.561920]  __walk_page_range+0x1e0/0x1f0
> > > [ 1013.562370]  walk_page_range+0x1f0/0x2c8
> > > [ 1013.562888]  madvise_pageout+0xf8/0x280
> > > [ 1013.563388]  madvise_vma_behavior+0x314/0xa20
> > > [ 1013.563982]  madvise_walk_vmas+0xc0/0x128
> > > [ 1013.564386]  do_madvise.part.0+0x110/0x558
> > > [ 1013.564792]  __arm64_sys_madvise+0x68/0x88
> > > [ 1013.565333]  invoke_syscall+0x50/0x128
> > > [ 1013.565737]  el0_svc_common.constprop.0+0x48/0xf8
> > > [ 1013.566285]  do_el0_svc+0x28/0x40
> > > [ 1013.566667]  el0_svc+0x50/0x150
> > > [ 1013.567094]  el0t_64_sync_handler+0x13c/0x158
> > > [ 1013.567501]  el0t_64_sync+0x1a4/0x1a8
> > > [ 1013.568058] irq event stamp: 0
> > > [ 1013.568661] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.569560] hardirqs last disabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570167] softirqs last  enabled at (0): [<ffff8000800add44>] copy_process+0x654/0x19a8
> > > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > > [ 1013.571330] ---[ end trace 0000000000000000 ]---


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/2] mm: swap: swap cluster switch to double link list
  2024-06-17  6:19   ` Huang, Ying
@ 2024-06-18  5:06     ` Chris Li
  2024-06-18  7:54       ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Li @ 2024-06-18  5:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

On Sun, Jun 16, 2024 at 11:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Chris,
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Previously, the swap cluster used a cluster index as a pointer
> > to construct a custom single link list type "swap_cluster_list".
> > The next cluster pointer is shared with the cluster->count.
> > It prevents puting the non free cluster into a list.
> > Change the cluster to use the standard double link list instead.
> > This allows tracing the nonfull cluster in the follow up patch.
> >
> > Remove the cluster getter/setter for accessing the cluster
> > struct member.
> >
> > The list operation is protected by the swap_info_struct->lock.
> >
> > Change cluster code to use "struct swap_cluster_info *" to
> > reference the cluster rather than by using index. That is more
> > consistent with the list manipulation. It avoids the repeat
> > adding index to the cluser_info. The code is easier to understand.
> >
> > Remove the cluster next pointer is NULL flag, the double link
> > list can handle the empty list pretty well.
>
> The above is more about "what" instead of "why".  We can identify "what"
> from the patch itself.  I expect more "why".  I guess that we can reduce
> swap_map[] scanning if we have lists of non-full/non-free clusters.

In my mind, the "why" is captured by "This allows tracing the nonfull
cluster in the follow up patch."
If you are asking why we want the nonfull cluster list: it gets us to
a suitable candidate cluster of a given order more quickly than
scanning swap_map[].
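
A minimal sketch of that idea, assuming one list of partially used
clusters per allocation order. All names here (NR_ORDERS, nonfull_push(),
nonfull_pick()) are hypothetical, and a singly linked list is used only
for brevity; the series itself uses struct list_head.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS   5	/* assumed number of supported orders */
#define NR_CLUSTERS 8

struct cluster {
	struct cluster *next;	/* next nonfull cluster of the same order */
	unsigned int order;	/* order this cluster was last used for */
};

struct swap_dev {
	struct cluster ci[NR_CLUSTERS];
	struct cluster *nonfull[NR_ORDERS];	/* one list head per order */
};

/* Remember that ci still has room for allocations of this order. */
static void nonfull_push(struct swap_dev *d, struct cluster *ci,
			 unsigned int order)
{
	assert(order < NR_ORDERS);
	ci->order = order;
	ci->next = d->nonfull[order];
	d->nonfull[order] = ci;
}

/* Constant-time candidate lookup; NULL if no cluster is tracked. */
static struct cluster *nonfull_pick(struct swap_dev *d, unsigned int order)
{
	assert(order < NR_ORDERS);
	return d->nonfull[order];
}
```

The lookup cost does not depend on the size of the swap device, whereas
a swap_map[] scan has to walk entries until it finds a suitable run.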

>
> > The "swap_cluster_info" struct is two pointer bigger, because
> > 512 swap entries share one swap struct, it has very little impact
> > on the average memory usage per swap entry. For 1TB swapfile, the
> > swap cluster data structure increases from 8MB to 24MB.
> >
> > Other than the list conversion, there is no real function change
> > in this patch.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> >  include/linux/swap.h |  28 +++----
> >  mm/swapfile.c        | 227 +++++++++++++--------------------------------------
> >  2 files changed, 70 insertions(+), 185 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 3df75d62a835..cd9154a3e934 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -242,23 +242,22 @@ enum {
> >   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> >   * free clusters are organized into a list. We fetch an entry from the list to
> >   * get a free cluster.
> > - *
> > - * The data field stores next cluster if the cluster is free or cluster usage
> > - * counter otherwise. The flags field determines if a cluster is free. This is
> > - * protected by swap_info_struct.lock.
> >   */
> >  struct swap_cluster_info {
> >       spinlock_t lock;        /*
> > -                              * Protect swap_cluster_info fields
> > -                              * and swap_info_struct->swap_map
> > +                              * Protect swap_cluster_info count and state
>
> Protect swap_cluster_info fields except 'list' ?

I changed it to protect the swap_cluster_info bitfields in the second patch.
>
> > +                              * field and swap_info_struct->swap_map
> >                                * elements correspond to the swap
> >                                * cluster
> >                                */
> > -     unsigned int data:24;
> > -     unsigned int flags:8;
> > +     unsigned int count:12;
> > +     unsigned int state:3;
>
> I still prefer normal data type over bit fields.  How about
>
>         u16 usage;
>         u8  state;

I don't mind renaming "count" to "usage". That is probably a better
name. However, I have another patch that adds more bit fields to
the cluster info struct. The second patch adds "order" and a later
patch will add more. That is why I chose bitfields: to pack the bits
more densely.
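
A small sketch of the two layouts being compared. The count and state
widths are taken from the patch; the 4-bit width of "order" is an
assumption made for illustration, and set_count() is a hypothetical
helper.

```c
#include <assert.h>
#include <stdint.h>

/* Bitfield layout: further fields can share the same storage unit. */
struct cluster_bits {
	unsigned int count:12;
	unsigned int state:3;
	unsigned int order:4;	/* assumed width, for illustration only */
};

/* The suggested alternative with ordinary integer types. */
struct cluster_plain {
	uint16_t usage;
	uint8_t  state;
};

#define COUNT_MAX ((1u << 12) - 1)	/* largest value a 12-bit field holds */

static void set_count(struct cluster_bits *b, unsigned int n)
{
	assert(n <= COUNT_MAX);	/* a larger value would be silently truncated */
	b->count = n;
}
```

A 12-bit count is wide enough for a cluster of 512 entries. The trade-off
is that each new field has to be sized by hand, and the exact storage
layout of bitfields is implementation-defined.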

>
> And, how about use 'usage' instead of 'count'?  Personally I think that
> it is more clear.  But I don't have strong opinions on this.
>
> > +     struct list_head list;  /* Protected by swap_info_struct->lock */
> >  };
> > -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> > +
> > +#define CLUSTER_STATE_FREE   1 /* This cluster is free */
>
> Can we use swap_cluster_info->count == 0?

It is not as good, considering that the second patch starts to track
the state of a cluster that is on the per cpu struct. We would be
comparing both cluster->count and cluster->state.

>
> > +#define CLUSTER_STATE_PER_CPU        2 /* This cluster on per_cpu_cluster  */
> > +
>
> There's no users of this state in this patch.  IMHO, it's better to

Yes, this state is used in this patch, in the sense that if you
remove the state definition, the code will not compile because of the
assignment of CLUSTER_STATE_PER_CPU.
There is also code that tests whether a cluster's state is the free
state, which "CLUSTER_STATE_PER_CPU" is not.

> introduce a symbol with its users, otherwise, it's hard to understand
> why do we need it and how to use it.  And, IIUC, the state isn't
> maintained properly, it should be changed when we move the cluster off
> the per-cpu cluster.

I am actually following the same principle that you suggest
here. Only the second patch starts to use the off per cpu state
(SCANNED). That is why I introduce it there.

>
> >  /*
> >   * The first page in the swap file is the swap header, which is always marked
> > @@ -283,11 +282,6 @@ struct percpu_cluster {
> >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> >  };
> >
> > -struct swap_cluster_list {
> > -     struct swap_cluster_info head;
> > -     struct swap_cluster_info tail;
> > -};
> > -
> >  /*
> >   * The in-memory structure used to track swap areas.
> >   */
> > @@ -300,7 +294,7 @@ struct swap_info_struct {
> >       unsigned int    max;            /* extent of the swap_map */
> >       unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
> >       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > -     struct swap_cluster_list free_clusters; /* free clusters list */
> > +     struct list_head free_clusters; /* free clusters list */
> >       unsigned int lowest_bit;        /* index of first free in swap_map */
> >       unsigned int highest_bit;       /* index of last free in swap_map */
> >       unsigned int pages;             /* total of usable pages of swap */
> > @@ -331,7 +325,7 @@ struct swap_info_struct {
> >                                        * list.
> >                                        */
> >       struct work_struct discard_work; /* discard worker */
> > -     struct swap_cluster_list discard_clusters; /* discard clusters list */
> > +     struct list_head discard_clusters; /* discard clusters list */
> >       struct plist_node avail_lists[]; /*
> >                                          * entries in swap_avail_heads, one
> >                                          * entry per node.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9c6d8e557c0f..2f878b374349 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -290,62 +290,9 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> >  #endif
> >  #define LATENCY_LIMIT                256
> >
> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
> > -     unsigned int flag)
> > -{
> > -     info->flags = flag;
> > -}
> > -
> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> > -{
> > -     return info->data;
> > -}
> > -
> > -static inline void cluster_set_count(struct swap_cluster_info *info,
> > -                                  unsigned int c)
> > -{
> > -     info->data = c;
> > -}
> > -
> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> > -                                      unsigned int c, unsigned int f)
> > -{
> > -     info->flags = f;
> > -     info->data = c;
> > -}
> > -
> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> > -{
> > -     return info->data;
> > -}
> > -
> > -static inline void cluster_set_next(struct swap_cluster_info *info,
> > -                                 unsigned int n)
> > -{
> > -     info->data = n;
> > -}
> > -
> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> > -                                      unsigned int n, unsigned int f)
> > -{
> > -     info->flags = f;
> > -     info->data = n;
> > -}
> > -
> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
> >  {
> > -     return info->flags & CLUSTER_FLAG_FREE;
> > -}
> > -
> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
> > -{
> > -     return info->flags & CLUSTER_FLAG_NEXT_NULL;
> > -}
> > -
> > -static inline void cluster_set_null(struct swap_cluster_info *info)
> > -{
> > -     info->flags = CLUSTER_FLAG_NEXT_NULL;
> > -     info->data = 0;
> > +     return info->state == CLUSTER_STATE_FREE;
> >  }
> >
> >  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> >               spin_unlock(&si->lock);
> >  }
> >
> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> > -{
> > -     return cluster_is_null(&list->head);
> > -}
> > -
> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> > -{
> > -     return cluster_next(&list->head);
> > -}
> > -
> > -static void cluster_list_init(struct swap_cluster_list *list)
> > -{
> > -     cluster_set_null(&list->head);
> > -     cluster_set_null(&list->tail);
> > -}
> > -
> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
> > -                               struct swap_cluster_info *ci,
> > -                               unsigned int idx)
> > -{
> > -     if (cluster_list_empty(list)) {
> > -             cluster_set_next_flag(&list->head, idx, 0);
> > -             cluster_set_next_flag(&list->tail, idx, 0);
> > -     } else {
> > -             struct swap_cluster_info *ci_tail;
> > -             unsigned int tail = cluster_next(&list->tail);
> > -
> > -             /*
> > -              * Nested cluster lock, but both cluster locks are
> > -              * only acquired when we held swap_info_struct->lock
> > -              */
> > -             ci_tail = ci + tail;
> > -             spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> > -             cluster_set_next(ci_tail, idx);
> > -             spin_unlock(&ci_tail->lock);
> > -             cluster_set_next_flag(&list->tail, idx, 0);
> > -     }
> > -}
> > -
> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> > -                                        struct swap_cluster_info *ci)
> > -{
> > -     unsigned int idx;
> > -
> > -     idx = cluster_next(&list->head);
> > -     if (cluster_next(&list->tail) == idx) {
> > -             cluster_set_null(&list->head);
> > -             cluster_set_null(&list->tail);
> > -     } else
> > -             cluster_set_next_flag(&list->head,
> > -                                   cluster_next(&ci[idx]), 0);
> > -
> > -     return idx;
> > -}
> > -
> >  /* Add a cluster to discard list and schedule it to do discard */
> >  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > -             unsigned int idx)
> > +             struct swap_cluster_info *ci)
> >  {
> > +     unsigned int idx = ci - si->cluster_info;
>
> I see this multiple times in the patch, can we define a helper for this?
Ack.
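
One possible shape for such a helper, as a sketch only. The names
cluster_index() and cluster_offset() are hypothetical, the structs are
cut down to the fields needed here, and SWAPFILE_CLUSTER is assumed to
be 512 as in the patch description.

```c
#include <assert.h>
#include <stddef.h>

#define SWAPFILE_CLUSTER 512	/* swap entries per cluster (assumed) */

/* Reduced stand-ins for the kernel structures. */
struct swap_cluster_info { unsigned int count; };
struct swap_info_struct { struct swap_cluster_info *cluster_info; };

/* Index of ci within si->cluster_info[]. */
static inline unsigned int cluster_index(struct swap_info_struct *si,
					 struct swap_cluster_info *ci)
{
	assert(ci >= si->cluster_info);
	return (unsigned int)(ci - si->cluster_info);
}

/* Offset of the first swap entry that belongs to ci. */
static inline unsigned long cluster_offset(struct swap_info_struct *si,
					   struct swap_cluster_info *ci)
{
	return (unsigned long)cluster_index(si, ci) * SWAPFILE_CLUSTER;
}
```

Keeping the pointer subtraction in one place also keeps the distinction
between a cluster index and a swap offset explicit at each call site.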

>
> >       /*
> >        * If scan_swap_map_slots() can't find a free cluster, it will check
> >        * si->swap_map directly. To make sure the discarding cluster isn't
> > @@ -462,17 +355,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > -     cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> > -
> > +     list_add_tail(&ci->list, &si->discard_clusters);
> >       schedule_work(&si->discard_work);
> >  }
> >
> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > -     struct swap_cluster_info *ci = si->cluster_info;
> > -
> > -     cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> > -     cluster_list_add_tail(&si->free_clusters, ci, idx);
> > +     ci->state = CLUSTER_STATE_FREE;
> > +     list_add_tail(&ci->list, &si->free_clusters);
> >  }
> >
> >  /*
> > @@ -481,21 +371,22 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> >  */
> >  static void swap_do_scheduled_discard(struct swap_info_struct *si)
> >  {
> > -     struct swap_cluster_info *info, *ci;
> > +     struct swap_cluster_info *ci;
> >       unsigned int idx;
> >
> > -     info = si->cluster_info;
> > -
> > -     while (!cluster_list_empty(&si->discard_clusters)) {
> > -             idx = cluster_list_del_first(&si->discard_clusters, info);
> > +     while (!list_empty(&si->discard_clusters)) {
> > +             ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > +             list_del(&ci->list);
> > +             idx = ci - si->cluster_info;
> >               spin_unlock(&si->lock);
> >
> >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> >                               SWAPFILE_CLUSTER);
> >
> >               spin_lock(&si->lock);
> > -             ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > -             __free_cluster(si, idx);
> > +
> > +             spin_lock(&ci->lock);
>
> Personally, I still prefer to use lock_cluster(), which is more readable
> and matches unlock_cluster() below.

lock_cluster() takes an index, which does not match unlock_cluster(),
which takes a pointer to the cluster.
When you get the cluster from the list, you have a cluster pointer. I
feel it is unnecessary to convert it to an index and then back to a
cluster pointer inside lock_cluster(). I actually feel that using
indexes to refer to clusters is error prone because we also have offsets.


>
> > +             __free_cluster(si, ci);
> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >                               0, SWAPFILE_CLUSTER);
> >               unlock_cluster(ci);
> > @@ -521,20 +412,19 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> >       complete(&si->comp);
> >  }
> >
> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> >  {
> > -     struct swap_cluster_info *ci = si->cluster_info;
> > +     struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >
> > -     VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> > -     cluster_list_del_first(&si->free_clusters, ci);
> > -     cluster_set_count_flag(ci + idx, 0, 0);
> > +     VM_BUG_ON(ci - si->cluster_info != idx);
> > +     list_del(&ci->list);
> > +     ci->count = 0;
>
> Do we need this now?  If we keep CLUSTER_STATE_FREE, we need to change
> it here.

Good catch, thank you. I now realize this is actually problematic and
tricky to get right. Let me work on that.

>
> > +     return ci;
> >  }
> >
> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > -     struct swap_cluster_info *ci = si->cluster_info + idx;
> > -
> > -     VM_BUG_ON(cluster_count(ci) != 0);
> > +     VM_BUG_ON(ci->count != 0);
> >       /*
> >        * If the swap is discardable, prepare discard the cluster
> >        * instead of free it immediately. The cluster will be freed
> > @@ -542,11 +432,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> >        */
> >       if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> >           (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> > -             swap_cluster_schedule_discard(si, idx);
> > +             swap_cluster_schedule_discard(si, ci);
> >               return;
> >       }
> >
> > -     __free_cluster(si, idx);
> > +     __free_cluster(si, ci);
> >  }
> >
> >  /*
> > @@ -559,15 +449,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> >       unsigned long count)
> >  {
> >       unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > +     struct swap_cluster_info *ci = cluster_info + idx;
> >
> >       if (!cluster_info)
> >               return;
> > -     if (cluster_is_free(&cluster_info[idx]))
> > +     if (cluster_is_free(ci))
> >               alloc_cluster(p, idx);
> >
> > -     VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> > -     cluster_set_count(&cluster_info[idx],
> > -             cluster_count(&cluster_info[idx]) + count);
> > +     VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> > +     ci->count += count;
> >  }
> >
> >  /*
> > @@ -581,24 +471,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> >  }
> >
> >  /*
> > - * The cluster corresponding to page_nr decreases one usage. If the usage
> > - * counter becomes 0, which means no page in the cluster is in using, we can
> > - * optionally discard the cluster and add it to free cluster list.
> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * which means no page in the cluster is in using, we can optionally discard
> > + * the cluster and add it to free cluster list.
> >   */
> > -static void dec_cluster_info_page(struct swap_info_struct *p,
> > -     struct swap_cluster_info *cluster_info, unsigned long page_nr)
> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> >  {
> > -     unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > -
> > -     if (!cluster_info)
> > +     if (!p->cluster_info)
> >               return;
> >
> > -     VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> > -     cluster_set_count(&cluster_info[idx],
> > -             cluster_count(&cluster_info[idx]) - 1);
> > +     VM_BUG_ON(ci->count == 0);
> > +     ci->count--;
> >
> > -     if (cluster_count(&cluster_info[idx]) == 0)
> > -             free_cluster(p, idx);
> > +     if (!ci->count)
> > +             free_cluster(p, ci);
> >  }
> >
> >  /*
> > @@ -611,10 +497,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> >  {
> >       struct percpu_cluster *percpu_cluster;
> >       bool conflict;
> > -
>
> Usually we use one blank line after local variable declaration.
Ack.

>
> > +     struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >       offset /= SWAPFILE_CLUSTER;
> > -     conflict = !cluster_list_empty(&si->free_clusters) &&
> > -             offset != cluster_list_first(&si->free_clusters) &&
> > +     conflict = !list_empty(&si->free_clusters) &&
> > +             offset !=  first - si->cluster_info &&
> >               cluster_is_free(&si->cluster_info[offset]);
> >
> >       if (!conflict)
> > @@ -655,10 +541,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >       cluster = this_cpu_ptr(si->percpu_cluster);
> >       tmp = cluster->next[order];
> >       if (tmp == SWAP_NEXT_INVALID) {
> > -             if (!cluster_list_empty(&si->free_clusters)) {
> > -                     tmp = cluster_next(&si->free_clusters.head) *
> > -                                     SWAPFILE_CLUSTER;
> > -             } else if (!cluster_list_empty(&si->discard_clusters)) {
> > +             if (!list_empty(&si->free_clusters)) {
> > +                     ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > +                     list_del(&ci->list);
>
> The free cluster is deleted from si->free_clusters now.  But later you
> will call scan_swap_map_ssd_cluster_conflict() and may abandon the
> cluster.  And in alloc_cluster() later, it may be deleted again.

Yes, that is a bug. Thanks for catching that.

>
> > +                     spin_lock(&ci->lock);
> > +                     ci->state = CLUSTER_STATE_PER_CPU;
>
> Need to change ci->state when move a cluster off the percpu_cluster.

In the next patch. This patch does not use the off state yet.

>
> > +                     spin_unlock(&ci->lock);
> > +                     tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> > +             } else if (!list_empty(&si->discard_clusters)) {
> >                       /*
> >                        * we don't have free cluster but have some clusters in
> >                        * discarding, do discard now and reclaim them, then
> > @@ -1062,8 +952,8 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >
> >       ci = lock_cluster(si, offset);
> >       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > -     cluster_set_count_flag(ci, 0, 0);
> > -     free_cluster(si, idx);
> > +     ci->count = 0;
> > +     free_cluster(si, ci);
> >       unlock_cluster(ci);
> >       swap_range_free(si, offset, SWAPFILE_CLUSTER);
> >  }
> > @@ -1336,7 +1226,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> >       count = p->swap_map[offset];
> >       VM_BUG_ON(count != SWAP_HAS_CACHE);
> >       p->swap_map[offset] = 0;
> > -     dec_cluster_info_page(p, p->cluster_info, offset);
> > +     dec_cluster_info_page(p, ci);
> >       unlock_cluster(ci);
> >
> >       mem_cgroup_uncharge_swap(entry, 1);
> > @@ -3003,8 +2893,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >
> >       nr_good_pages = maxpages - 1;   /* omit header page */
> >
> > -     cluster_list_init(&p->free_clusters);
> > -     cluster_list_init(&p->discard_clusters);
> > +     INIT_LIST_HEAD(&p->free_clusters);
> > +     INIT_LIST_HEAD(&p->discard_clusters);
> >
> >       for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >               unsigned int page_nr = swap_header->info.badpages[i];
> > @@ -3055,14 +2945,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >       for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> >               j = (k + col) % SWAP_CLUSTER_COLS;
> >               for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> > +                     struct swap_cluster_info *ci;
> >                       idx = i * SWAP_CLUSTER_COLS + j;
> > +                     ci = cluster_info + idx;
> >                       if (idx >= nr_clusters)
> >                               continue;
> > -                     if (cluster_count(&cluster_info[idx]))
> > +                     if (ci->count)
> >                               continue;
> > -                     cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> > -                     cluster_list_add_tail(&p->free_clusters, cluster_info,
> > -                                           idx);
> > +                     ci->state = CLUSTER_STATE_FREE;
> > +                     list_add_tail(&ci->list, &p->free_clusters);
> >               }
> >       }
> >       return nr_extents;

Thank you for the review and for spotting the bug.

Chris


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/2] mm: swap: swap cluster switch to double link list
  2024-06-18  5:06     ` Chris Li
@ 2024-06-18  7:54       ` Huang, Ying
  2024-06-18 10:01         ` Chris Li
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2024-06-18  7:54 UTC (permalink / raw)
  To: Chris Li
  Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

Chris Li <chrisl@kernel.org> writes:

> On Sun, Jun 16, 2024 at 11:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Chris,
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > Previously, the swap cluster used a cluster index as a pointer
>> > to construct a custom single link list type "swap_cluster_list".
>> > The next cluster pointer is shared with the cluster->count.
>> > It prevents puting the non free cluster into a list.
>> > Change the cluster to use the standard double link list instead.
>> > This allows tracing the nonfull cluster in the follow up patch.
>> >
>> > Remove the cluster getter/setter for accessing the cluster
>> > struct member.
>> >
>> > The list operation is protected by the swap_info_struct->lock.
>> >
>> > Change cluster code to use "struct swap_cluster_info *" to
>> > reference the cluster rather than by using index. That is more
>> > consistent with the list manipulation. It avoids the repeat
>> > adding index to the cluser_info. The code is easier to understand.
>> >
>> > Remove the cluster next pointer is NULL flag, the double link
>> > list can handle the empty list pretty well.
>>
>> The above is more about "what" instead of "why".  We can identify "what"
>> from the patch itself.  I expect more "why".  I guess that we can reduce
>> swap_map[] scanning if we have lists of non-full/non-free clusters.
>
> In my mind, the "why" is captured by " This allows tracing the nonfull
> cluster in the follow up patch.".
> If you want to ask "why" we want the "nonfull cluster list". It is to
> get to the suitable candidate cluster with that order quicker than
> scanning swap_map[].

Good.  Please add that into the patch description.  And I think that we
can reduce the description about "what" too.
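
To make the "why" concrete: a nonfull cluster list needs to unlink an
arbitrary cluster in O(1), which the old head-only index list does not
appear to support.  The following is a minimal userspace sketch, not the
kernel code.  It re-implements a small subset of list_head-style helpers
under different names, and "struct cluster" with its "id" field is a
hypothetical stand-in for swap_cluster_info.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static inline void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static inline int list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Insert entry just before head, i.e. at the tail of the list. */
static inline void list_append(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink entry in O(1); neither the list head nor a scan is needed. */
static inline void list_unlink(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry;
	entry->prev = entry;
}

/* Hypothetical stand-in for struct swap_cluster_info. */
struct cluster {
	unsigned int id;
	struct list_head list;
};

/* Recover the containing cluster from its embedded list node. */
static inline struct cluster *cluster_of(struct list_head *node)
{
	assert(node != NULL);
	return (struct cluster *)((char *)node - offsetof(struct cluster, list));
}
```

With three clusters appended to one list, unlinking the middle one leaves
the other two linked in order, without walking the list.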

>>
>> > The "swap_cluster_info" struct is two pointer bigger, because
>> > 512 swap entries share one swap struct, it has very little impact
>> > on the average memory usage per swap entry. For 1TB swapfile, the
>> > swap cluster data structure increases from 8MB to 24MB.
>> >
>> > Other than the list conversion, there is no real function change
>> > in this patch.
>> >
>> > Signed-off-by: Chris Li <chrisl@kernel.org>
>> > ---
>> >  include/linux/swap.h |  28 +++----
>> >  mm/swapfile.c        | 227 +++++++++++++--------------------------------------
>> >  2 files changed, 70 insertions(+), 185 deletions(-)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index 3df75d62a835..cd9154a3e934 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -242,23 +242,22 @@ enum {
>> >   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
>> >   * free clusters are organized into a list. We fetch an entry from the list to
>> >   * get a free cluster.
>> > - *
>> > - * The data field stores next cluster if the cluster is free or cluster usage
>> > - * counter otherwise. The flags field determines if a cluster is free. This is
>> > - * protected by swap_info_struct.lock.
>> >   */
>> >  struct swap_cluster_info {
>> >       spinlock_t lock;        /*
>> > -                              * Protect swap_cluster_info fields
>> > -                              * and swap_info_struct->swap_map
>> > +                              * Protect swap_cluster_info count and state
>>
>> Protect swap_cluster_info fields except 'list' ?
>
> I change it to protect the swap_cluster_info bitfields in the second patch.

Although I still prefer my version, I will not insist on that.

>>
>> > +                              * field and swap_info_struct->swap_map
>> >                                * elements correspond to the swap
>> >                                * cluster
>> >                                */
>> > -     unsigned int data:24;
>> > -     unsigned int flags:8;
>> > +     unsigned int count:12;
>> > +     unsigned int state:3;
>>
>> I still prefer normal data type over bit fields.  How about
>>
>>         u16 usage;
>>         u8  state;
>
> I don't mind the "count" rename to "usage". That is probably a better
> name. However I have another patch intended to add more bit fields in
> the cluster info struct. The second patch adds "order" and the later
> patch will add more. That is why I choose bitfield to be more condense
> with bits.

We still have space for another "u8" for "order".  It appears trivial to
change it to bit fields when necessary in the future.
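
As a rough check of the space argument, the two layouts can be compared
in userspace.  This is only a sketch under stated assumptions:
"fake_spinlock_t" is assumed to be 4 bytes (the real spinlock_t size
depends on the kernel configuration), and "fake_list_head" stands in for
struct list_head.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins; the real spinlock_t size depends on kernel config. */
typedef unsigned int fake_spinlock_t;

struct fake_list_head {
	struct fake_list_head *next, *prev;
};

/* Layout as posted in the patch: bit fields for count and state. */
struct cluster_bitfield {
	fake_spinlock_t lock;
	unsigned int count:12;
	unsigned int state:3;
	struct fake_list_head list;
};

/* Layout suggested in review: plain integer types, with room for order. */
struct cluster_plain {
	fake_spinlock_t lock;
	uint16_t usage;
	uint8_t state;
	uint8_t order;
	struct fake_list_head list;
};

/* With these stand-ins both layouts occupy the same space. */
static_assert(sizeof(struct cluster_bitfield) == sizeof(struct cluster_plain),
	      "layouts differ in size");
```

Under these assumptions the choice does not change the struct size, so
it comes down to how many further fields are expected later.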

>>
>> And, how about use 'usage' instead of 'count'?  Personally I think that
>> it is more clear.  But I don't have strong opinions on this.
>>
>> > +     struct list_head list;  /* Protected by swap_info_struct->lock */
>> >  };
>> > -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>> > +
>> > +#define CLUSTER_STATE_FREE   1 /* This cluster is free */
>>
>> Can we use swap_cluster_info->count == 0?
>
> It is not as good considering the second patch starts to track the
> state of the cluster of per cpu struct. We will be comparing both the
> cluster->count and cluster->state.
>
>>
>> > +#define CLUSTER_STATE_PER_CPU        2 /* This cluster on per_cpu_cluster  */
>> > +
>>
>> There's no users of this state in this patch.  IMHO, it's better to
>
> Yes, there is usage of this state in this patch in the sense that, if
> you remove that state definition,
> the code can't compile due to assignment of CLUSTER_STATE_PER_CPU.

Sorry, my words were confusing, we can move both the assignment and the
state itself to the second patch.

> There is a code test if a cluster state is not a free state, which
> excludes "CLUSTER_STATE_PER_CPU".

You mean the functionality that is equivalent to the original
cluster_set_count_flag(0, 0) and cluster_is_free()?  I think
CLUSTER_STATE_PER_CPU cannot cover all cases.  If so, I suggest keeping
swap_cluster_info.flags and CLUSTER_FLAG_FREE in this patch and changing
them in the 2nd patch.  That will make this patch more focused and easier
to review.

In general, please try to keep this patch as simple as possible to help
reviewers, because it's quite long.  For example, just convert the list
implementation and keep the other stuff unchanged as much as possible.

>> introduce a symbol with its users, otherwise, it's hard to understand
>> why do we need it and how to use it.  And, IIUC, the state isn't
>> maintained properly, it should be changed when we move the cluster off
>> the per-cpu cluster.
>
> I am actually following the same usage principle as you suggested
> here. Only the second patch starts to use the off per cpu state
> (SCANNED). That is why I introduce it there.
>
>>
>> >  /*
>> >   * The first page in the swap file is the swap header, which is always marked
>> > @@ -283,11 +282,6 @@ struct percpu_cluster {
>> >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>> >  };
>> >
>> > -struct swap_cluster_list {
>> > -     struct swap_cluster_info head;
>> > -     struct swap_cluster_info tail;
>> > -};
>> > -
>> >  /*
>> >   * The in-memory structure used to track swap areas.
>> >   */
>> > @@ -300,7 +294,7 @@ struct swap_info_struct {
>> >       unsigned int    max;            /* extent of the swap_map */
>> >       unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
>> >       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> > -     struct swap_cluster_list free_clusters; /* free clusters list */
>> > +     struct list_head free_clusters; /* free clusters list */
>> >       unsigned int lowest_bit;        /* index of first free in swap_map */
>> >       unsigned int highest_bit;       /* index of last free in swap_map */
>> >       unsigned int pages;             /* total of usable pages of swap */
>> > @@ -331,7 +325,7 @@ struct swap_info_struct {
>> >                                        * list.
>> >                                        */
>> >       struct work_struct discard_work; /* discard worker */
>> > -     struct swap_cluster_list discard_clusters; /* discard clusters list */
>> > +     struct list_head discard_clusters; /* discard clusters list */
>> >       struct plist_node avail_lists[]; /*
>> >                                          * entries in swap_avail_heads, one
>> >                                          * entry per node.
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 9c6d8e557c0f..2f878b374349 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -290,62 +290,9 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>> >  #endif
>> >  #define LATENCY_LIMIT                256
>> >
>> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
>> > -     unsigned int flag)
>> > -{
>> > -     info->flags = flag;
>> > -}
>> > -
>> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
>> > -{
>> > -     return info->data;
>> > -}
>> > -
>> > -static inline void cluster_set_count(struct swap_cluster_info *info,
>> > -                                  unsigned int c)
>> > -{
>> > -     info->data = c;
>> > -}
>> > -
>> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
>> > -                                      unsigned int c, unsigned int f)
>> > -{
>> > -     info->flags = f;
>> > -     info->data = c;
>> > -}
>> > -
>> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
>> > -{
>> > -     return info->data;
>> > -}
>> > -
>> > -static inline void cluster_set_next(struct swap_cluster_info *info,
>> > -                                 unsigned int n)
>> > -{
>> > -     info->data = n;
>> > -}
>> > -
>> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
>> > -                                      unsigned int n, unsigned int f)
>> > -{
>> > -     info->flags = f;
>> > -     info->data = n;
>> > -}
>> > -
>> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
>> >  {
>> > -     return info->flags & CLUSTER_FLAG_FREE;
>> > -}
>> > -
>> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
>> > -{
>> > -     return info->flags & CLUSTER_FLAG_NEXT_NULL;
>> > -}
>> > -
>> > -static inline void cluster_set_null(struct swap_cluster_info *info)
>> > -{
>> > -     info->flags = CLUSTER_FLAG_NEXT_NULL;
>> > -     info->data = 0;
>> > +     return info->state == CLUSTER_STATE_FREE;
>> >  }
>> >
>> >  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>> > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
>> >               spin_unlock(&si->lock);
>> >  }
>> >
>> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
>> > -{
>> > -     return cluster_is_null(&list->head);
>> > -}
>> > -
>> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
>> > -{
>> > -     return cluster_next(&list->head);
>> > -}
>> > -
>> > -static void cluster_list_init(struct swap_cluster_list *list)
>> > -{
>> > -     cluster_set_null(&list->head);
>> > -     cluster_set_null(&list->tail);
>> > -}
>> > -
>> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
>> > -                               struct swap_cluster_info *ci,
>> > -                               unsigned int idx)
>> > -{
>> > -     if (cluster_list_empty(list)) {
>> > -             cluster_set_next_flag(&list->head, idx, 0);
>> > -             cluster_set_next_flag(&list->tail, idx, 0);
>> > -     } else {
>> > -             struct swap_cluster_info *ci_tail;
>> > -             unsigned int tail = cluster_next(&list->tail);
>> > -
>> > -             /*
>> > -              * Nested cluster lock, but both cluster locks are
>> > -              * only acquired when we held swap_info_struct->lock
>> > -              */
>> > -             ci_tail = ci + tail;
>> > -             spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
>> > -             cluster_set_next(ci_tail, idx);
>> > -             spin_unlock(&ci_tail->lock);
>> > -             cluster_set_next_flag(&list->tail, idx, 0);
>> > -     }
>> > -}
>> > -
>> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>> > -                                        struct swap_cluster_info *ci)
>> > -{
>> > -     unsigned int idx;
>> > -
>> > -     idx = cluster_next(&list->head);
>> > -     if (cluster_next(&list->tail) == idx) {
>> > -             cluster_set_null(&list->head);
>> > -             cluster_set_null(&list->tail);
>> > -     } else
>> > -             cluster_set_next_flag(&list->head,
>> > -                                   cluster_next(&ci[idx]), 0);
>> > -
>> > -     return idx;
>> > -}
>> > -
>> >  /* Add a cluster to discard list and schedule it to do discard */
>> >  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> > -             unsigned int idx)
>> > +             struct swap_cluster_info *ci)
>> >  {
>> > +     unsigned int idx = ci - si->cluster_info;
>>
>> I see this multiple times in the patch, can we define a helper for this?
> Ack.
>
>>
>> >       /*
>> >        * If scan_swap_map_slots() can't find a free cluster, it will check
>> >        * si->swap_map directly. To make sure the discarding cluster isn't
>> > @@ -462,17 +355,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> >       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> >                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> >
>> > -     cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>> > -
>> > +     list_add_tail(&ci->list, &si->discard_clusters);
>> >       schedule_work(&si->discard_work);
>> >  }
>> >
>> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>> >  {
>> > -     struct swap_cluster_info *ci = si->cluster_info;
>> > -
>> > -     cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
>> > -     cluster_list_add_tail(&si->free_clusters, ci, idx);
>> > +     ci->state = CLUSTER_STATE_FREE;
>> > +     list_add_tail(&ci->list, &si->free_clusters);
>> >  }
>> >
>> >  /*
>> > @@ -481,21 +371,22 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >  */
>> >  static void swap_do_scheduled_discard(struct swap_info_struct *si)
>> >  {
>> > -     struct swap_cluster_info *info, *ci;
>> > +     struct swap_cluster_info *ci;
>> >       unsigned int idx;
>> >
>> > -     info = si->cluster_info;
>> > -
>> > -     while (!cluster_list_empty(&si->discard_clusters)) {
>> > -             idx = cluster_list_del_first(&si->discard_clusters, info);
>> > +     while (!list_empty(&si->discard_clusters)) {
>> > +             ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
>> > +             list_del(&ci->list);
>> > +             idx = ci - si->cluster_info;
>> >               spin_unlock(&si->lock);
>> >
>> >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
>> >                               SWAPFILE_CLUSTER);
>> >
>> >               spin_lock(&si->lock);
>> > -             ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
>> > -             __free_cluster(si, idx);
>> > +
>> > +             spin_lock(&ci->lock);
>>
>> Personally, I still prefer to use lock_cluster(), which is more readable
>> and matches unlock_cluster() below.
>
> lock_cluster() uses an index which is not matching unlock_cluster()
> which is using a pointer to cluster.

lock_cluster()/unlock_cluster() are a pair and fit the original design
well.  They take different parameters because the swap cluster is optional.

> When you get the cluster from the list, you have a cluster pointer. I
> feel it is unnecessary to convert to index then back convert to
> cluster pointer inside lock_cluster(). I actually feel using indexes
> to refer to the cluster is error prone because we also have offset.

I don't think so, it's common to use swap offset.
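
For reference, the three ways of naming a cluster discussed here
(pointer, cluster index, swap offset) convert into each other as in the
sketch below.  The helper names are hypothetical and not from the patch,
the structs are reduced userspace stand-ins, and SWAPFILE_CLUSTER is
assumed to be 512.

```c
#include <assert.h>
#include <stddef.h>

#define SWAPFILE_CLUSTER 512	/* swap entries per cluster, assumed here */

/* Reduced stand-ins for the kernel structures. */
struct swap_cluster_info {
	unsigned int count;
};

struct swap_info_struct {
	struct swap_cluster_info *cluster_info;	/* array, one per cluster */
};

/* Hypothetical helper: cluster pointer to cluster index. */
static inline unsigned long cluster_index(struct swap_info_struct *si,
					  struct swap_cluster_info *ci)
{
	assert(ci >= si->cluster_info);
	return (unsigned long)(ci - si->cluster_info);
}

/* Hypothetical helper: first swap offset covered by the cluster. */
static inline unsigned long cluster_offset(struct swap_info_struct *si,
					   struct swap_cluster_info *ci)
{
	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
}

/* Hypothetical helper: swap offset back to the cluster that holds it. */
static inline struct swap_cluster_info *
offset_to_cluster(struct swap_info_struct *si, unsigned long offset)
{
	return si->cluster_info + offset / SWAPFILE_CLUSTER;
}
```

Keeping the conversions in one place would also address the earlier
request for a helper instead of open-coding "ci - si->cluster_info".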

>
>>
>> > +             __free_cluster(si, ci);
>> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> >                               0, SWAPFILE_CLUSTER);
>> >               unlock_cluster(ci);
>> > @@ -521,20 +412,19 @@ static void swap_users_ref_free(struct percpu_ref *ref)
>> >       complete(&si->comp);
>> >  }
>> >
>> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>> >  {
>> > -     struct swap_cluster_info *ci = si->cluster_info;
>> > +     struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> >
>> > -     VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
>> > -     cluster_list_del_first(&si->free_clusters, ci);
>> > -     cluster_set_count_flag(ci + idx, 0, 0);
>> > +     VM_BUG_ON(ci - si->cluster_info != idx);
>> > +     list_del(&ci->list);
>> > +     ci->count = 0;
>>
>> Do we need this now?  If we keep CLUSTER_STATE_FREE, we need to change
>> it here.
>
> Good catch, thanks for catching that. Now I realized this is actually
> problematic and tricky to get it right. Let me work on that.
>
>>
>> > +     return ci;
>> >  }
>> >
>> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>> >  {
>> > -     struct swap_cluster_info *ci = si->cluster_info + idx;
>> > -
>> > -     VM_BUG_ON(cluster_count(ci) != 0);
>> > +     VM_BUG_ON(ci->count != 0);
>> >       /*
>> >        * If the swap is discardable, prepare discard the cluster
>> >        * instead of free it immediately. The cluster will be freed
>> > @@ -542,11 +432,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >        */
>> >       if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
>> >           (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
>> > -             swap_cluster_schedule_discard(si, idx);
>> > +             swap_cluster_schedule_discard(si, ci);
>> >               return;
>> >       }
>> >
>> > -     __free_cluster(si, idx);
>> > +     __free_cluster(si, ci);
>> >  }
>> >
>> >  /*
>> > @@ -559,15 +449,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
>> >       unsigned long count)
>> >  {
>> >       unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>> > +     struct swap_cluster_info *ci = cluster_info + idx;
>> >
>> >       if (!cluster_info)
>> >               return;
>> > -     if (cluster_is_free(&cluster_info[idx]))
>> > +     if (cluster_is_free(ci))
>> >               alloc_cluster(p, idx);
>> >
>> > -     VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>> > -     cluster_set_count(&cluster_info[idx],
>> > -             cluster_count(&cluster_info[idx]) + count);
>> > +     VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
>> > +     ci->count += count;
>> >  }
>> >
>> >  /*
>> > @@ -581,24 +471,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>> >  }
>> >
>> >  /*
>> > - * The cluster corresponding to page_nr decreases one usage. If the usage
>> > - * counter becomes 0, which means no page in the cluster is in using, we can
>> > - * optionally discard the cluster and add it to free cluster list.
>> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
>> > + * which means no page in the cluster is in using, we can optionally discard
>> > + * the cluster and add it to free cluster list.
>> >   */
>> > -static void dec_cluster_info_page(struct swap_info_struct *p,
>> > -     struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
>> >  {
>> > -     unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>> > -
>> > -     if (!cluster_info)
>> > +     if (!p->cluster_info)
>> >               return;
>> >
>> > -     VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
>> > -     cluster_set_count(&cluster_info[idx],
>> > -             cluster_count(&cluster_info[idx]) - 1);
>> > +     VM_BUG_ON(ci->count == 0);
>> > +     ci->count--;
>> >
>> > -     if (cluster_count(&cluster_info[idx]) == 0)
>> > -             free_cluster(p, idx);
>> > +     if (!ci->count)
>> > +             free_cluster(p, ci);
>> >  }
>> >
>> >  /*
>> > @@ -611,10 +497,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> >  {
>> >       struct percpu_cluster *percpu_cluster;
>> >       bool conflict;
>> > -
>>
>> Usually we use one blank line after local variable declaration.
> Ack.
>
>>
>> > +     struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> >       offset /= SWAPFILE_CLUSTER;
>> > -     conflict = !cluster_list_empty(&si->free_clusters) &&
>> > -             offset != cluster_list_first(&si->free_clusters) &&
>> > +     conflict = !list_empty(&si->free_clusters) &&
>> > +             offset !=  first - si->cluster_info &&
>> >               cluster_is_free(&si->cluster_info[offset]);
>> >
>> >       if (!conflict)
>> > @@ -655,10 +541,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> >       cluster = this_cpu_ptr(si->percpu_cluster);
>> >       tmp = cluster->next[order];
>> >       if (tmp == SWAP_NEXT_INVALID) {
>> > -             if (!cluster_list_empty(&si->free_clusters)) {
>> > -                     tmp = cluster_next(&si->free_clusters.head) *
>> > -                                     SWAPFILE_CLUSTER;
>> > -             } else if (!cluster_list_empty(&si->discard_clusters)) {
>> > +             if (!list_empty(&si->free_clusters)) {
>> > +                     ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> > +                     list_del(&ci->list);
>>
>> The free cluster is deleted from si->free_clusters now.  But later you
>> will call scan_swap_map_ssd_cluster_conflict() and may abandon the
>> cluster.  And in alloc_cluster() later, it may be deleted again.
>
> Yes, that is a bug. Thanks for catching that.
>
>>
>> > +                     spin_lock(&ci->lock);
>> > +                     ci->state = CLUSTER_STATE_PER_CPU;
>>
>> Need to change ci->state when move a cluster off the percpu_cluster.
>
> In the next patch. This patch does not use the off state yet.

But it is confusing to use the wrong state name; the real meaning is
something like CLUSTER_STATE_NON_FREE.  But as I suggested above, we can
keep swap_cluster_info.flags and CLUSTER_FLAG_FREE in this patch.

>>
>> > +                     spin_unlock(&ci->lock);
>> > +                     tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>> > +             } else if (!list_empty(&si->discard_clusters)) {
>> >                       /*
>> >                        * we don't have free cluster but have some clusters in
>> >                        * discarding, do discard now and reclaim them, then
>> > @@ -1062,8 +952,8 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >
>> >       ci = lock_cluster(si, offset);
>> >       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>> > -     cluster_set_count_flag(ci, 0, 0);
>> > -     free_cluster(si, idx);
>> > +     ci->count = 0;
>> > +     free_cluster(si, ci);
>> >       unlock_cluster(ci);
>> >       swap_range_free(si, offset, SWAPFILE_CLUSTER);
>> >  }
>> > @@ -1336,7 +1226,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>> >       count = p->swap_map[offset];
>> >       VM_BUG_ON(count != SWAP_HAS_CACHE);
>> >       p->swap_map[offset] = 0;
>> > -     dec_cluster_info_page(p, p->cluster_info, offset);
>> > +     dec_cluster_info_page(p, ci);
>> >       unlock_cluster(ci);
>> >
>> >       mem_cgroup_uncharge_swap(entry, 1);
>> > @@ -3003,8 +2893,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>> >
>> >       nr_good_pages = maxpages - 1;   /* omit header page */
>> >
>> > -     cluster_list_init(&p->free_clusters);
>> > -     cluster_list_init(&p->discard_clusters);
>> > +     INIT_LIST_HEAD(&p->free_clusters);
>> > +     INIT_LIST_HEAD(&p->discard_clusters);
>> >
>> >       for (i = 0; i < swap_header->info.nr_badpages; i++) {
>> >               unsigned int page_nr = swap_header->info.badpages[i];
>> > @@ -3055,14 +2945,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>> >       for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
>> >               j = (k + col) % SWAP_CLUSTER_COLS;
>> >               for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
>> > +                     struct swap_cluster_info *ci;
>> >                       idx = i * SWAP_CLUSTER_COLS + j;
>> > +                     ci = cluster_info + idx;
>> >                       if (idx >= nr_clusters)
>> >                               continue;
>> > -                     if (cluster_count(&cluster_info[idx]))
>> > +                     if (ci->count)
>> >                               continue;
>> > -                     cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
>> > -                     cluster_list_add_tail(&p->free_clusters, cluster_info,
>> > -                                           idx);
>> > +                     ci->state = CLUSTER_STATE_FREE;
>> > +                     list_add_tail(&ci->list, &p->free_clusters);
>> >               }
>> >       }
>> >       return nr_extents;
>
> Thank you for the review and spotting the bug.

My pleasure!

--
Best Regards,
Huang, Ying


* Re: [PATCH v2 1/2] mm: swap: swap cluster switch to double link list
  2024-06-18  7:54       ` Huang, Ying
@ 2024-06-18 10:01         ` Chris Li
  2024-06-19  7:51           ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Li @ 2024-06-18 10:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

On Tue, Jun 18, 2024 at 12:56 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Sun, Jun 16, 2024 at 11:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Chris,
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > Previously, the swap cluster used a cluster index as a pointer
> >> > to construct a custom single link list type "swap_cluster_list".
> >> > The next cluster pointer is shared with the cluster->count.
> >> > It prevents puting the non free cluster into a list.
> >> > Change the cluster to use the standard double link list instead.
> >> > This allows tracing the nonfull cluster in the follow up patch.
> >> >
> >> > Remove the cluster getter/setter for accessing the cluster
> >> > struct member.
> >> >
> >> > The list operation is protected by the swap_info_struct->lock.
> >> >
> >> > Change cluster code to use "struct swap_cluster_info *" to
> >> > reference the cluster rather than by using index. That is more
> >> > consistent with the list manipulation. It avoids the repeat
> >> > adding index to the cluser_info. The code is easier to understand.
> >> >
> >> > Remove the cluster next pointer is NULL flag, the double link
> >> > list can handle the empty list pretty well.
> >>
> >> The above is more about "what" instead of "why".  We can identify "what"
> >> from the patch itself.  I expect more "why".  I guess that we can reduce
> >> swap_map[] scanning if we have lists of non-full/non-free clusters.
> >
> > In my mind, the "why" is captured by " This allows tracing the nonfull
> > cluster in the follow up patch.".
> > If you want to ask "why" we want the "nonfull cluster list". It is to
> > get to the suitable candidate cluster with that order quicker than
> > scanning swap_map[].
>
> Good.  Please add that into the patch description.  And I think that we
> can reduce the description about "what" too.

Sure.

>
> >>
> >> > The "swap_cluster_info" struct is two pointer bigger, because
> >> > 512 swap entries share one swap struct, it has very little impact
> >> > on the average memory usage per swap entry. For 1TB swapfile, the
> >> > swap cluster data structure increases from 8MB to 24MB.
> >> >
> >> > Other than the list conversion, there is no real function change
> >> > in this patch.
> >> >
> >> > Signed-off-by: Chris Li <chrisl@kernel.org>
> >> > ---
> >> >  include/linux/swap.h |  28 +++----
> >> >  mm/swapfile.c        | 227 +++++++++++++--------------------------------------
> >> >  2 files changed, 70 insertions(+), 185 deletions(-)
> >> >
> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> > index 3df75d62a835..cd9154a3e934 100644
> >> > --- a/include/linux/swap.h
> >> > +++ b/include/linux/swap.h
> >> > @@ -242,23 +242,22 @@ enum {
> >> >   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> >> >   * free clusters are organized into a list. We fetch an entry from the list to
> >> >   * get a free cluster.
> >> > - *
> >> > - * The data field stores next cluster if the cluster is free or cluster usage
> >> > - * counter otherwise. The flags field determines if a cluster is free. This is
> >> > - * protected by swap_info_struct.lock.
> >> >   */
> >> >  struct swap_cluster_info {
> >> >       spinlock_t lock;        /*
> >> > -                              * Protect swap_cluster_info fields
> >> > -                              * and swap_info_struct->swap_map
> >> > +                              * Protect swap_cluster_info count and state
> >>
> >> Protect swap_cluster_info fields except 'list' ?
> >
> > I change it to protect the swap_cluster_info bitfields in the second patch.
>
> Although I still prefer my version, I will not insist on that.

Sure, I actually don't have a strong preference about that. It is just comments.

>
> >>
> >> > +                              * field and swap_info_struct->swap_map
> >> >                                * elements correspond to the swap
> >> >                                * cluster
> >> >                                */
> >> > -     unsigned int data:24;
> >> > -     unsigned int flags:8;
> >> > +     unsigned int count:12;
> >> > +     unsigned int state:3;
> >>
> >> I still prefer normal data type over bit fields.  How about
> >>
> >>         u16 usage;
> >>         u8  state;
> >
> > I don't mind the "count" rename to "usage". That is probably a better
> > name. However I have another patch intended to add more bit fields in
> > the cluster info struct. The second patch adds "order" and the later
> > patch will add more. That is why I choose bitfield to be more condense
> > with bits.
>
> We still have space for another "u8" for "order".  It appears trivial to
> change it to bit fields when necessary in the future.

We can, but I don't see the need to change from bit fields to u8 and
back to bit fields in the future. It is more of a personal preference
issue.

> >>
> >> And, how about use 'usage' instead of 'count'?  Personally I think that
> >> it is more clear.  But I don't have strong opinions on this.
> >>
> >> > +     struct list_head list;  /* Protected by swap_info_struct->lock */
> >> >  };
> >> > -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> >> > +
> >> > +#define CLUSTER_STATE_FREE   1 /* This cluster is free */
> >>
> >> Can we use swap_cluster_info->count == 0?
> >
> > It is not as good considering the second patch starts to track the
> > state of the cluster of per cpu struct. We will be comparing both the
> > cluster->count and cluster->state.
> >
> >>
> >> > +#define CLUSTER_STATE_PER_CPU        2 /* This cluster on per_cpu_cluster  */
> >> > +
> >>
> >> There's no users of this state in this patch.  IMHO, it's better to
> >
> > Yes, there is usage of this state in this patch in the sense that, if
> > you remove that state definition,
> > the code can't compile due to assignment of CLUSTER_STATE_PER_CPU.
>
> Sorry, my words were confusing, we can move both the assignment and the
> state itself to the second patch.
>
> > There is a code test if a cluster state is not a free state, which
> > excludes "CLUSTER_STATE_PER_CPU".
>
> You mean the functionality that is equivalent to original
> cluster_set_count_flag(0, 0) and cluster_is_free()?  I think
> CLUSTER_STATE_PER_CPU cannot catch all.  If so, I suggest you to keep
> swap_cluster_info.flags and CLUSTER_FLAG_FREE in this patch and change
> it in the 2nd patch.  That will make this patch more focused and easier
> to be reviewed.

That is one way to do it.

>
> In general, please try to keep this patch as simple as possible to help
> reviewers.  Because it's quite long.  For example, just convert the list
> implementation and keep other stuff as much as possible.
>
Let me think about it. Thanks.

> >> introduce a symbol with its users, otherwise, it's hard to understand
> >> why do we need it and how to use it.  And, IIUC, the state isn't
> >> maintained properly, it should be changed when we move the cluster off
> >> the per-cpu cluster.
> >
> > I am actually following the same usage principle as you suggested
> > here. Only the second patch starts to use the off per cpu state
> > (SCANNED). That is why I introduce it there.
> >
> >>
> >> >  /*
> >> >   * The first page in the swap file is the swap header, which is always marked
> >> > @@ -283,11 +282,6 @@ struct percpu_cluster {
> >> >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> >> >  };
> >> >
> >> > -struct swap_cluster_list {
> >> > -     struct swap_cluster_info head;
> >> > -     struct swap_cluster_info tail;
> >> > -};
> >> > -
> >> >  /*
> >> >   * The in-memory structure used to track swap areas.
> >> >   */
> >> > @@ -300,7 +294,7 @@ struct swap_info_struct {
> >> >       unsigned int    max;            /* extent of the swap_map */
> >> >       unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
> >> >       struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >> > -     struct swap_cluster_list free_clusters; /* free clusters list */
> >> > +     struct list_head free_clusters; /* free clusters list */
> >> >       unsigned int lowest_bit;        /* index of first free in swap_map */
> >> >       unsigned int highest_bit;       /* index of last free in swap_map */
> >> >       unsigned int pages;             /* total of usable pages of swap */
> >> > @@ -331,7 +325,7 @@ struct swap_info_struct {
> >> >                                        * list.
> >> >                                        */
> >> >       struct work_struct discard_work; /* discard worker */
> >> > -     struct swap_cluster_list discard_clusters; /* discard clusters list */
> >> > +     struct list_head discard_clusters; /* discard clusters list */
> >> >       struct plist_node avail_lists[]; /*
> >> >                                          * entries in swap_avail_heads, one
> >> >                                          * entry per node.
> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > index 9c6d8e557c0f..2f878b374349 100644
> >> > --- a/mm/swapfile.c
> >> > +++ b/mm/swapfile.c
> >> > @@ -290,62 +290,9 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> >> >  #endif
> >> >  #define LATENCY_LIMIT                256
> >> >
> >> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
> >> > -     unsigned int flag)
> >> > -{
> >> > -     info->flags = flag;
> >> > -}
> >> > -
> >> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> >> > -{
> >> > -     return info->data;
> >> > -}
> >> > -
> >> > -static inline void cluster_set_count(struct swap_cluster_info *info,
> >> > -                                  unsigned int c)
> >> > -{
> >> > -     info->data = c;
> >> > -}
> >> > -
> >> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> >> > -                                      unsigned int c, unsigned int f)
> >> > -{
> >> > -     info->flags = f;
> >> > -     info->data = c;
> >> > -}
> >> > -
> >> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> >> > -{
> >> > -     return info->data;
> >> > -}
> >> > -
> >> > -static inline void cluster_set_next(struct swap_cluster_info *info,
> >> > -                                 unsigned int n)
> >> > -{
> >> > -     info->data = n;
> >> > -}
> >> > -
> >> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> >> > -                                      unsigned int n, unsigned int f)
> >> > -{
> >> > -     info->flags = f;
> >> > -     info->data = n;
> >> > -}
> >> > -
> >> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
> >> >  {
> >> > -     return info->flags & CLUSTER_FLAG_FREE;
> >> > -}
> >> > -
> >> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
> >> > -{
> >> > -     return info->flags & CLUSTER_FLAG_NEXT_NULL;
> >> > -}
> >> > -
> >> > -static inline void cluster_set_null(struct swap_cluster_info *info)
> >> > -{
> >> > -     info->flags = CLUSTER_FLAG_NEXT_NULL;
> >> > -     info->data = 0;
> >> > +     return info->state == CLUSTER_STATE_FREE;
> >> >  }
> >> >
> >> >  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> >> > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> >> >               spin_unlock(&si->lock);
> >> >  }
> >> >
> >> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> >> > -{
> >> > -     return cluster_is_null(&list->head);
> >> > -}
> >> > -
> >> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> >> > -{
> >> > -     return cluster_next(&list->head);
> >> > -}
> >> > -
> >> > -static void cluster_list_init(struct swap_cluster_list *list)
> >> > -{
> >> > -     cluster_set_null(&list->head);
> >> > -     cluster_set_null(&list->tail);
> >> > -}
> >> > -
> >> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
> >> > -                               struct swap_cluster_info *ci,
> >> > -                               unsigned int idx)
> >> > -{
> >> > -     if (cluster_list_empty(list)) {
> >> > -             cluster_set_next_flag(&list->head, idx, 0);
> >> > -             cluster_set_next_flag(&list->tail, idx, 0);
> >> > -     } else {
> >> > -             struct swap_cluster_info *ci_tail;
> >> > -             unsigned int tail = cluster_next(&list->tail);
> >> > -
> >> > -             /*
> >> > -              * Nested cluster lock, but both cluster locks are
> >> > -              * only acquired when we held swap_info_struct->lock
> >> > -              */
> >> > -             ci_tail = ci + tail;
> >> > -             spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> >> > -             cluster_set_next(ci_tail, idx);
> >> > -             spin_unlock(&ci_tail->lock);
> >> > -             cluster_set_next_flag(&list->tail, idx, 0);
> >> > -     }
> >> > -}
> >> > -
> >> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> >> > -                                        struct swap_cluster_info *ci)
> >> > -{
> >> > -     unsigned int idx;
> >> > -
> >> > -     idx = cluster_next(&list->head);
> >> > -     if (cluster_next(&list->tail) == idx) {
> >> > -             cluster_set_null(&list->head);
> >> > -             cluster_set_null(&list->tail);
> >> > -     } else
> >> > -             cluster_set_next_flag(&list->head,
> >> > -                                   cluster_next(&ci[idx]), 0);
> >> > -
> >> > -     return idx;
> >> > -}
> >> > -
> >> >  /* Add a cluster to discard list and schedule it to do discard */
> >> >  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >> > -             unsigned int idx)
> >> > +             struct swap_cluster_info *ci)
> >> >  {
> >> > +     unsigned int idx = ci - si->cluster_info;
> >>
> >> I see this multiple times in the patch, can we define a helper for this?
> > Ack.
> >
> >>
> >> >       /*
> >> >        * If scan_swap_map_slots() can't find a free cluster, it will check
> >> >        * si->swap_map directly. To make sure the discarding cluster isn't
> >> > @@ -462,17 +355,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >> >       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >> >                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >> >
> >> > -     cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> >> > -
> >> > +     list_add_tail(&ci->list, &si->discard_clusters);
> >> >       schedule_work(&si->discard_work);
> >> >  }
> >> >
> >> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >> >  {
> >> > -     struct swap_cluster_info *ci = si->cluster_info;
> >> > -
> >> > -     cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> >> > -     cluster_list_add_tail(&si->free_clusters, ci, idx);
> >> > +     ci->state = CLUSTER_STATE_FREE;
> >> > +     list_add_tail(&ci->list, &si->free_clusters);
> >> >  }
> >> >
> >> >  /*
> >> > @@ -481,21 +371,22 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> >  */
> >> >  static void swap_do_scheduled_discard(struct swap_info_struct *si)
> >> >  {
> >> > -     struct swap_cluster_info *info, *ci;
> >> > +     struct swap_cluster_info *ci;
> >> >       unsigned int idx;
> >> >
> >> > -     info = si->cluster_info;
> >> > -
> >> > -     while (!cluster_list_empty(&si->discard_clusters)) {
> >> > -             idx = cluster_list_del_first(&si->discard_clusters, info);
> >> > +     while (!list_empty(&si->discard_clusters)) {
> >> > +             ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> >> > +             list_del(&ci->list);
> >> > +             idx = ci - si->cluster_info;
> >> >               spin_unlock(&si->lock);
> >> >
> >> >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> >> >                               SWAPFILE_CLUSTER);
> >> >
> >> >               spin_lock(&si->lock);
> >> > -             ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> >> > -             __free_cluster(si, idx);
> >> > +
> >> > +             spin_lock(&ci->lock);
> >>
> >> Personally, I still prefer to use lock_cluster(), which is more readable
> >> and matches unlock_cluster() below.
> >
> > lock_cluster() uses an index which is not matching unlock_cluster()
> > which is using a pointer to cluster.
>
> lock_cluster()/unlock_cluster() are pair and fit original design
> well.  They use different parameter because swap cluster is optional.
>
> > When you get the cluster from the list, you have a cluster pointer. I
> > feel it is unnecessary to convert to index then back convert to
> > cluster pointer inside lock_cluster(). I actually feel using indexes
> > to refer to the cluster is error prone because we also have offset.
>
> I don't think so, it's common to use swap offset.

Swap offset is not an issue; it is all over the place. The cluster
index (offset/512) is the one I try to avoid.
I have made some mistakes myself with offset vs. index.

>
> >
> >>
> >> > +             __free_cluster(si, ci);
> >> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >> >                               0, SWAPFILE_CLUSTER);
> >> >               unlock_cluster(ci);
> >> > @@ -521,20 +412,19 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> >> >       complete(&si->comp);
> >> >  }
> >> >
> >> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> >> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> >> >  {
> >> > -     struct swap_cluster_info *ci = si->cluster_info;
> >> > +     struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >> >
> >> > -     VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> >> > -     cluster_list_del_first(&si->free_clusters, ci);
> >> > -     cluster_set_count_flag(ci + idx, 0, 0);
> >> > +     VM_BUG_ON(ci - si->cluster_info != idx);
> >> > +     list_del(&ci->list);
> >> > +     ci->count = 0;
> >>
> >> Do we need this now?  If we keep CLUSTER_STATE_FREE, we need to change
> >> it here.
> >
> > Good catch, thanks for catching that. Now I realized this is actually
> > problematic and tricky to get it right. Let me work on that.
> >
> >>
> >> > +     return ci;
> >> >  }
> >> >
> >> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >> >  {
> >> > -     struct swap_cluster_info *ci = si->cluster_info + idx;
> >> > -
> >> > -     VM_BUG_ON(cluster_count(ci) != 0);
> >> > +     VM_BUG_ON(ci->count != 0);
> >> >       /*
> >> >        * If the swap is discardable, prepare discard the cluster
> >> >        * instead of free it immediately. The cluster will be freed
> >> > @@ -542,11 +432,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> >        */
> >> >       if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> >> >           (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> >> > -             swap_cluster_schedule_discard(si, idx);
> >> > +             swap_cluster_schedule_discard(si, ci);
> >> >               return;
> >> >       }
> >> >
> >> > -     __free_cluster(si, idx);
> >> > +     __free_cluster(si, ci);
> >> >  }
> >> >
> >> >  /*
> >> > @@ -559,15 +449,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> >> >       unsigned long count)
> >> >  {
> >> >       unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> >> > +     struct swap_cluster_info *ci = cluster_info + idx;
> >> >
> >> >       if (!cluster_info)
> >> >               return;
> >> > -     if (cluster_is_free(&cluster_info[idx]))
> >> > +     if (cluster_is_free(ci))
> >> >               alloc_cluster(p, idx);
> >> >
> >> > -     VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> >> > -     cluster_set_count(&cluster_info[idx],
> >> > -             cluster_count(&cluster_info[idx]) + count);
> >> > +     VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> >> > +     ci->count += count;
> >> >  }
> >> >
> >> >  /*
> >> > @@ -581,24 +471,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> >> >  }
> >> >
> >> >  /*
> >> > - * The cluster corresponding to page_nr decreases one usage. If the usage
> >> > - * counter becomes 0, which means no page in the cluster is in using, we can
> >> > - * optionally discard the cluster and add it to free cluster list.
> >> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
> >> > + * which means no page in the cluster is in using, we can optionally discard
> >> > + * the cluster and add it to free cluster list.
> >> >   */
> >> > -static void dec_cluster_info_page(struct swap_info_struct *p,
> >> > -     struct swap_cluster_info *cluster_info, unsigned long page_nr)
> >> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> >> >  {
> >> > -     unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> >> > -
> >> > -     if (!cluster_info)
> >> > +     if (!p->cluster_info)
> >> >               return;
> >> >
> >> > -     VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> >> > -     cluster_set_count(&cluster_info[idx],
> >> > -             cluster_count(&cluster_info[idx]) - 1);
> >> > +     VM_BUG_ON(ci->count == 0);
> >> > +     ci->count--;
> >> >
> >> > -     if (cluster_count(&cluster_info[idx]) == 0)
> >> > -             free_cluster(p, idx);
> >> > +     if (!ci->count)
> >> > +             free_cluster(p, ci);
> >> >  }
> >> >
> >> >  /*
> >> > @@ -611,10 +497,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> >> >  {
> >> >       struct percpu_cluster *percpu_cluster;
> >> >       bool conflict;
> >> > -
> >>
> >> Usually we use one blank line after local variable declaration.
> > Ack.
> >
> >>
> >> > +     struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >> >       offset /= SWAPFILE_CLUSTER;
> >> > -     conflict = !cluster_list_empty(&si->free_clusters) &&
> >> > -             offset != cluster_list_first(&si->free_clusters) &&
> >> > +     conflict = !list_empty(&si->free_clusters) &&
> >> > +             offset !=  first - si->cluster_info &&
> >> >               cluster_is_free(&si->cluster_info[offset]);
> >> >
> >> >       if (!conflict)
> >> > @@ -655,10 +541,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >> >       cluster = this_cpu_ptr(si->percpu_cluster);
> >> >       tmp = cluster->next[order];
> >> >       if (tmp == SWAP_NEXT_INVALID) {
> >> > -             if (!cluster_list_empty(&si->free_clusters)) {
> >> > -                     tmp = cluster_next(&si->free_clusters.head) *
> >> > -                                     SWAPFILE_CLUSTER;
> >> > -             } else if (!cluster_list_empty(&si->discard_clusters)) {
> >> > +             if (!list_empty(&si->free_clusters)) {
> >> > +                     ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >> > +                     list_del(&ci->list);
> >>
> >> The free cluster is deleted from si->free_clusters now.  But later you
> >> will call scan_swap_map_ssd_cluster_conflict() and may abandon the
> >> cluster.  And in alloc_cluster() later, it may be deleted again.
> >
> > Yes, that is a bug. Thanks for catching that.
> >
> >>
> >> > +                     spin_lock(&ci->lock);
> >> > +                     ci->state = CLUSTER_STATE_PER_CPU;
> >>
> >> Need to change ci->state when move a cluster off the percpu_cluster.
> >
> > In the next patch. This patch does not use the off state yet.
>
> But that is confusing to use wrong state name, the really meaning is
> something like CLUSTER_STATE_NON_FREE.  But as I suggested above, we can

A cluster can be FREE and on the per cpu pointer at the same time. That
is the tricky part. It can happen in the current code as well.

> keep swap_cluster_info.flags and CLUSTER_FLAG_FREE in this patch.

Maybe. Will consider that.

>
> >>
> >> > +                     spin_unlock(&ci->lock);
> >> > +                     tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> >> > +             } else if (!list_empty(&si->discard_clusters)) {
> >> >                       /*
> >> >                        * we don't have free cluster but have some clusters in
> >> >                        * discarding, do discard now and reclaim them, then
> >> > @@ -1062,8 +952,8 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> >
> >> >       ci = lock_cluster(si, offset);
> >> >       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >> > -     cluster_set_count_flag(ci, 0, 0);
> >> > -     free_cluster(si, idx);
> >> > +     ci->count = 0;
> >> > +     free_cluster(si, ci);
> >> >       unlock_cluster(ci);
> >> >       swap_range_free(si, offset, SWAPFILE_CLUSTER);
> >> >  }
> >> > @@ -1336,7 +1226,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> >> >       count = p->swap_map[offset];
> >> >       VM_BUG_ON(count != SWAP_HAS_CACHE);
> >> >       p->swap_map[offset] = 0;
> >> > -     dec_cluster_info_page(p, p->cluster_info, offset);
> >> > +     dec_cluster_info_page(p, ci);
> >> >       unlock_cluster(ci);
> >> >
> >> >       mem_cgroup_uncharge_swap(entry, 1);
> >> > @@ -3003,8 +2893,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >> >
> >> >       nr_good_pages = maxpages - 1;   /* omit header page */
> >> >
> >> > -     cluster_list_init(&p->free_clusters);
> >> > -     cluster_list_init(&p->discard_clusters);
> >> > +     INIT_LIST_HEAD(&p->free_clusters);
> >> > +     INIT_LIST_HEAD(&p->discard_clusters);
> >> >
> >> >       for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >> >               unsigned int page_nr = swap_header->info.badpages[i];
> >> > @@ -3055,14 +2945,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >> >       for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> >> >               j = (k + col) % SWAP_CLUSTER_COLS;
> >> >               for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> >> > +                     struct swap_cluster_info *ci;
> >> >                       idx = i * SWAP_CLUSTER_COLS + j;
> >> > +                     ci = cluster_info + idx;
> >> >                       if (idx >= nr_clusters)
> >> >                               continue;
> >> > -                     if (cluster_count(&cluster_info[idx]))
> >> > +                     if (ci->count)
> >> >                               continue;
> >> > -                     cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> >> > -                     cluster_list_add_tail(&p->free_clusters, cluster_info,
> >> > -                                           idx);
> >> > +                     ci->state = CLUSTER_STATE_FREE;
> >> > +                     list_add_tail(&ci->list, &p->free_clusters);
> >> >               }
> >> >       }
> >> >       return nr_extents;
> >
> > Thank you for the review and spotting the bug.
>
> My pleasure!

Thanks!

Chris


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
  2024-06-14 23:48 [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
                   ` (2 preceding siblings ...)
  2024-06-15  1:06 ` [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Andrew Morton
@ 2024-06-18 13:08 ` David Hildenbrand
  3 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-06-18 13:08 UTC (permalink / raw)
  To: Chris Li, Andrew Morton
  Cc: Kairui Song, Ryan Roberts, Huang, Ying, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

On 15.06.24 01:48, Chris Li wrote:
> This is the short term solution "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
> 
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from new empty cluster list.  It has a fragmentation issue
> reported by Barry.
> 
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> 
> The mTHP allocation failure rate rises to almost 100% after a few
> hours in Barry's test run.
> 
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in the clusters that are
> not 100% free.
> 
> Remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.
> 
> This greatly improves the success rate of the mTHP swap allocation.
> 
> There is some test number in the V1 thread of this series:
> https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> 
> Reported-by: Barry Song <21cnbao@gmail.com>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---

Running the cow.c selftest with a bunch of debug config
settings enabled, I get on mm-unstable:

[   25.236555] list_add corruption. prev->next should be next (ffff888105b5ad08), but was ffff888105b5ae78. (prev=ffff88812580b048).
[   25.237432] ------------[ cut here ]------------
[   25.237702] kernel BUG at lib/list_debug.c:32!
[   25.237962] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   25.238288] CPU: 23 PID: 1264 Comm: cow Tainted: G        W          6.10.0-rc4+ #301
[   25.238720] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   25.239335] RIP: 0010:__list_add_valid_or_report+0x78/0xa0
[   25.239646] Code: 6b ff 0f 0b 48 89 c1 48 c7 c7 c0 30 0e 83 e8 7f e5 6b ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 18 31 0e 83 e8 68 e5 6b ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 70 31 0e 83 e8 51 e5b
[   25.240670] RSP: 0000:ffffc90002c87bd0 EFLAGS: 00010246
[   25.240964] RAX: 0000000000000075 RBX: ffff888105b5ac00 RCX: 0000000000000000
[   25.241362] RDX: 0000000000000000 RSI: ffff88885f9a1a00 RDI: ffff88885f9a1a00
[   25.241762] RBP: ffff88810624de20 R08: 0000000000000000 R09: 0000000000000003
[   25.242158] R10: ffffc90002c87a78 R11: ffffffff83b5b808 R12: 0000000000044000
[   25.242556] R13: 0000000000044000 R14: ffff88810624e000 R15: ffff88812580bb00
[   25.242960] FS:  00007f4fb364b740(0000) GS:ffff88885f980000(0000) knlGS:0000000000000000
[   25.243413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   25.243737] CR2: 00007f4fb343c000 CR3: 000000010a5dc000 CR4: 0000000000750ef0
[   25.244145] PKRU: 55555554
[   25.244303] Call Trace:
[   25.244445]  <TASK>
[   25.244572]  ? die+0x36/0x90
[   25.244742]  ? do_trap+0xdd/0x100
[   25.244935]  ? __list_add_valid_or_report+0x78/0xa0
[   25.245211]  ? __list_add_valid_or_report+0x78/0xa0
[   25.245488]  ? do_error_trap+0x81/0x110
[   25.245710]  ? __list_add_valid_or_report+0x78/0xa0
[   25.245988]  ? exc_invalid_op+0x50/0x70
[   25.246211]  ? __list_add_valid_or_report+0x78/0xa0
[   25.246488]  ? asm_exc_invalid_op+0x1a/0x20
[   25.246737]  ? __list_add_valid_or_report+0x78/0xa0
[   25.247016]  swapcache_free_entries+0x1ec/0x240
[   25.247286]  free_swap_slot+0xcc/0xe0
[   25.247498]  put_swap_folio+0xf3/0x3b0
[   25.247720]  delete_from_swap_cache+0x68/0x90
[   25.247972]  folio_free_swap+0xd0/0x200
[   25.248201]  do_swap_page+0xd95/0x12d0
[   25.248418]  ? __entry_text_end+0x101e45/0x101e49
[   25.248695]  ? srso_alias_return_thunk+0x5/0xfbef5
[   25.248969]  ? srso_alias_return_thunk+0x5/0xfbef5
[   25.249246]  ? __pte_offset_map+0x18e/0x270
[   25.249490]  __handle_mm_fault+0x915/0xf80
[   25.249731]  ? srso_alias_return_thunk+0x5/0xfbef5
[   25.250010]  handle_mm_fault+0x1d1/0x400
[   25.250242]  do_user_addr_fault+0x16f/0x790
[   25.250485]  exc_page_fault+0x83/0x260
[   25.250706]  asm_exc_page_fault+0x26/0x30



Maybe what Hugh reported already. I'll try reverting your patches
to see if that fixes these issues.
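
For readers unfamiliar with the message above, here is a minimal user-space
sketch of the kind of inconsistency that a "prev->next should be next" report
describes. It is not a diagnosis of this particular crash. The struct and the
helper names (init_list_head, list_add_valid, list_add_tail_checked) are
stand-ins written for this illustration, and the three conditions reflect my
understanding of the list debug check rather than a copy of the kernel code.
The scenario assumed here is an entry that gets added to a second list without
first being removed from the first one.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static inline void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Conditions assumed for a valid insertion of "entry" between "prev"
 * and "next": the two neighbours must still point at each other, and
 * the entry must not be one of them.
 */
static inline bool list_add_valid(const struct list_head *entry,
				  const struct list_head *prev,
				  const struct list_head *next)
{
	return next->prev == prev && prev->next == next &&
	       entry != prev && entry != next;
}

/* Returns false instead of BUG()ing when the list looks corrupted. */
static inline bool list_add_tail_checked(struct list_head *entry,
					 struct list_head *head)
{
	struct list_head *prev = head->prev;
	struct list_head *next = head;

	if (!list_add_valid(entry, prev, next))
		return false;

	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
	return true;
}
```

In this model, adding one entry to list A, then to list B without deleting it
from A, leaves A's tail pointing at an entry whose next pointer now leads into
B. The next tail insertion on A then fails the prev->next check.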

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/2] mm: swap: swap cluster switch to double link list
  2024-06-18 10:01         ` Chris Li
@ 2024-06-19  7:51           ` Huang, Ying
  2024-06-19  9:03             ` Chris Li
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2024-06-19  7:51 UTC (permalink / raw)
  To: Chris Li
  Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

Chris Li <chrisl@kernel.org> writes:

> On Tue, Jun 18, 2024 at 12:56 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Sun, Jun 16, 2024 at 11:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Chris,
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:

[snip]

>> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> > index 3df75d62a835..cd9154a3e934 100644
>> >> > --- a/include/linux/swap.h
>> >> > +++ b/include/linux/swap.h
>> >> > @@ -242,23 +242,22 @@ enum {
>> >> >   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
>> >> >   * free clusters are organized into a list. We fetch an entry from the list to
>> >> >   * get a free cluster.
>> >> > - *
>> >> > - * The data field stores next cluster if the cluster is free or cluster usage
>> >> > - * counter otherwise. The flags field determines if a cluster is free. This is
>> >> > - * protected by swap_info_struct.lock.
>> >> >   */
>> >> >  struct swap_cluster_info {
>> >> >       spinlock_t lock;        /*
>> >> > -                              * Protect swap_cluster_info fields
>> >> > -                              * and swap_info_struct->swap_map
>> >> > +                              * Protect swap_cluster_info count and state
>> >>
>> >> Protect swap_cluster_info fields except 'list' ?
>> >
>> > I change it to protect the swap_cluster_info bitfields in the second patch.
>>
>> Although I still prefer my version, I will not insist on that.
>
> Sure, I actually don't have a strong preference about that. It is just comments.
>
>>
>> >>
>> >> > +                              * field and swap_info_struct->swap_map
>> >> >                                * elements correspond to the swap
>> >> >                                * cluster
>> >> >                                */
>> >> > -     unsigned int data:24;
>> >> > -     unsigned int flags:8;
>> >> > +     unsigned int count:12;
>> >> > +     unsigned int state:3;
>> >>
>> >> I still prefer normal data type over bit fields.  How about
>> >>
>> >>         u16 usage;
>> >>         u8  state;
>> >
>> > I don't mind the "count" rename to "usage". That is probably a better
>> > name. However I have another patch intended to add more bit fields in
>> > the cluster info struct. The second patch adds "order" and the later
>> > patch will add more. That is why I choose bitfield to be more condense
>> > with bits.
>>
>> We still have space for another "u8" for "order".  It appears trivial to
>> change it to bit fields when necessary in the future.
>
> We can, I don't see it necessary to change from bit field to u8 and
> back to bit field in the future. It is more of a personal preference
> issue.

I have to say that I don't think that it's just a personal preference.
IMO, if it's unnecessary, we shouldn't use bit fields.  You cannot
guarantee that your future changes will be merged in their current state.
So, I still think that it's better to avoid bit fields for now.
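
To make the two options concrete, here is a minimal user-space sketch of both
layouts. The spinlock is left out, struct list_head_model is a stand-in for
struct list_head, and the "order" member in the plain variant is hypothetical
(it only marks where the field from the second patch could go).
SWAPFILE_CLUSTER is assumed to be 512, matching the "offset/512" mentioned
elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SWAPFILE_CLUSTER 512	/* assumed pages per cluster */

/* Stand-in for the kernel's struct list_head. */
struct list_head_model {
	struct list_head_model *next, *prev;
};

/* Bit field layout as posted in the patch (lock omitted). */
struct cluster_bitfield {
	unsigned int count:12;
	unsigned int state:3;
	struct list_head_model list;
};

/* Plain type layout as suggested in review (lock omitted). */
struct cluster_plain {
	uint16_t usage;
	uint8_t state;
	uint8_t order;		/* hypothetical, for the second patch */
	struct list_head_model list;
};

/* A full cluster must be representable in the 12-bit counter. */
static_assert(SWAPFILE_CLUSTER < (1u << 12), "count:12 too narrow");

/* True if "pages" survives a round trip through the 12-bit field. */
static inline bool count_fits_bitfield(unsigned int pages)
{
	struct cluster_bitfield c = { 0 };

	c.count = pages;
	return c.count == pages;
}

/* True if "pages" survives a round trip through the u16 field. */
static inline bool count_fits_plain(unsigned int pages)
{
	struct cluster_plain c = { 0 };

	c.usage = (uint16_t)pages;
	return c.usage == pages;
}
```

On a typical 64-bit ABI both structs should come out the same size here,
because the list pointers force the alignment. The difference is then mostly
about how many more small fields fit before the struct grows.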

>> >>
>> >> And, how about use 'usage' instead of 'count'?  Personally I think that
>> >> it is more clear.  But I don't have strong opinions on this.
>> >>
>> >> > +     struct list_head list;  /* Protected by swap_info_struct->lock */
>> >> >  };
>> >> > -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> >> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>> >> > +
>> >> > +#define CLUSTER_STATE_FREE   1 /* This cluster is free */
>> >>

[snip]

>> >> >  /*
>> >> > @@ -481,21 +371,22 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >> >  */
>> >> >  static void swap_do_scheduled_discard(struct swap_info_struct *si)
>> >> >  {
>> >> > -     struct swap_cluster_info *info, *ci;
>> >> > +     struct swap_cluster_info *ci;
>> >> >       unsigned int idx;
>> >> >
>> >> > -     info = si->cluster_info;
>> >> > -
>> >> > -     while (!cluster_list_empty(&si->discard_clusters)) {
>> >> > -             idx = cluster_list_del_first(&si->discard_clusters, info);
>> >> > +     while (!list_empty(&si->discard_clusters)) {
>> >> > +             ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
>> >> > +             list_del(&ci->list);
>> >> > +             idx = ci - si->cluster_info;
>> >> >               spin_unlock(&si->lock);
>> >> >
>> >> >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
>> >> >                               SWAPFILE_CLUSTER);
>> >> >
>> >> >               spin_lock(&si->lock);
>> >> > -             ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
>> >> > -             __free_cluster(si, idx);
>> >> > +
>> >> > +             spin_lock(&ci->lock);
>> >>
>> >> Personally, I still prefer to use lock_cluster(), which is more readable
>> >> and matches unlock_cluster() below.
>> >
>> > lock_cluster() uses an index which is not matching unlock_cluster()
>> > which is using a pointer to cluster.
>>
>> lock_cluster()/unlock_cluster() are pair and fit original design
>> well.  They use different parameter because swap cluster is optional.
>>
>> > When you get the cluster from the list, you have a cluster pointer. I
>> > feel it is unnecessary to convert to index then back convert to
>> > cluster pointer inside lock_cluster(). I actually feel using indexes
>> > to refer to the cluster is error prone because we also have offset.
>>
>> I don't think so, it's common to use swap offset.
>
> Swap offset is not an issue, it is all over the place. The cluster
> index(offset/512) is the one I try to avoid.
> I have made some mistakes myself on offset vs index.

Yes.  That's not good, but it's hard to avoid too.  Can we make the
variable names more consistent?  index: cluster index, offset: swap
offset.

And, in fact, swap offset is the parameter of lock_cluster() instead of
cluster index.
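
As a sketch of the naming convention above (index: cluster index, offset: swap
offset), the conversions could be wrapped in small helpers so that each one is
written only once. This is a user-space model, not the kernel code: the
*_model structs and all four helper names are hypothetical, and
SWAPFILE_CLUSTER is assumed to be 512 as in the "offset/512" remark quoted
above.

```c
#include <assert.h>
#include <stddef.h>

#define SWAPFILE_CLUSTER 512	/* assumed pages per cluster */

/* Reduced stand-ins for swap_cluster_info and swap_info_struct. */
struct swap_cluster_info_model {
	unsigned int count;
};

struct swap_info_model {
	struct swap_cluster_info_model *cluster_info;
	size_t nr_clusters;
};

/* cluster pointer -> cluster index */
static inline size_t cluster_index(const struct swap_info_model *si,
				   const struct swap_cluster_info_model *ci)
{
	return (size_t)(ci - si->cluster_info);
}

/* swap offset -> cluster index */
static inline size_t offset_to_cluster_index(unsigned long offset)
{
	return offset / SWAPFILE_CLUSTER;
}

/* swap offset -> cluster pointer */
static inline struct swap_cluster_info_model *
offset_to_cluster(struct swap_info_model *si, unsigned long offset)
{
	size_t index = offset_to_cluster_index(offset);

	assert(index < si->nr_clusters);
	return &si->cluster_info[index];
}

/* cluster pointer -> swap offset of the first page in that cluster */
static inline unsigned long
cluster_offset(const struct swap_info_model *si,
	       const struct swap_cluster_info_model *ci)
{
	return (unsigned long)cluster_index(si, ci) * SWAPFILE_CLUSTER;
}
```

With helpers along these lines, callers hold either a swap offset or a cluster
pointer, and a raw cluster index only shows up inside the helpers.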

>> >
>> >>
>> >> > +             __free_cluster(si, ci);
>> >> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> >> >                               0, SWAPFILE_CLUSTER);
>> >> >               unlock_cluster(ci);
>> >> > @@ -521,20 +412,19 @@ static void swap_users_ref_free(struct percpu_ref *ref)
>> >> >       complete(&si->comp);
>> >> >  }
>> >> >

[snip]

>> >> > @@ -611,10 +497,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> >> >  {
>> >> >       struct percpu_cluster *percpu_cluster;
>> >> >       bool conflict;
>> >> > -
>> >>
>> >> Usually we use one blank line after local variable declaration.
>> > Ack.
>> >
>> >>
>> >> > +     struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> >> >       offset /= SWAPFILE_CLUSTER;
>> >> > -     conflict = !cluster_list_empty(&si->free_clusters) &&
>> >> > -             offset != cluster_list_first(&si->free_clusters) &&
>> >> > +     conflict = !list_empty(&si->free_clusters) &&
>> >> > +             offset !=  first - si->cluster_info &&
>> >> >               cluster_is_free(&si->cluster_info[offset]);
>> >> >
>> >> >       if (!conflict)
>> >> > @@ -655,10 +541,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> >> >       cluster = this_cpu_ptr(si->percpu_cluster);
>> >> >       tmp = cluster->next[order];
>> >> >       if (tmp == SWAP_NEXT_INVALID) {
>> >> > -             if (!cluster_list_empty(&si->free_clusters)) {
>> >> > -                     tmp = cluster_next(&si->free_clusters.head) *
>> >> > -                                     SWAPFILE_CLUSTER;
>> >> > -             } else if (!cluster_list_empty(&si->discard_clusters)) {
>> >> > +             if (!list_empty(&si->free_clusters)) {
>> >> > +                     ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> >> > +                     list_del(&ci->list);
>> >>
>> >> The free cluster is deleted from si->free_clusters now.  But later you
>> >> will call scan_swap_map_ssd_cluster_conflict() and may abandon the
>> >> cluster.  And in alloc_cluster() later, it may be deleted again.
>> >
>> > Yes, that is a bug. Thanks for catching that.
>> >
>> >>
>> >> > +                     spin_lock(&ci->lock);
>> >> > +                     ci->state = CLUSTER_STATE_PER_CPU;
>> >>
>> >> Need to change ci->state when move a cluster off the percpu_cluster.
>> >
>> > In the next patch. This patch does not use the off state yet.
>>
>> But that is confusing to use wrong state name, the really meaning is
>> something like CLUSTER_STATE_NON_FREE.  But as I suggested above, we can
>
> It can be FREE and on the per cpu pointer as well. That is the tricky part.
> It can happen on the current code as well.

cluster_set_count_flag(0, 0) is called in alloc_cluster().  So, it's not
an issue in the current code.  If you need more, that shouldn't be done
in this patch.

>> keep swap_cluster_info.flags and CLUSTER_FLAG_FREE in this patch.
>
> Maybe. Will consider that.
>
>>
>> >>
>> >> > +                     spin_unlock(&ci->lock);
>> >> > +                     tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>> >> > +             } else if (!list_empty(&si->discard_clusters)) {
>> >> >                       /*
>> >> >                        * we don't have free cluster but have some clusters in
>> >> >                        * discarding, do discard now and reclaim them, then
>> >> > @@ -1062,8 +952,8 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >> >
>> >> >       ci = lock_cluster(si, offset);
>> >> >       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>> >> > -     cluster_set_count_flag(ci, 0, 0);
>> >> > -     free_cluster(si, idx);
>> >> > +     ci->count = 0;
>> >> > +     free_cluster(si, ci);
>> >> >       unlock_cluster(ci);
>> >> >       swap_range_free(si, offset, SWAPFILE_CLUSTER);
>> >> >  }

[snip]

--
Best Regards,
Huang, Ying



* Re: [PATCH v2 1/2] mm: swap: swap cluster switch to double link list
  2024-06-19  7:51           ` Huang, Ying
@ 2024-06-19  9:03             ` Chris Li
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Li @ 2024-06-19  9:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Kairui Song, Ryan Roberts, Kalesh Singh,
	linux-kernel, linux-mm, Barry Song

On Wed, Jun 19, 2024 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Tue, Jun 18, 2024 at 12:56 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > On Sun, Jun 16, 2024 at 11:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Chris,
> >> >>
> >> >> Chris Li <chrisl@kernel.org> writes:
>
> [snip]
>
> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> > index 3df75d62a835..cd9154a3e934 100644
> >> >> > --- a/include/linux/swap.h
> >> >> > +++ b/include/linux/swap.h
> >> >> > @@ -242,23 +242,22 @@ enum {
> >> >> >   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> >> >> >   * free clusters are organized into a list. We fetch an entry from the list to
> >> >> >   * get a free cluster.
> >> >> > - *
> >> >> > - * The data field stores next cluster if the cluster is free or cluster usage
> >> >> > - * counter otherwise. The flags field determines if a cluster is free. This is
> >> >> > - * protected by swap_info_struct.lock.
> >> >> >   */
> >> >> >  struct swap_cluster_info {
> >> >> >       spinlock_t lock;        /*
> >> >> > -                              * Protect swap_cluster_info fields
> >> >> > -                              * and swap_info_struct->swap_map
> >> >> > +                              * Protect swap_cluster_info count and state
> >> >>
> >> >> Protect swap_cluster_info fields except 'list' ?
> >> >
> >> > I change it to protect the swap_cluster_info bitfields in the second patch.
> >>
> >> Although I still prefer my version, I will not insist on that.
> >
> > Sure, I actually don't have a strong preference about that. It is just comments.
> >
> >>
> >> >>
> >> >> > +                              * field and swap_info_struct->swap_map
> >> >> >                                * elements correspond to the swap
> >> >> >                                * cluster
> >> >> >                                */
> >> >> > -     unsigned int data:24;
> >> >> > -     unsigned int flags:8;
> >> >> > +     unsigned int count:12;
> >> >> > +     unsigned int state:3;
> >> >>
> >> >> I still prefer normal data type over bit fields.  How about
> >> >>
> >> >>         u16 usage;
> >> >>         u8  state;
> >> >
> >> > I don't mind the "count" rename to "usage". That is probably a better
> >> > name. However I have another patch intended to add more bit fields in
> >> > the cluster info struct. The second patch adds "order" and the later
> >> > patch will add more. That is why I choose bitfield to be more condense
> >> > with bits.
> >>
> >> We still have space for another "u8" for "order".  It appears trivial to
> >> change it to bit fields when necessary in the future.
> >
> > We can, I don't see it necessary to change from bit field to u8 and
> > back to bit field in the future. It is more of a personal preference
> > issue.
>
> I have to say that I don't think that it's just a personal preference.
> IMO, if it's unnecessary, we shouldn't use bit fields.  You cannot
> guarantee that your future changes will be merged in its current state.
> So, I still think that it's better to avoid bit fields for now.

That is surprising to hear; I am not depending on any physical hardware
bit layout.
Anyway, it is not a big deal for me. I changed it to u16/u8.
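
For reference, the two layouts under discussion can be compared with a
small userspace sketch.  This is only an illustration: the lock and list
fields are left out, the struct and function names are hypothetical, the
state value 2 is arbitrary, and uint16_t/uint8_t stand in for the kernel's
u16/u8.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-field variant, as in the posted patch (lock and list omitted). */
struct cluster_bits {
	unsigned int count:12;
	unsigned int state:3;
};

/* Plain-type variant from the review; "order" is the extra u8 mentioned. */
struct cluster_plain {
	uint16_t usage;
	uint8_t state;
	uint8_t order;
};

/* Both variants can hold a full cluster's count of 512 entries. */
static int holds_full_cluster(void)
{
	struct cluster_bits b = { .count = 512, .state = 2 };
	struct cluster_plain p = { .usage = 512, .state = 2, .order = 0 };

	return b.count == 512 && p.usage == 512 && b.state == p.state;
}
```

On common ABIs both structs occupy four bytes, so the plain types cost
nothing extra as long as no more fields are needed.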

> >> > When you get the cluster from the list, you have a cluster pointer. I
> >> > feel it is unnecessary to convert to index then back convert to
> >> > cluster pointer inside lock_cluster(). I actually feel using indexes
> >> > to refer to the cluster is error prone because we also have offset.
> >>
> >> I don't think so, it's common to use swap offset.
> >
> > Swap offset is not an issue, it is all over the place. The cluster
> > index(offset/512) is the one I try to avoid.
> > I have made some mistakes myself on offset vs index.
>
> Yes.  That's not good, but it's hard to avoid too.  Can we make the
> variable names more consistent?  index: cluster index, offset: swap
> offset.
>
> And, in fact, swap offset is the parameter of lock_cluster() instead of
> cluster index.

Right, when you get the cluster pointer from the list, you can't pass it
directly to lock_cluster().
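
A minimal userspace sketch of that situation follows.  It is not the
kernel's list.h: the list helpers, struct names, and the four-entry array
are hypothetical stand-ins.  It shows that taking the first entry of the
free list yields a cluster pointer, and that the cluster index is
recovered by pointer subtraction against the cluster array, as the patch
does with "ci - si->cluster_info".

```c
#include <assert.h>
#include <stddef.h>

/* Tiny circular doubly linked list, modelled loosely on the kernel's. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_append(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_remove(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

struct cluster { struct list_head list; unsigned int count; };

struct swap_dev {
	struct cluster clusters[4];	/* stand-in for si->cluster_info */
	struct list_head free_clusters;
};

/* Peek at the first free cluster without unlinking it; NULL if none. */
static struct cluster *first_free(struct swap_dev *si)
{
	if (list_is_empty(&si->free_clusters))
		return NULL;
	return (struct cluster *)((char *)si->free_clusters.next -
				  offsetof(struct cluster, list));
}

/* Recover the cluster index from the pointer, only where one is needed. */
static ptrdiff_t cluster_index(const struct swap_dev *si,
			       const struct cluster *ci)
{
	return ci - si->clusters;
}
```

Since the caller already holds the pointer, it can take the per-cluster
lock through that pointer; the index is only computed for things like the
swap_map offset.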

>
> >> >
> >> >>
> >> >> > +             __free_cluster(si, ci);
> >> >> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >> >> >                               0, SWAPFILE_CLUSTER);
> >> >> >               unlock_cluster(ci);
> >> >> > @@ -521,20 +412,19 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> >> >> >       complete(&si->comp);
> >> >> >  }
> >> >> >
>
> [snip]
>
> >> >> > @@ -611,10 +497,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> >> >> >  {
> >> >> >       struct percpu_cluster *percpu_cluster;
> >> >> >       bool conflict;
> >> >> > -
> >> >>
> >> >> Usually we use one blank line after local variable declaration.
> >> > Ack.
> >> >
> >> >>
> >> >> > +     struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >> >> >       offset /= SWAPFILE_CLUSTER;
> >> >> > -     conflict = !cluster_list_empty(&si->free_clusters) &&
> >> >> > -             offset != cluster_list_first(&si->free_clusters) &&
> >> >> > +     conflict = !list_empty(&si->free_clusters) &&
> >> >> > +             offset !=  first - si->cluster_info &&
> >> >> >               cluster_is_free(&si->cluster_info[offset]);
> >> >> >
> >> >> >       if (!conflict)
> >> >> > @@ -655,10 +541,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >> >> >       cluster = this_cpu_ptr(si->percpu_cluster);
> >> >> >       tmp = cluster->next[order];
> >> >> >       if (tmp == SWAP_NEXT_INVALID) {
> >> >> > -             if (!cluster_list_empty(&si->free_clusters)) {
> >> >> > -                     tmp = cluster_next(&si->free_clusters.head) *
> >> >> > -                                     SWAPFILE_CLUSTER;
> >> >> > -             } else if (!cluster_list_empty(&si->discard_clusters)) {
> >> >> > +             if (!list_empty(&si->free_clusters)) {
> >> >> > +                     ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >> >> > +                     list_del(&ci->list);
> >> >>
> >> >> The free cluster is deleted from si->free_clusters now.  But later you
> >> >> will call scan_swap_map_ssd_cluster_conflict() and may abandon the
> >> >> cluster.  And in alloc_cluster() later, it may be deleted again.
> >> >
> >> > Yes, that is a bug. Thanks for catching that.
> >> >
> >> >>
> >> >> > +                     spin_lock(&ci->lock);
> >> >> > +                     ci->state = CLUSTER_STATE_PER_CPU;
> >> >>
> >> >> Need to change ci->state when move a cluster off the percpu_cluster.
> >> >
> >> > In the next patch. This patch does not use the off state yet.
> >>
> >> But that is confusing to use wrong state name, the really meaning is
> >> something like CLUSTER_STATE_NON_FREE.  But as I suggested above, we can
> >
> > It can be FREE and on the per cpu pointer as well. That is the tricky part.
> > It can happen on the current code as well.
>
> cluster_set_count_flag(0, 0) is called in alloc_cluster().  So, it's not
> an issue in the current code.  If you need more, that shouldn't be done
> in this patch.

I will revert to using the flags, as in V1.

Chris



Thread overview: 22+ messages
2024-06-14 23:48 [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-06-14 23:48 ` [PATCH v2 1/2] mm: swap: swap cluster switch to double link list Chris Li
2024-06-17  6:19   ` Huang, Ying
2024-06-18  5:06     ` Chris Li
2024-06-18  7:54       ` Huang, Ying
2024-06-18 10:01         ` Chris Li
2024-06-19  7:51           ` Huang, Ying
2024-06-19  9:03             ` Chris Li
2024-06-14 23:48 ` [PATCH v2 2/2] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
2024-06-15  1:06 ` [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order Andrew Morton
2024-06-15  2:51   ` Chris Li
2024-06-15  2:59     ` Andrew Morton
2024-06-15  8:47       ` Barry Song
2024-06-17  3:00         ` Huang, Ying
2024-06-17  3:12           ` Barry Song
2024-06-17  3:29             ` Barry Song
2024-06-17  6:48         ` Huang, Ying
2024-06-17  7:08           ` Barry Song
2024-06-17 18:34         ` Chris Li
2024-06-17 23:00           ` Hugh Dickins
2024-06-17 23:47             ` Chris Li
2024-06-18 13:08 ` David Hildenbrand
