linux-mm.kvack.org archive mirror
* [PATCH] mm/swapfile.c: select the swap device with default priority round robin
@ 2025-09-24  9:17 Baoquan He
  2025-09-24 10:23 ` Baoquan He
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Baoquan He @ 2025-09-24  9:17 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, chrisl, kasong, baohua, shikemeng, nphamcs, Baoquan He

Currently, on a system with multiple swap devices, swap allocation
selects a swap device according to priority. The swap device with the
highest priority is chosen for allocation first.

People can specify a priority from 0 to 32767 when swapping on a swap
device; otherwise the system assigns one by default, starting from -2
and counting downwards. Meanwhile, on a NUMA system, a swap device
carrying a node_id is considered first on the NUMA node with that id.

In the current code, an array of plists, swap_avail_heads[nid], is used
to organize swap devices per NUMA node. For each NUMA node, there is a
plist holding all swap devices. The 'prio' value in the plist is the
negated value of the device's priority because plists are sorted from
low to high. The swap device owning a node_id is promoted to the front
position on that NUMA node, while the other swap devices are kept in
order of their default priority.

E.g. on a system with 8 NUMA nodes, with 4 zram partitions set up as
swap devices.

Current behaviour:
their priorities will be (note that -1 is skipped):
NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  16G   0B   -2
/dev/zram1 partition  16G   0B   -3
/dev/zram2 partition  16G   0B   -4
/dev/zram3 partition  16G   0B   -5

And their positions in the 8 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:3     prio:4     prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
zram1   -> zram0   -> zram2   -> zram3
prio:1     prio:2     prio:4     prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
zram2   -> zram0   -> zram1   -> zram3
prio:1     prio:2     prio:3     prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
zram3   -> zram0   -> zram1   -> zram2
prio:1     prio:2     prio:3     prio:4
swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:2     prio:3     prio:4     prio:5

The per-node adjustment for a swap device carrying a node_id was
intended to reduce lock contention on a single swap device by letting
different nodes use different swap devices. However, the adjustment is
very coarse-grained. On a given node, the swap device sharing the
node's id is always selected first by that node's CPUs until it is
exhausted, then the next one. And on nodes where no swap device shares
the node id, the swap device with priority '-2' is selected first until
exhausted, then the one with priority '-3', and so on.

This is the swapon output while the high-pressure vm-scalability test
is running. It clearly shows that zram0 is heavily used until
exhausted.
===================================
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 15.7G   -2
/dev/zram1 partition  16G  3.4G   -3
/dev/zram2 partition  16G  3.4G   -4
/dev/zram3 partition  16G  2.6G   -5

This is unreasonable because swap devices are assumed to have similar
access speed when no priority is specified at swapon time. It is unfair,
and makes no sense, that a swap device gets a higher priority merely
because it was swapped on earlier.

So this change selects swap devices round robin when they have the
default priority. In the code, the plist array swap_avail_heads[nid] is
replaced with a single plist, swap_avail_head. Any device without a
specified priority gets the same default priority '-1'. Swap devices
with a specified priority are still put foremost; that behavior is not
affected. If you care about differing access speeds, use 'swapon -p xx'
to assign priorities to your swap devices.

New behaviour:

swap_avail_list: /* one global available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:1     prio:1     prio:1

This is the swapon output while the high-pressure vm-scalability test
is running; all devices are selected round robin:
=======================================
[root@hp-dl385g10-03 linux]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 12.6G   -1
/dev/zram1 partition  16G 12.6G   -1
/dev/zram2 partition  16G 12.6G   -1
/dev/zram3 partition  16G 12.6G   -1

With the change, we see about an 18% performance improvement, as below:

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
                           Before:          After:
System time:               637.92 s         526.74 s
Sum Throughput:            3546.56 MB/s     4207.56 MB/s
Single process Throughput: 114.40 MB/s      135.72 MB/s
free latency:              10138455.99 us   6810119.01 us

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
---
 include/linux/swap.h | 11 +-----
 mm/swapfile.c        | 94 +++++++-------------------------------------
 2 files changed, 16 insertions(+), 89 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3473e4247ca3..f72c8e5e0635 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -337,16 +337,7 @@ struct swap_info_struct {
 	struct work_struct discard_work; /* discard worker */
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
-	struct plist_node avail_lists[]; /*
-					   * entries in swap_avail_heads, one
-					   * entry per node.
-					   * Must be last as the number of the
-					   * array is nr_node_ids, which is not
-					   * a fixed value so have to allocate
-					   * dynamically.
-					   * And it has to be an array so that
-					   * plist_for_each_* can work.
-					   */
+	struct plist_node avail_list;   /* entry in swap_avail_head */
 };
 
 static inline swp_entry_t page_swap_entry(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b4f3cc712580..d8a54e5af16d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-static int least_priority = -1;
+#define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
  * is held and the locking order requires swap_lock to be taken
  * before any swap_info_struct->lock.
  */
-static struct plist_head *swap_avail_heads;
+static PLIST_HEAD(swap_avail_head);
 static DEFINE_SPINLOCK(swap_avail_lock);
 
 static struct swap_info_struct *swap_info[MAX_SWAPFILES];
@@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
 static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 {
-	int nid;
 	unsigned long pages;
 
 	spin_lock(&swap_avail_lock);
@@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 		 * swap_avail_lock, to ensure the result can be seen by
 		 * add_to_avail_list.
 		 */
-		lockdep_assert_held(&si->lock);
+		//lockdep_assert_held(&si->lock);
 		si->flags &= ~SWP_WRITEOK;
 		atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
 	} else {
@@ -1024,8 +1023,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 			goto skip;
 	}
 
-	for_each_node(nid)
-		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
+	plist_del(&si->avail_list, &swap_avail_head);
 
 skip:
 	spin_unlock(&swap_avail_lock);
@@ -1034,7 +1032,6 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 /* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
 static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 {
-	int nid;
 	long val;
 	unsigned long pages;
 
@@ -1067,8 +1064,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 			goto skip;
 	}
 
-	for_each_node(nid)
-		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
+	plist_add(&si->avail_list, &swap_avail_head);
 
 skip:
 	spin_unlock(&swap_avail_lock);
@@ -1211,16 +1207,14 @@ static bool swap_alloc_fast(swp_entry_t *entry,
 static bool swap_alloc_slow(swp_entry_t *entry,
 			    int order)
 {
-	int node;
 	unsigned long offset;
 	struct swap_info_struct *si, *next;
 
-	node = numa_node_id();
 	spin_lock(&swap_avail_lock);
 start_over:
-	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
+	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
 		/* Rotate the device and switch to a new cluster */
-		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
+		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
 			offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
@@ -1245,7 +1239,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 		 * still in the swap_avail_head list then try it, otherwise
 		 * start over if we have not gotten any slots.
 		 */
-		if (plist_node_empty(&next->avail_lists[node]))
+		if (plist_node_empty(&si->avail_list))
 			goto start_over;
 	}
 	spin_unlock(&swap_avail_lock);
@@ -2535,44 +2529,18 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 	return generic_swapfile_activate(sis, swap_file, span);
 }
 
-static int swap_node(struct swap_info_struct *si)
-{
-	struct block_device *bdev;
-
-	if (si->bdev)
-		bdev = si->bdev;
-	else
-		bdev = si->swap_file->f_inode->i_sb->s_bdev;
-
-	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
-}
-
 static void setup_swap_info(struct swap_info_struct *si, int prio,
 			    unsigned char *swap_map,
 			    struct swap_cluster_info *cluster_info,
 			    unsigned long *zeromap)
 {
-	int i;
-
-	if (prio >= 0)
-		si->prio = prio;
-	else
-		si->prio = --least_priority;
+	si->prio = prio;
 	/*
 	 * the plist prio is negated because plist ordering is
 	 * low-to-high, while swap ordering is high-to-low
 	 */
 	si->list.prio = -si->prio;
-	for_each_node(i) {
-		if (si->prio >= 0)
-			si->avail_lists[i].prio = -si->prio;
-		else {
-			if (swap_node(si) == i)
-				si->avail_lists[i].prio = 1;
-			else
-				si->avail_lists[i].prio = -si->prio;
-		}
-	}
+	si->avail_list.prio = -si->prio;
 	si->swap_map = swap_map;
 	si->cluster_info = cluster_info;
 	si->zeromap = zeromap;
@@ -2721,20 +2689,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	}
 	spin_lock(&p->lock);
 	del_from_avail_list(p, true);
-	if (p->prio < 0) {
-		struct swap_info_struct *si = p;
-		int nid;
-
-		plist_for_each_entry_continue(si, &swap_active_head, list) {
-			si->prio++;
-			si->list.prio--;
-			for_each_node(nid) {
-				if (si->avail_lists[nid].prio != 1)
-					si->avail_lists[nid].prio--;
-			}
-		}
-		least_priority++;
-	}
 	plist_del(&p->list, &swap_active_head);
 	atomic_long_sub(p->pages, &nr_swap_pages);
 	total_swap_pages -= p->pages;
@@ -2972,9 +2926,8 @@ static struct swap_info_struct *alloc_swap_info(void)
 	struct swap_info_struct *p;
 	struct swap_info_struct *defer = NULL;
 	unsigned int type;
-	int i;
 
-	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
+	p = kvzalloc(sizeof(struct swap_info_struct), GFP_KERNEL);
 	if (!p)
 		return ERR_PTR(-ENOMEM);
 
@@ -3013,8 +2966,7 @@ static struct swap_info_struct *alloc_swap_info(void)
 	}
 	p->swap_extent_root = RB_ROOT;
 	plist_node_init(&p->list, 0);
-	for_each_node(i)
-		plist_node_init(&p->avail_lists[i], 0);
+	plist_node_init(&p->avail_list, 0);
 	p->flags = SWP_USED;
 	spin_unlock(&swap_lock);
 	if (defer) {
@@ -3282,9 +3234,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	if (!swap_avail_heads)
-		return -ENOMEM;
-
 	si = alloc_swap_info();
 	if (IS_ERR(si))
 		return PTR_ERR(si);
@@ -3465,7 +3414,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	}
 
 	mutex_lock(&swapon_mutex);
-	prio = -1;
+	prio = DEF_SWAP_PRIO;
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio = swap_flags & SWAP_FLAG_PRIO_MASK;
 	enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
@@ -3904,7 +3853,6 @@ static bool __has_usable_swap(void)
 void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
 {
 	struct swap_info_struct *si, *next;
-	int nid = folio_nid(folio);
 
 	if (!(gfp & __GFP_IO))
 		return;
@@ -3923,8 +3871,8 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
 		return;
 
 	spin_lock(&swap_avail_lock);
-	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
-				  avail_lists[nid]) {
+	plist_for_each_entry_safe(si, next, &swap_avail_head,
+				  avail_list) {
 		if (si->bdev) {
 			blkcg_schedule_throttle(si->bdev->bd_disk, true);
 			break;
@@ -3936,18 +3884,6 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
 
 static int __init swapfile_init(void)
 {
-	int nid;
-
-	swap_avail_heads = kmalloc_array(nr_node_ids, sizeof(struct plist_head),
-					 GFP_KERNEL);
-	if (!swap_avail_heads) {
-		pr_emerg("Not enough memory for swap heads, swap is disabled\n");
-		return -ENOMEM;
-	}
-
-	for_each_node(nid)
-		plist_head_init(&swap_avail_heads[nid]);
-
 	swapfile_maximum_size = arch_max_swapfile_size();
 
 #ifdef CONFIG_MIGRATION
-- 
2.41.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24  9:17 [PATCH] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
@ 2025-09-24 10:23 ` Baoquan He
  2025-09-24 15:41 ` YoungJun Park
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-09-24 10:23 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, chrisl, kasong, baohua, shikemeng, nphamcs

On 09/24/25 at 05:17pm, Baoquan He wrote:
> Currently, on a system with multiple swap devices, swap allocation
> selects a swap device according to priority. The swap device with the
> highest priority is chosen for allocation first.
> 
> People can specify a priority from 0 to 32767 when swapping on a swap
> device; otherwise the system assigns one by default, starting from -2
> and counting downwards. Meanwhile, on a NUMA system, a swap device
> carrying a node_id is considered first on the NUMA node with that id.
> 
> In the current code, an array of plists, swap_avail_heads[nid], is used
> to organize swap devices per NUMA node. For each NUMA node, there is a
> plist holding all swap devices. The 'prio' value in the plist is the
> negated value of the device's priority because plists are sorted from
> low to high. The swap device owning a node_id is promoted to the front
> position on that NUMA node, while the other swap devices are kept in
> order of their default priority.
> 
> E.g. on a system with 8 NUMA nodes, with 4 zram partitions set up as
> swap devices.
> 
> Current behaviour:
> their priorities will be (note that -1 is skipped):
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition  16G   0B   -2
> /dev/zram1 partition  16G   0B   -3
> /dev/zram2 partition  16G   0B   -4
> /dev/zram3 partition  16G   0B   -5
> 
> And their positions in the 8 swap_avail_lists[nid] will be:
> swap_avail_lists[0]: /* node 0's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:3     prio:4     prio:5
> swap_avail_lists[1]: /* node 1's available swap device list */
> zram1   -> zram0   -> zram2   -> zram3
> prio:1     prio:2     prio:4     prio:5
> swap_avail_lists[2]: /* node 2's available swap device list */
> zram2   -> zram0   -> zram1   -> zram3
> prio:1     prio:2     prio:3     prio:5
> swap_avail_lists[3]: /* node 3's available swap device list */
> zram3   -> zram0   -> zram1   -> zram2
> prio:1     prio:2     prio:3     prio:4
> swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:2     prio:3     prio:4     prio:5

By the way, when testing, I hacked the zram kernel code to assign
device_id to zram->disk->node_id to emulate disks with a node_id.

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8acad3cc6e6e..b0bb6531b029 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2719,6 +2719,7 @@ static int zram_add(void)
 	zram->disk->flags |= GENHD_FL_NO_PART;
 	zram->disk->fops = &zram_devops;
 	zram->disk->private_data = zram;
+	zram->disk->node_id = device_id % nr_node_ids;
 	snprintf(zram->disk->disk_name, 16, "zram%d", device_id);
 	atomic_set(&zram->pp_in_progress, 0);
 	zram_comp_params_reset(zram);




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24  9:17 [PATCH] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
  2025-09-24 10:23 ` Baoquan He
@ 2025-09-24 15:41 ` YoungJun Park
  2025-09-25  2:24   ` Baoquan He
  2025-09-24 15:52 ` YoungJun Park
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: YoungJun Park @ 2025-09-24 15:41 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-mm, akpm, chrisl, kasong, baohua, shikemeng, nphamcs

> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  include/linux/swap.h | 11 +-----
>  mm/swapfile.c        | 94 +++++++-------------------------------------
>  2 files changed, 16 insertions(+), 89 deletions(-)
Hi Baoquan He,

Thanks for working on this patch.  
Shouldn’t `Documentation/admin-guide/mm/swap_numa.rst` 
also be removed as part of this change?

Best regards,
Youngjun Park



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24  9:17 [PATCH] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
  2025-09-24 10:23 ` Baoquan He
  2025-09-24 15:41 ` YoungJun Park
@ 2025-09-24 15:52 ` YoungJun Park
  2025-09-25  4:10   ` Baoquan He
  2025-09-24 15:54 ` Chris Li
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: YoungJun Park @ 2025-09-24 15:52 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-mm, akpm, chrisl, kasong, baohua, shikemeng, nphamcs

On Wed, Sep 24, 2025 at 05:17:46PM +0800, Baoquan He wrote:
> -		lockdep_assert_held(&si->lock);
> +		//lockdep_assert_held(&si->lock);
>  		si->flags &= ~SWP_WRITEOK;

It seems that the commented line `//lockdep_assert_held(&si->lock);` 
should be removed.

Best regards,  
Youngjun Park



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24  9:17 [PATCH] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
                   ` (2 preceding siblings ...)
  2025-09-24 15:52 ` YoungJun Park
@ 2025-09-24 15:54 ` Chris Li
  2025-09-24 16:06   ` Chris Li
  2025-09-25  1:55   ` Baoquan He
  2025-09-24 16:34 ` YoungJun Park
  2025-09-25  4:36 ` Kairui Song
  5 siblings, 2 replies; 20+ messages in thread
From: Chris Li @ 2025-09-24 15:54 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs, YoungJun Park

Hi Baoquan,

Very exciting numbers. I have always suspected the per-node priority
was not contributing much in the new swap allocator world. I did not
expect it to have a negative contribution.

On Wed, Sep 24, 2025 at 2:18 AM Baoquan He <bhe@redhat.com> wrote:
>
> Currently, on a system with multiple swap devices, swap allocation
> selects a swap device according to priority. The swap device with the
> highest priority is chosen for allocation first.
>
> People can specify a priority from 0 to 32767 when swapping on a swap
> device; otherwise the system assigns one by default, starting from -2
> and counting downwards. Meanwhile, on a NUMA system, a swap device
> carrying a node_id is considered first on the NUMA node with that id.

That behavior was introduced by a2468cc9bfdf ("swap: choose swap
device according to numa node").
You are effectively reverting that patch and the follow-up fix
patches on top of it.
The commit message, or maybe the title, should reflect the reversion nature.

If you did more than the simple revert plus fix up, please document
what additional change you make in this patch.

>
> In the current code, an array of plists, swap_avail_heads[nid], is used
> to organize swap devices per NUMA node. For each NUMA node, there is a
> plist holding all swap devices. The 'prio' value in the plist is the
> negated value of the device's priority because plists are sorted from
> low to high. The swap device owning a node_id is promoted to the front
> position on that NUMA node, while the other swap devices are kept in
> order of their default priority.
>

The original patch that introduced it used an SSD as the benchmark.
Here you are using patched zram as the benchmark. You want to explain
in a little more detail why you chose a different test method, e.g. you
don't have a machine with an SSD device as a raw partition to do the
original test. Compressed-RAM-based swap devices, zswap or zram, are
used a lot more in data center server and Android workload
environments, maybe even in some Linux workstation distros as well.

You can also invite others who do have spare SSD drives to test an
SSD as a swap device, maybe with some setup instructions on how to
repeat your test on their machines with multiple SSD drives, and how
to compare the results with and without your reversion patch.

> E.g. on a system with 8 NUMA nodes, with 4 zram partitions set up as
> swap devices.

You want to make it clear up front that you are using a patched zram
to simulate the per-node swap device behavior. Native zram does not
have that.

>
> Current behaviour:
> their priorities will be (note that -1 is skipped):
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition  16G   0B   -2
> /dev/zram1 partition  16G   0B   -3
> /dev/zram2 partition  16G   0B   -4
> /dev/zram3 partition  16G   0B   -5
>
> And their positions in the 8 swap_avail_lists[nid] will be:
> swap_avail_lists[0]: /* node 0's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:3     prio:4     prio:5
> swap_avail_lists[1]: /* node 1's available swap device list */
> zram1   -> zram0   -> zram2   -> zram3
> prio:1     prio:2     prio:4     prio:5
> swap_avail_lists[2]: /* node 2's available swap device list */
> zram2   -> zram0   -> zram1   -> zram3
> prio:1     prio:2     prio:3     prio:5
> swap_avail_lists[3]: /* node 3's available swap device list */
> zram3   -> zram0   -> zram1   -> zram2
> prio:1     prio:2     prio:3     prio:4
> swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:2     prio:3     prio:4     prio:5
>
> The per-node adjustment for a swap device carrying a node_id was
> intended to reduce lock contention on a single swap device by letting
> different nodes use different swap devices. However, the adjustment is
> very coarse-grained. On a given node, the swap device sharing the
> node's id is always selected first by that node's CPUs until it is
> exhausted, then the next one. And on nodes where no swap device shares
> the node id, the swap device with priority '-2' is selected first until
> exhausted, then the one with priority '-3', and so on.
>
> This is the swapon output while the high-pressure vm-scalability test
> is running. It clearly shows that zram0 is heavily used until
> exhausted.

Any tips on how others can repeat your high-pressure vm-scalability
test, especially for someone who has multiple SSD drives as test swap
devices? Some test script setup would be nice. You can post the
instructions in the same email thread as a separate email; it does not
have to be in the commit message.

> ===================================
> [root@hp-dl385g10-03 ~]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 15.7G   -2
> /dev/zram1 partition  16G  3.4G   -3
> /dev/zram2 partition  16G  3.4G   -4
> /dev/zram3 partition  16G  2.6G   -5
>
> This is unreasonable because swap devices are assumed to have similar
> access speed when no priority is specified at swapon time. It is unfair,
> and makes no sense, that a swap device gets a higher priority merely
> because it was swapped on earlier.
>
> So this change selects swap devices round robin when they have the
> default priority. In the code, the plist array swap_avail_heads[nid] is
> replaced with a single plist, swap_avail_head. Any device without a
> specified priority gets the same default priority '-1'. Swap devices
> with a specified priority are still put foremost; that behavior is not
> affected. If you care about differing access speeds, use 'swapon -p xx'
> to assign priorities to your swap devices.
>
> New behaviour:
>
> swap_avail_list: /* one global available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:1     prio:1     prio:1
>
> This is the swapon output while the high-pressure vm-scalability test
> is running; all devices are selected round robin:
> =======================================
> [root@hp-dl385g10-03 linux]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 12.6G   -1
> /dev/zram1 partition  16G 12.6G   -1
> /dev/zram2 partition  16G 12.6G   -1
> /dev/zram3 partition  16G 12.6G   -1
>
> With the change, we see about an 18% performance improvement, as below:
>
> vm-scalability test:
> ====================
> Test with:
> usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
>                            Before:          After:
> System time:               637.92 s         526.74 s
You can clarify here that lower is better.
> Sum Throughput:            3546.56 MB/s     4207.56 MB/s

Higher is better. Also, a percentage number can be useful here, e.g.
that is a +18.6% improvement after reverting to round robin. A
huge difference!

> Single process Throughput: 114.40 MB/s      135.72 MB/s
Higher is better.
> free latency:              10138455.99 us   6810119.01 us
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  include/linux/swap.h | 11 +-----
>  mm/swapfile.c        | 94 +++++++-------------------------------------

Very nice patch stats! Less code and it runs faster. What more can we ask for?


>  2 files changed, 16 insertions(+), 89 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3473e4247ca3..f72c8e5e0635 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -337,16 +337,7 @@ struct swap_info_struct {
>         struct work_struct discard_work; /* discard worker */
>         struct work_struct reclaim_work; /* reclaim worker */
>         struct list_head discard_clusters; /* discard clusters list */
> -       struct plist_node avail_lists[]; /*
> -                                          * entries in swap_avail_heads, one
> -                                          * entry per node.
> -                                          * Must be last as the number of the
> -                                          * array is nr_node_ids, which is not
> -                                          * a fixed value so have to allocate
> -                                          * dynamically.
> -                                          * And it has to be an array so that
> -                                          * plist_for_each_* can work.
> -                                          */
> +       struct plist_node avail_list;   /* entry in swap_avail_head */
>  };
>
>  static inline swp_entry_t page_swap_entry(struct page *page)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b4f3cc712580..d8a54e5af16d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
>  EXPORT_SYMBOL_GPL(nr_swap_pages);
>  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>  long total_swap_pages;
> -static int least_priority = -1;
> +#define DEF_SWAP_PRIO  -1
>  unsigned long swapfile_maximum_size;
>  #ifdef CONFIG_MIGRATION
>  bool swap_migration_ad_supported;
> @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
>   * is held and the locking order requires swap_lock to be taken
>   * before any swap_info_struct->lock.
>   */
> -static struct plist_head *swap_avail_heads;
> +static PLIST_HEAD(swap_avail_head);
>  static DEFINE_SPINLOCK(swap_avail_lock);
>
>  static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> @@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
>  static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>  {
> -       int nid;
>         unsigned long pages;
>
>         spin_lock(&swap_avail_lock);
> @@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>                  * swap_avail_lock, to ensure the result can be seen by
>                  * add_to_avail_list.
>                  */
> -               lockdep_assert_held(&si->lock);
> +               //lockdep_assert_held(&si->lock);

That seems like some leftover debug change. If it is not needed, remove it.

The rest of the patch looks fine to me. Thanks for working on it. That
is a very nice cleanup.

Agree with YoungJun Park on removing the numa swap document as well.

Looking forward to your refreshed version. I should be able to add my
Ack on your next version.

Chris


>                 si->flags &= ~SWP_WRITEOK;
>                 atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
>         } else {
> @@ -1024,8 +1023,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>                         goto skip;
>         }
>
> -       for_each_node(nid)
> -               plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
> +       plist_del(&si->avail_list, &swap_avail_head);
>
>  skip:
>         spin_unlock(&swap_avail_lock);
> @@ -1034,7 +1032,6 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>  /* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
>  static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
>  {
> -       int nid;
>         long val;
>         unsigned long pages;
>
> @@ -1067,8 +1064,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
>                         goto skip;
>         }
>
> -       for_each_node(nid)
> -               plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> +       plist_add(&si->avail_list, &swap_avail_head);
>
>  skip:
>         spin_unlock(&swap_avail_lock);
> @@ -1211,16 +1207,14 @@ static bool swap_alloc_fast(swp_entry_t *entry,
>  static bool swap_alloc_slow(swp_entry_t *entry,
>                             int order)
>  {
> -       int node;
>         unsigned long offset;
>         struct swap_info_struct *si, *next;
>
> -       node = numa_node_id();
>         spin_lock(&swap_avail_lock);
>  start_over:
> -       plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> +       plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
>                 /* Rotate the device and switch to a new cluster */
> -               plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> +               plist_requeue(&si->avail_list, &swap_avail_head);
>                 spin_unlock(&swap_avail_lock);
>                 if (get_swap_device_info(si)) {
>                         offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> @@ -1245,7 +1239,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
>                  * still in the swap_avail_head list then try it, otherwise
>                  * start over if we have not gotten any slots.
>                  */
> -               if (plist_node_empty(&next->avail_lists[node]))
> +               if (plist_node_empty(&si->avail_list))
>                         goto start_over;
>         }
>         spin_unlock(&swap_avail_lock);
> @@ -2535,44 +2529,18 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
>         return generic_swapfile_activate(sis, swap_file, span);
>  }
>
> -static int swap_node(struct swap_info_struct *si)
> -{
> -       struct block_device *bdev;
> -
> -       if (si->bdev)
> -               bdev = si->bdev;
> -       else
> -               bdev = si->swap_file->f_inode->i_sb->s_bdev;
> -
> -       return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> -}
> -
>  static void setup_swap_info(struct swap_info_struct *si, int prio,
>                             unsigned char *swap_map,
>                             struct swap_cluster_info *cluster_info,
>                             unsigned long *zeromap)
>  {
> -       int i;
> -
> -       if (prio >= 0)
> -               si->prio = prio;
> -       else
> -               si->prio = --least_priority;
> +       si->prio = prio;
>         /*
>          * the plist prio is negated because plist ordering is
>          * low-to-high, while swap ordering is high-to-low
>          */
>         si->list.prio = -si->prio;
> -       for_each_node(i) {
> -               if (si->prio >= 0)
> -                       si->avail_lists[i].prio = -si->prio;
> -               else {
> -                       if (swap_node(si) == i)
> -                               si->avail_lists[i].prio = 1;
> -                       else
> -                               si->avail_lists[i].prio = -si->prio;
> -               }
> -       }
> +       si->avail_list.prio = -si->prio;
>         si->swap_map = swap_map;
>         si->cluster_info = cluster_info;
>         si->zeromap = zeromap;
> @@ -2721,20 +2689,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         }
>         spin_lock(&p->lock);
>         del_from_avail_list(p, true);
> -       if (p->prio < 0) {
> -               struct swap_info_struct *si = p;
> -               int nid;
> -
> -               plist_for_each_entry_continue(si, &swap_active_head, list) {
> -                       si->prio++;
> -                       si->list.prio--;
> -                       for_each_node(nid) {
> -                               if (si->avail_lists[nid].prio != 1)
> -                                       si->avail_lists[nid].prio--;
> -                       }
> -               }
> -               least_priority++;
> -       }
>         plist_del(&p->list, &swap_active_head);
>         atomic_long_sub(p->pages, &nr_swap_pages);
>         total_swap_pages -= p->pages;
> @@ -2972,9 +2926,8 @@ static struct swap_info_struct *alloc_swap_info(void)
>         struct swap_info_struct *p;
>         struct swap_info_struct *defer = NULL;
>         unsigned int type;
> -       int i;
>
> -       p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
> +       p = kvzalloc(sizeof(struct swap_info_struct), GFP_KERNEL);
>         if (!p)
>                 return ERR_PTR(-ENOMEM);
>
> @@ -3013,8 +2966,7 @@ static struct swap_info_struct *alloc_swap_info(void)
>         }
>         p->swap_extent_root = RB_ROOT;
>         plist_node_init(&p->list, 0);
> -       for_each_node(i)
> -               plist_node_init(&p->avail_lists[i], 0);
> +       plist_node_init(&p->avail_list, 0);
>         p->flags = SWP_USED;
>         spin_unlock(&swap_lock);
>         if (defer) {
> @@ -3282,9 +3234,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (!capable(CAP_SYS_ADMIN))
>                 return -EPERM;
>
> -       if (!swap_avail_heads)
> -               return -ENOMEM;
> -
>         si = alloc_swap_info();
>         if (IS_ERR(si))
>                 return PTR_ERR(si);
> @@ -3465,7 +3414,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         }
>
>         mutex_lock(&swapon_mutex);
> -       prio = -1;
> +       prio = DEF_SWAP_PRIO;
>         if (swap_flags & SWAP_FLAG_PREFER)
>                 prio = swap_flags & SWAP_FLAG_PRIO_MASK;
>         enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
> @@ -3904,7 +3853,6 @@ static bool __has_usable_swap(void)
>  void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
>  {
>         struct swap_info_struct *si, *next;
> -       int nid = folio_nid(folio);
>
>         if (!(gfp & __GFP_IO))
>                 return;
> @@ -3923,8 +3871,8 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
>                 return;
>
>         spin_lock(&swap_avail_lock);
> -       plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
> -                                 avail_lists[nid]) {
> +       plist_for_each_entry_safe(si, next, &swap_avail_head,
> +                                 avail_list) {
>                 if (si->bdev) {
>                         blkcg_schedule_throttle(si->bdev->bd_disk, true);
>                         break;
> @@ -3936,18 +3884,6 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
>
>  static int __init swapfile_init(void)
>  {
> -       int nid;
> -
> -       swap_avail_heads = kmalloc_array(nr_node_ids, sizeof(struct plist_head),
> -                                        GFP_KERNEL);
> -       if (!swap_avail_heads) {
> -               pr_emerg("Not enough memory for swap heads, swap is disabled\n");
> -               return -ENOMEM;
> -       }
> -
> -       for_each_node(nid)
> -               plist_head_init(&swap_avail_heads[nid]);
> -
>         swapfile_maximum_size = arch_max_swapfile_size();
>
>  #ifdef CONFIG_MIGRATION
> --
> 2.41.0
>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24 15:54 ` Chris Li
@ 2025-09-24 16:06   ` Chris Li
  2025-09-25  2:15     ` Baoquan He
  2025-09-25  1:55   ` Baoquan He
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Li @ 2025-09-24 16:06 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs,
	YoungJun Park, Aaron Lu

On Wed, Sep 24, 2025 at 8:54 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Baoquan,
>
> Very exciting numbers. I have always suspected the per node priority
> is not doing much contribution in the new swap allocator world. I did
> not expect it to have negative contributions.
>
> On Wed, Sep 24, 2025 at 2:18 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > Currently, on system with multiple swap devices, swap allocation will
> > select one swap device according to priority. The swap device with the
> > highest priority will be chosen to allocate firstly.
> >
> > People can specify a priority from 0 to 32767 when swapon a swap device,
> > or the system will set it from -2 then downwards by default. Meanwhile,
> > on NUMA system, the swap device with node_id will be considered first
> > on that NUMA node of the node_id.
>
> That behavior was introduced by: a2468cc9bfdf ("swap: choose swap
> device according to numa node")

BTW, I noticed you did not CC Aaron Lu, who introduced a2468cc9bfdf.
I CC him here to give him a chance to test it in his environment and
defend his patch. Please CC him on the next version if his email does
not bounce.

Chris



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24  9:17 [PATCH] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
                   ` (3 preceding siblings ...)
  2025-09-24 15:54 ` Chris Li
@ 2025-09-24 16:34 ` YoungJun Park
  2025-09-25  0:24   ` Baoquan He
  2025-09-25  4:36 ` Kairui Song
  5 siblings, 1 reply; 20+ messages in thread
From: YoungJun Park @ 2025-09-24 16:34 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-mm, akpm, chrisl, kasong, baohua, shikemeng, nphamcs

On Wed, Sep 24, 2025 at 05:17:46PM +0800, Baoquan He wrote:
> -	int i;
> -
> -	if (prio >= 0)
> -		si->prio = prio;
> -	else
> -		si->prio = --least_priority;
> +	si->prio = prio;


> -     if (p->prio < 0) {
> -             struct swap_info_struct *si = p;
> -             int nid;
> -
> -             plist_for_each_entry_continue(si, &swap_active_head, list) {
> -                     si->prio++;
> -                     si->list.prio--;
> -                     for_each_node(nid) {
> -                             if (si->avail_lists[nid].prio != 1)
> -                                     si->avail_lists[nid].prio--;
> -                     }
> -             }
> -             least_priority++;
> -     }
 
Hi Baoquan He,

I noticed that in `_enable_swap_info()` there is still a comment
describing the behavior that this patch removes (auto-assigning
priority). Shouldn't that part also be updated accordingly?

static void _enable_swap_info(struct swap_info_struct *si)
{
        ...
        /*
         * both lists are plists, and thus priority ordered.
         * swap_active_head needs to be priority ordered for swapoff(),
         * which on removal of any swap_info_struct with an auto-assigned
         * (i.e. negative) priority increments the auto-assigned priority
         * of any lower-priority swap_info_structs.
         * swap_avail_head needs to be priority ordered for folio_alloc_swap(),
         * which allocates swap pages from the highest available priority
         * swap_info_struct.
         */

Best regards,
Youngjun Park



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24 16:34 ` YoungJun Park
@ 2025-09-25  0:24   ` Baoquan He
  0 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-09-25  0:24 UTC (permalink / raw)
  To: YoungJun Park; +Cc: linux-mm, akpm, chrisl, kasong, baohua, shikemeng, nphamcs

On 09/25/25 at 01:34am, YoungJun Park wrote:
> On Wed, Sep 24, 2025 at 05:17:46PM +0800, Baoquan He wrote:
> > -	int i;
> > -
> > -	if (prio >= 0)
> > -		si->prio = prio;
> > -	else
> > -		si->prio = --least_priority;
> > +	si->prio = prio;
> 
> 
> > -     if (p->prio < 0) {
> > -             struct swap_info_struct *si = p;
> > -             int nid;
> > -
> > -             plist_for_each_entry_continue(si, &swap_active_head, list) {
> > -                     si->prio++;
> > -                     si->list.prio--;
> > -                     for_each_node(nid) {
> > -                             if (si->avail_lists[nid].prio != 1)
> > -                                     si->avail_lists[nid].prio--;
> > -                     }
> > -             }
> > -             least_priority++;
> > -     }
>  
> Hi Baoquan He,
> 
> I noticed that in `_enable_swap_info()` there is still a comment describing
> the behavior that this patch removes(auto assigning priority). 
> Shouldn’t that part also be updated accordingly?

You are right, this needs to be updated. Will change it in v2. Thanks
a lot for the careful review.

> 
> static void _enable_swap_info(struct swap_info_struct *si)
> {
>         ...
>         /*
>          * both lists are plists, and thus priority ordered.
>          * swap_active_head needs to be priority ordered for swapoff(),
>          * which on removal of any swap_info_struct with an auto-assigned
>          * (i.e. negative) priority increments the auto-assigned priority
>          * of any lower-priority swap_info_structs.
>          * swap_avail_head needs to be priority ordered for folio_alloc_swap(),
>          * which allocates swap pages from the highest available priority
>          * swap_info_struct.
>          */
> 
> Best regards,
> Youngjun Park
> 




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24 15:54 ` Chris Li
  2025-09-24 16:06   ` Chris Li
@ 2025-09-25  1:55   ` Baoquan He
  2025-09-25 18:25     ` Chris Li
  1 sibling, 1 reply; 20+ messages in thread
From: Baoquan He @ 2025-09-25  1:55 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs, YoungJun Park

On 09/24/25 at 08:54am, Chris Li wrote:
> Hi Baoquan,
> 
> Very exciting numbers. I have always suspected the per node priority
> is not doing much contribution in the new swap allocator world. I did
> not expect it to have negative contributions.

Yes.

Compared with the very beginning, though, there has been some
progress. Originally there was a single plist, and swap devices got
priorities from -1 downwards by default; all CPUs would exhaust the
swap device with priority '-1', then the one with priority '-2', then
'-3', and so on. I think the node-based adjustment distributes the
pressure of lock contention on a single swap device a little bit.
However, within a node the node's CPUs still try to exhaust one swap
device at a time; and nodes with no swap device attached still exhaust
swap devices one by one in priority order.

> 
> On Wed, Sep 24, 2025 at 2:18 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > Currently, on system with multiple swap devices, swap allocation will
> > select one swap device according to priority. The swap device with the
> > highest priority will be chosen to allocate firstly.
> >
> > People can specify a priority from 0 to 32767 when swapon a swap device,
> > or the system will set it from -2 then downwards by default. Meanwhile,
> > on NUMA system, the swap device with node_id will be considered first
> > on that NUMA node of the node_id.
> 
> That behavior was introduced by: a2468cc9bfdf ("swap: choose swap
> device according to numa node")
> You are effectively reverting that patch and the following fix up
> patches on top of that.
> The commit message or maybe the title should reflect the reversion nature.
> 
> If you did more than the simple revert plus fix up, please document
> what additional change you make in this patch.

Sure, I will mention commit a2468cc9bfdf and that my patch reverts
it. On top of that, the default priority of swap devices is set to
'-1' so that all swap devices with the default priority are chosen
round robin. This way, the si->lock contention can be greatly reduced.

> 
> >
> > In the current code, an array of plist, swap_avail_heads[nid], is used
> > to organize swap devices on each NUMA node. For each NUMA node, there
> > is a plist organizing all swap devices. The 'prio' value in the plist
> > is the negated value of the device's priority due to plist being sorted
> > from low to high. The swap device owning one node_id will be promoted to
> > the front position on that NUMA node, then other swap devices are put in
> > order of their default priority.
> >
> 
> The original patch that introduced it is using SSD as a benchmark.
> Here you are using patched zram as a benchmark. You want to explain in
> a little bit detail why you choose a different test method. e.g. You
> don't have a machine with an SSD device as a raw partition to do the
> original test. Compression ram based swap device, zswap or zram is
> used a lot more in the data center server and android workload
> environment, maybe even in some Linux workstation distro as well.

Yeah, just as you said. I haven't got a machine where I can create
several SSD partitions as swap devices to test this, so I hacked the
zram kernel code to assign a node_id to each zram device. I also tried
using several swap files on one SSD disk, but the speed is too slow
and the comparison is not very obvious; there's only about a 1%
improvement. And as you said, in reality zram or zswap is being
adopted more and more and is becoming the default setup on systems.
So I think it makes sense to use zram devices to test this patch.

> 
> You can also invite others who do have the spare SSD driver to test
> the SSD as a swap device. Maybe with some setup instructions how to
> set up and repeat your test on their machine with multiple SSD drives.
> How to compare the result with or without your reversion patch.

Sure, I will write the detailed steps in the v2 cover letter, and
attach my two draft scripts that ease collecting statistics and
producing the final result.

> 
> > E.g I got a system with 8 NUMA nodes, and I setup 4 zram partition as
> > swap devices.
> 
> You want to make it clear up front that you are using a patched zram
> to simulate the per node swap device behavior. Native zram does not
> have that.

Sure, will add it in v2 cover letter.

> 
> >
> > Current behaviour:
> > their priorities will be(note that -1 is skipped):
> > NAME       TYPE      SIZE USED PRIO
> > /dev/zram0 partition  16G   0B   -2
> > /dev/zram1 partition  16G   0B   -3
> > /dev/zram2 partition  16G   0B   -4
> > /dev/zram3 partition  16G   0B   -5
> >
> > And their positions in the 8 swap_avail_lists[nid] will be:
> > swap_avail_lists[0]: /* node 0's available swap device list */
> > zram0   -> zram1   -> zram2   -> zram3
> > prio:1     prio:3     prio:4     prio:5
> > swap_avali_lists[1]: /* node 1's available swap device list */
> > zram1   -> zram0   -> zram2   -> zram3
> > prio:1     prio:2     prio:4     prio:5
> > swap_avail_lists[2]: /* node 2's available swap device list */
> > zram2   -> zram0   -> zram1   -> zram3
> > prio:1     prio:2     prio:3     prio:5
> > swap_avail_lists[3]: /* node 3's available swap device list */
> > zram3   -> zram0   -> zram1   -> zram2
> > prio:1     prio:2     prio:3     prio:4
> > swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> > zram0   -> zram1   -> zram2   -> zram3
> > prio:2     prio:3     prio:4     prio:5
> >
> > The adjustment for swap device with node_id intended to decrease the
> > pressure of lock contention for one swap device by taking different
> > swap device on different node. However, the adjustment is very
> > coarse-grained. On the node, the swap device sharing the node's id will
> > always be selected firstly by node's CPUs until exhausted, then next one.
> > And on other nodes where no swap device shares its node id, swap device
> > with priority '-2' will be selected firstly until exhausted, then next
> > with priority '-3'.
> >
> > This is the swapon output during the process high pressure vm-scability
> > test is being taken. It's clearly shown zram0 is heavily exploited until
> > exhausted.
> 
> Any tips how others repeat your high pressure vm-scability test,
> especially for someone who has multiple SSD drives as a test swap
> device. Some test script setup would be nice. You can post the
> instruction in the same email thread as separate email, it does not
> have to be in the commit message.

Sure, will do.

> 
> > ===================================
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME       TYPE      SIZE  USED PRIO
> > /dev/zram0 partition  16G 15.7G   -2
> > /dev/zram1 partition  16G  3.4G   -3
> > /dev/zram2 partition  16G  3.4G   -4
> > /dev/zram3 partition  16G  2.6G   -5
> >
> > This is unreasonable because swap devices are assumed to have similar
> > accessing speed if no priority is specified when swapon. It's unfair and
> > doesn't make sense just because one swap device is swapped on firstly,
> > its priority will be higher than the one swapped on later.
> >
> > So here change is made to select the swap device round robin if default
> > priority. In code, the plist array swap_avail_heads[nid] is replaced
> > with a plist swap_avail_head. Any device w/o specified priority will get
> > the same default priority '-1'. Surely, swap device with specified priority
> > are always put foremost, this is not impacted. If you care about their
> > different accessing speed, then use 'swapon -p xx' to deploy priority for
> > your swap devices.
> >
> > New behaviour:
> >
> > swap_avail_list: /* one global available swap device list */
> > zram0   -> zram1   -> zram2   -> zram3
> > prio:1     prio:1     prio:1     prio:1
> >
> > This is the swapon output during the process high pressure vm-scability
> > being taken, all is selected round robin:
> > =======================================
> > [root@hp-dl385g10-03 linux]# swapon
> > NAME       TYPE      SIZE  USED PRIO
> > /dev/zram0 partition  16G 12.6G   -1
> > /dev/zram1 partition  16G 12.6G   -1
> > /dev/zram2 partition  16G 12.6G   -1
> > /dev/zram3 partition  16G 12.6G   -1
> >
> > With the change, we can see about 18% efficiency promotion as below:
> >
> > vm-scability test:
> > ==================
> > Test with:
> > usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
> >                            Before:          After:
> > System time:               637.92 s         526.74 s
> You can clarify here lower is better.
> > Sum Throughput:            3546.56 MB/s     4207.56 MB/s
> 
> Higher is better. Also a percentage number can be useful here. e.g.
> that is +18.6% percent improvement since reverting to round robin. A
> huge difference!
> 
> > Single process Throughput: 114.40 MB/s      135.72 MB/s
> Higher is better.
> > free latency:              10138455.99 us   6810119.01 us

OK, will improve this.

> >
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  include/linux/swap.h | 11 +-----
> >  mm/swapfile.c        | 94 +++++++-------------------------------------
> 
> Very nice patch stats! Fewer code and runs faster. What more can we ask for?
> 
> 
> >  2 files changed, 16 insertions(+), 89 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 3473e4247ca3..f72c8e5e0635 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -337,16 +337,7 @@ struct swap_info_struct {
> >         struct work_struct discard_work; /* discard worker */
> >         struct work_struct reclaim_work; /* reclaim worker */
> >         struct list_head discard_clusters; /* discard clusters list */
> > -       struct plist_node avail_lists[]; /*
> > -                                          * entries in swap_avail_heads, one
> > -                                          * entry per node.
> > -                                          * Must be last as the number of the
> > -                                          * array is nr_node_ids, which is not
> > -                                          * a fixed value so have to allocate
> > -                                          * dynamically.
> > -                                          * And it has to be an array so that
> > -                                          * plist_for_each_* can work.
> > -                                          */
> > +       struct plist_node avail_list;   /* entry in swap_avail_head */
> >  };
> >
> >  static inline swp_entry_t page_swap_entry(struct page *page)
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index b4f3cc712580..d8a54e5af16d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
> >  EXPORT_SYMBOL_GPL(nr_swap_pages);
> >  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> >  long total_swap_pages;
> > -static int least_priority = -1;
> > +#define DEF_SWAP_PRIO  -1
> >  unsigned long swapfile_maximum_size;
> >  #ifdef CONFIG_MIGRATION
> >  bool swap_migration_ad_supported;
> > @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
> >   * is held and the locking order requires swap_lock to be taken
> >   * before any swap_info_struct->lock.
> >   */
> > -static struct plist_head *swap_avail_heads;
> > +static PLIST_HEAD(swap_avail_head);
> >  static DEFINE_SPINLOCK(swap_avail_lock);
> >
> >  static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> > @@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >  /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
> >  static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >  {
> > -       int nid;
> >         unsigned long pages;
> >
> >         spin_lock(&swap_avail_lock);
> > @@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >                  * swap_avail_lock, to ensure the result can be seen by
> >                  * add_to_avail_list.
> >                  */
> > -               lockdep_assert_held(&si->lock);
> > +               //lockdep_assert_held(&si->lock);
> 
> That seems like some debug stuff left over. If you need to remove it, remove it.

I tried adjusting the lock usage at one point; later I found I need to
keep it as is, but forgot to change this line back. Will change it in
v2.

> 
> The rest of the patch looks fine to me. Thanks for working on it. That
> is a very nice cleanup.
> 
> Agree with YoungJun Park on removing the numa swap document as well.

Exactly, will do.

> 
> Looking forward to your refresh version. I should be able to Ack-by on
> your next version.

Thank you very much for your great suggestions from a big-picture
view, and for your help during the process. Really appreciate it.

Thanks
Baoquan




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24 16:06   ` Chris Li
@ 2025-09-25  2:15     ` Baoquan He
  2025-09-25 18:31       ` Chris Li
  0 siblings, 1 reply; 20+ messages in thread
From: Baoquan He @ 2025-09-25  2:15 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs,
	YoungJun Park, Aaron Lu

On 09/24/25 at 09:06am, Chris Li wrote:
> On Wed, Sep 24, 2025 at 8:54 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Baoquan,
> >
> > Very exciting numbers. I have always suspected the per node priority
> > is not doing much contribution in the new swap allocator world. I did
> > not expect it to have negative contributions.
> >
> > On Wed, Sep 24, 2025 at 2:18 AM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > Currently, on system with multiple swap devices, swap allocation will
> > > select one swap device according to priority. The swap device with the
> > > highest priority will be chosen to allocate firstly.
> > >
> > > People can specify a priority from 0 to 32767 when swapon a swap device,
> > > or the system will set it from -2 then downwards by default. Meanwhile,
> > > on NUMA system, the swap device with node_id will be considered first
> > > on that NUMA node of the node_id.
> >
> > That behavior was introduced by: a2468cc9bfdf ("swap: choose swap
> > device according to numa node")
> 
> BTW, I noticed you did not CC Aaron Lu who introduced  a2468cc9bfdf. I
> CC him here and give him a chance to test it in his environment to
> defend his patch. Please CC him on the next version if his email does
> not bounce.

Sure, definitely we should CC Aaron. Thanks a lot for noticing.




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24 15:41 ` YoungJun Park
@ 2025-09-25  2:24   ` Baoquan He
  0 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-09-25  2:24 UTC (permalink / raw)
  To: YoungJun Park; +Cc: linux-mm, akpm, chrisl, kasong, baohua, shikemeng, nphamcs

On 09/25/25 at 12:41am, YoungJun Park wrote:
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  include/linux/swap.h | 11 +-----
> >  mm/swapfile.c        | 94 +++++++-------------------------------------
> >  2 files changed, 16 insertions(+), 89 deletions(-)
> Hi Baoquan He,
> 
> Thanks for working on this patch.  
> Shouldn’t `Documentation/admin-guide/mm/swap_numa.rst` 
> also be removed as part of this change?

Right, will remove it in v2. Thanks a lot.




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24 15:52 ` YoungJun Park
@ 2025-09-25  4:10   ` Baoquan He
  2025-09-25  4:23     ` Baoquan He
  0 siblings, 1 reply; 20+ messages in thread
From: Baoquan He @ 2025-09-25  4:10 UTC (permalink / raw)
  To: YoungJun Park; +Cc: linux-mm, akpm, chrisl, kasong, baohua, shikemeng, nphamcs

On 09/25/25 at 12:52am, YoungJun Park wrote:
> On Wed, Sep 24, 2025 at 05:17:46PM +0800, Baoquan He wrote:
> > -		lockdep_assert_held(&si->lock);
> > +		//lockdep_assert_held(&si->lock);
> >  		si->flags &= ~SWP_WRITEOK;
> 
> It seems that the commented line `//lockdep_assert_held(&si->lock);` 
> should be removed.

Yes, it should be removed. During swapoff, si->lock is taken, while
during swap_range_alloc(), only ci->lock is taken. Thanks.




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-25  4:10   ` Baoquan He
@ 2025-09-25  4:23     ` Baoquan He
  0 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-09-25  4:23 UTC (permalink / raw)
  To: YoungJun Park; +Cc: linux-mm, akpm, chrisl, kasong, baohua, shikemeng, nphamcs

On 09/25/25 at 12:10pm, Baoquan He wrote:
> On 09/25/25 at 12:52am, YoungJun Park wrote:
> > On Wed, Sep 24, 2025 at 05:17:46PM +0800, Baoquan He wrote:
> > > -		lockdep_assert_held(&si->lock);
> > > +		//lockdep_assert_held(&si->lock);
> > >  		si->flags &= ~SWP_WRITEOK;
> > 
> > It seems that the commented line `//lockdep_assert_held(&si->lock);` 
> > should be removed.
> 
> Yes, it should be removed. During swapoff, si->lock is taken, while
> during swap_range_alloc(), only ci-> is taken. Thanks.

Sorry, I was wrong. That line should be kept, because during swapoff()
we need to hold si->lock when we call del_from_avail_list().




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-24  9:17 [PATCH] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
                   ` (4 preceding siblings ...)
  2025-09-24 16:34 ` YoungJun Park
@ 2025-09-25  4:36 ` Kairui Song
  2025-09-25  6:18   ` Baoquan He
  5 siblings, 1 reply; 20+ messages in thread
From: Kairui Song @ 2025-09-25  4:36 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-mm, akpm, chrisl, baohua, shikemeng, nphamcs

On Wed, Sep 24, 2025 at 6:15 PM Baoquan He <bhe@redhat.com> wrote:
>
> Currently, on system with multiple swap devices, swap allocation will
> select one swap device according to priority. The swap device with the
> highest priority will be chosen to allocate firstly.
>
> People can specify a priority from 0 to 32767 when swapon a swap device,
> or the system will set it from -2 then downwards by default. Meanwhile,
> on NUMA system, the swap device with node_id will be considered first
> on that NUMA node of the node_id.
>
> In the current code, an array of plist, swap_avail_heads[nid], is used
> to organize swap devices on each NUMA node. For each NUMA node, there
> is a plist organizing all swap devices. The 'prio' value in the plist
> is the negated value of the device's priority due to plist being sorted
> from low to high. The swap device owning one node_id will be promoted to
> the front position on that NUMA node, then other swap devices are put in
> order of their default priority.
>
> E.g I got a system with 8 NUMA nodes, and I setup 4 zram partition as
> swap devices.
>
> Current behaviour:
> their priorities will be(note that -1 is skipped):
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition  16G   0B   -2
> /dev/zram1 partition  16G   0B   -3
> /dev/zram2 partition  16G   0B   -4
> /dev/zram3 partition  16G   0B   -5
>
> And their positions in the 8 swap_avail_lists[nid] will be:
> swap_avail_lists[0]: /* node 0's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:3     prio:4     prio:5
> swap_avail_lists[1]: /* node 1's available swap device list */
> zram1   -> zram0   -> zram2   -> zram3
> prio:1     prio:2     prio:4     prio:5
> swap_avail_lists[2]: /* node 2's available swap device list */
> zram2   -> zram0   -> zram1   -> zram3
> prio:1     prio:2     prio:3     prio:5
> swap_avail_lists[3]: /* node 3's available swap device list */
> zram3   -> zram0   -> zram1   -> zram2
> prio:1     prio:2     prio:3     prio:4
> swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:2     prio:3     prio:4     prio:5
>
> The node_id adjustment was intended to decrease lock contention on a
> single swap device by having different nodes use different swap
> devices. However, the adjustment is very coarse-grained: on a node, the
> swap device sharing the node's id is always selected first by that
> node's CPUs until it is exhausted, then the next one. And on nodes
> where no swap device shares the node id, the swap device with priority
> '-2' is selected first until exhausted, then the one with priority
> '-3'.
>
> This is the swapon output while the high-pressure vm-scalability test
> is running. It clearly shows zram0 is heavily used until exhausted.
> ===================================
> [root@hp-dl385g10-03 ~]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 15.7G   -2
> /dev/zram1 partition  16G  3.4G   -3
> /dev/zram2 partition  16G  3.4G   -4
> /dev/zram3 partition  16G  2.6G   -5
>
> This is unreasonable, because swap devices are assumed to have similar
> access speed if no priority is specified at swapon time. It's unfair
> and makes no sense that a swap device gets a higher priority than
> another merely because it was swapped on earlier.
>
> So this change selects swap devices round robin when they have the
> default priority. In code, the plist array swap_avail_heads[nid] is
> replaced with a single plist, swap_avail_head. Any device without a
> specified priority gets the same default priority '-1'. Swap devices
> with a specified priority are still always put first; that behaviour is
> not affected. If you care about their different access speeds, use
> 'swapon -p xx' to assign priorities to your swap devices.
>
> New behaviour:
>
> swap_avail_list: /* one global available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:1     prio:1     prio:1
>
> This is the swapon output while the high-pressure vm-scalability test
> is running; all devices are selected round robin:
> =======================================
> [root@hp-dl385g10-03 linux]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 12.6G   -1
> /dev/zram1 partition  16G 12.6G   -1
> /dev/zram2 partition  16G 12.6G   -1
> /dev/zram3 partition  16G 12.6G   -1
>
> With the change, we can see about an 18% performance improvement, as below:
>
> vm-scalability test:
> ====================
> Test with:
> usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
>                            Before:          After:
> System time:               637.92 s         526.74 s
> Sum Throughput:            3546.56 MB/s     4207.56 MB/s
> Single process Throughput: 114.40 MB/s      135.72 MB/s
> free latency:              10138455.99 us   6810119.01 us
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  include/linux/swap.h | 11 +-----
>  mm/swapfile.c        | 94 +++++++-------------------------------------
>  2 files changed, 16 insertions(+), 89 deletions(-)

Hi Baoquan,

Thanks for the patch! The node plist thing has always looked confusing to me.

>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3473e4247ca3..f72c8e5e0635 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -337,16 +337,7 @@ struct swap_info_struct {
>         struct work_struct discard_work; /* discard worker */
>         struct work_struct reclaim_work; /* reclaim worker */
>         struct list_head discard_clusters; /* discard clusters list */
> -       struct plist_node avail_lists[]; /*
> -                                          * entries in swap_avail_heads, one
> -                                          * entry per node.
> -                                          * Must be last as the number of the
> -                                          * array is nr_node_ids, which is not
> -                                          * a fixed value so have to allocate
> -                                          * dynamically.
> -                                          * And it has to be an array so that
> -                                          * plist_for_each_* can work.
> -                                          */
> +       struct plist_node avail_list;   /* entry in swap_avail_head */
>  };
>
>  static inline swp_entry_t page_swap_entry(struct page *page)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b4f3cc712580..d8a54e5af16d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
>  EXPORT_SYMBOL_GPL(nr_swap_pages);
>  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>  long total_swap_pages;
> -static int least_priority = -1;
> +#define DEF_SWAP_PRIO  -1
>  unsigned long swapfile_maximum_size;
>  #ifdef CONFIG_MIGRATION
>  bool swap_migration_ad_supported;
> @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
>   * is held and the locking order requires swap_lock to be taken
>   * before any swap_info_struct->lock.
>   */
> -static struct plist_head *swap_avail_heads;
> +static PLIST_HEAD(swap_avail_head);
>  static DEFINE_SPINLOCK(swap_avail_lock);
>
>  static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> @@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
>  static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>  {
> -       int nid;
>         unsigned long pages;
>
>         spin_lock(&swap_avail_lock);
> @@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>                  * swap_avail_lock, to ensure the result can be seen by
>                  * add_to_avail_list.
>                  */
> -               lockdep_assert_held(&si->lock);
> +               //lockdep_assert_held(&si->lock);

If this needs to be removed, then it doesn't seem to comply with the
comment above?

Here we are modifying si->flags, which is supposed to be protected by si->lock.



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-25  4:36 ` Kairui Song
@ 2025-09-25  6:18   ` Baoquan He
  0 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-09-25  6:18 UTC (permalink / raw)
  To: Kairui Song; +Cc: linux-mm, akpm, chrisl, baohua, shikemeng, nphamcs

On 09/25/25 at 12:36pm, Kairui Song wrote:
> On Wed, Sep 24, 2025 at 6:15 PM Baoquan He <bhe@redhat.com> wrote:
......
> > vm-scability test:
> > ==================
> > Test with:
> > usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
> >                            Before:          After:
> > System time:               637.92 s         526.74 s
> > Sum Throughput:            3546.56 MB/s     4207.56 MB/s
> > Single process Throughput: 114.40 MB/s      135.72 MB/s
> > free latency:              10138455.99 us   6810119.01 us
> >
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  include/linux/swap.h | 11 +-----
> >  mm/swapfile.c        | 94 +++++++-------------------------------------
> >  2 files changed, 16 insertions(+), 89 deletions(-)
> 
> Hi Baoquan,
> 
> Thanks for the patch! The node plist thing has always looked confusing to me.

Thanks.

> 
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 3473e4247ca3..f72c8e5e0635 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -337,16 +337,7 @@ struct swap_info_struct {
> >         struct work_struct discard_work; /* discard worker */
> >         struct work_struct reclaim_work; /* reclaim worker */
> >         struct list_head discard_clusters; /* discard clusters list */
> > -       struct plist_node avail_lists[]; /*
> > -                                          * entries in swap_avail_heads, one
> > -                                          * entry per node.
> > -                                          * Must be last as the number of the
> > -                                          * array is nr_node_ids, which is not
> > -                                          * a fixed value so have to allocate
> > -                                          * dynamically.
> > -                                          * And it has to be an array so that
> > -                                          * plist_for_each_* can work.
> > -                                          */
> > +       struct plist_node avail_list;   /* entry in swap_avail_head */
> >  };
> >
> >  static inline swp_entry_t page_swap_entry(struct page *page)
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index b4f3cc712580..d8a54e5af16d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
> >  EXPORT_SYMBOL_GPL(nr_swap_pages);
> >  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> >  long total_swap_pages;
> > -static int least_priority = -1;
> > +#define DEF_SWAP_PRIO  -1
> >  unsigned long swapfile_maximum_size;
> >  #ifdef CONFIG_MIGRATION
> >  bool swap_migration_ad_supported;
> > @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
> >   * is held and the locking order requires swap_lock to be taken
> >   * before any swap_info_struct->lock.
> >   */
> > -static struct plist_head *swap_avail_heads;
> > +static PLIST_HEAD(swap_avail_head);
> >  static DEFINE_SPINLOCK(swap_avail_lock);
> >
> >  static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> > @@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >  /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
> >  static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >  {
> > -       int nid;
> >         unsigned long pages;
> >
> >         spin_lock(&swap_avail_lock);
> > @@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >                  * swap_avail_lock, to ensure the result can be seen by
> >                  * add_to_avail_list.
> >                  */
> > -               lockdep_assert_held(&si->lock);
> > +               //lockdep_assert_held(&si->lock);
> 
> If this needs to be removed, then it doesn't seem to comply with the
> comment above?
> 
> Here we are modifying si->flags, which is supposed to be protected by si lock.

You are right, I was wrong when changing code and debugging, will add it
back in v2. Thanks a lot for careful checking.




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-25  1:55   ` Baoquan He
@ 2025-09-25 18:25     ` Chris Li
  2025-09-26 15:31       ` Baoquan He
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Li @ 2025-09-25 18:25 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs, YoungJun Park

On Wed, Sep 24, 2025 at 6:55 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 09/24/25 at 08:54am, Chris Li wrote:
> > Hi Baoquan,
> >
> > Very exciting numbers. I have always suspected the per node priority
> > is not doing much contribution in the new swap allocator world. I did
> > not expect it to have negative contributions.
>
> Yes.
>
> While compared with the very beginning, there has been some progress.
> At the very beginning, there was one plist, and swap devices got
> priorities from -1 downwards by default; all CPUs would exhaust the
> swap device of priority '-1', then select the swap device of priority
> '-2' to exhaust, then -3, and so on. I think the node-based adjustment
> distributes the lock contention pressure on one swap device a little
> bit. However, within a node, the node's CPUs still try to exhaust one
> swap device; and nodes without an attached swap device still try to
> exhaust swap devices one by one in priority order.
>
> >
> > On Wed, Sep 24, 2025 at 2:18 AM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > Currently, on system with multiple swap devices, swap allocation will
> > > select one swap device according to priority. The swap device with the
> > > highest priority will be chosen to allocate firstly.
> > >
> > > People can specify a priority from 0 to 32767 when swapon a swap device,
> > > or the system will set it from -2 then downwards by default. Meanwhile,
> > > on NUMA system, the swap device with node_id will be considered first
> > > on that NUMA node of the node_id.
> >
> > That behavior was introduced by: a2468cc9bfdf ("swap: choose swap
> > device according to numa node")
> > You are effectively reverting that patch and the following fix up
> > patches on top of that.
> > The commit message or maybe the title should reflect the reversion nature.
> >
> > If you did more than the simple revert plus fix up, please document
> > what additional change you make in this patch.
>
> Sure, I will mention commit a2468cc9bfdf and that my patch reverts it;
> on top of that, the default priority of swap devices will be set to
> '-1' so that all swap devices with the default priority are chosen
> round robin. This way, the si->lock contention can be greatly reduced.

Just curious: does setting it to "-1" match the kernel behavior before
a2468cc9bfdf? If not, what was the behavior before a2468cc9bfdf?

> > > In the current code, an array of plist, swap_avail_heads[nid], is used
> > > to organize swap devices on each NUMA node. For each NUMA node, there
> > > is a plist organizing all swap devices. The 'prio' value in the plist
> > > is the negated value of the device's priority due to plist being sorted
> > > from low to high. The swap device owning one node_id will be promoted to
> > > the front position on that NUMA node, then other swap devices are put in
> > > order of their default priority.
> > >
> >
> > The original patch that introduced it used an SSD as a benchmark.
> > Here you are using a patched zram as the benchmark. You may want to
> > explain in a little more detail why you chose a different test
> > method, e.g. you don't have a machine with SSD raw partitions to do
> > the original test. Compressed-RAM swap devices, zswap or zram, are
> > used a lot more in data center server and Android workload
> > environments, maybe even in some Linux workstation distros as well.
>
> Yeah, just as you said, I haven't got a machine where I can create
> several SSD partitions as swap devices to test this, so I hacked the
> zram kernel code to assign each zram device's id as its node_id. I also
> tried using several swap files on one SSD disk, but the speed is too
> slow and the comparison is not very obvious; there's only about a 1%
> improvement. And as you said, in reality zram or zswap is used more and
> more and has become the default setup on systems, so I think it makes
> sense to use zram devices to test this patch.

Yes, I understand; you may want to mention that in your cover letter so
the reader knows.

>
> >
> > You can also invite others who do have the spare SSD driver to test
> > the SSD as a swap device. Maybe with some setup instructions how to
> > set up and repeat your test on their machine with multiple SSD drives.
> > How to compare the result with or without your reversion patch.
>
> Sure, I will write these detailed steps in the v2 cover letter, and
> attach my two draft scripts to ease gathering statistics and outputting
> the final result.

Ack

>
> >
> > > E.g I got a system with 8 NUMA nodes, and I setup 4 zram partition as
> > > swap devices.
> >
> > You want to make it clear up front that you are using a patched zram
> > to simulate the per node swap device behavior. Native zram does not
> > have that.
>
> Sure, will add it in v2 cover letter.

Ack.

>
> >
> > >
> > > Current behaviour:
> > > their priorities will be(note that -1 is skipped):
> > > NAME       TYPE      SIZE USED PRIO
> > > /dev/zram0 partition  16G   0B   -2
> > > /dev/zram1 partition  16G   0B   -3
> > > /dev/zram2 partition  16G   0B   -4
> > > /dev/zram3 partition  16G   0B   -5
> > >
> > > And their positions in the 8 swap_avail_lists[nid] will be:
> > > swap_avail_lists[0]: /* node 0's available swap device list */
> > > zram0   -> zram1   -> zram2   -> zram3
> > > prio:1     prio:3     prio:4     prio:5
> > > swap_avail_lists[1]: /* node 1's available swap device list */
> > > zram1   -> zram0   -> zram2   -> zram3
> > > prio:1     prio:2     prio:4     prio:5
> > > swap_avail_lists[2]: /* node 2's available swap device list */
> > > zram2   -> zram0   -> zram1   -> zram3
> > > prio:1     prio:2     prio:3     prio:5
> > > swap_avail_lists[3]: /* node 3's available swap device list */
> > > zram3   -> zram0   -> zram1   -> zram2
> > > prio:1     prio:2     prio:3     prio:4
> > > swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> > > zram0   -> zram1   -> zram2   -> zram3
> > > prio:2     prio:3     prio:4     prio:5
> > >
> > > The adjustment for swap device with node_id intended to decrease the
> > > pressure of lock contention for one swap device by taking different
> > > swap device on different node. However, the adjustment is very
> > > coarse-grained. On the node, the swap device sharing the node's id will
> > > always be selected firstly by node's CPUs until exhausted, then next one.
> > > And on other nodes where no swap device shares its node id, swap device
> > > with priority '-2' will be selected firstly until exhausted, then next
> > > with priority '-3'.
> > >
> > > This is the swapon output while the high-pressure vm-scalability
> > > test is running. It clearly shows zram0 is heavily used until
> > > exhausted.
> >
> > Any tips on how others can repeat your high-pressure vm-scalability
> > test, especially for someone who has multiple SSD drives to use as
> > test swap devices? Some test setup script would be nice. You can post
> > the instructions in the same email thread as a separate email; they
> > do not have to be in the commit message.
>
> Sure, will do.

Ack

>
> >
> > > ===================================
> > > [root@hp-dl385g10-03 ~]# swapon
> > > NAME       TYPE      SIZE  USED PRIO
> > > /dev/zram0 partition  16G 15.7G   -2
> > > /dev/zram1 partition  16G  3.4G   -3
> > > /dev/zram2 partition  16G  3.4G   -4
> > > /dev/zram3 partition  16G  2.6G   -5
> > >
> > > This is unreasonable because swap devices are assumed to have similar
> > > accessing speed if no priority is specified when swapon. It's unfair and
> > > doesn't make sense just because one swap device is swapped on firstly,
> > > its priority will be higher than the one swapped on later.
> > >
> > > So here change is made to select the swap device round robin if default
> > > priority. In code, the plist array swap_avail_heads[nid] is replaced
> > > with a plist swap_avail_head. Any device w/o specified priority will get
> > > the same default priority '-1'. Surely, swap device with specified priority
> > > are always put foremost, this is not impacted. If you care about their
> > > different accessing speed, then use 'swapon -p xx' to deploy priority for
> > > your swap devices.
> > >
> > > New behaviour:
> > >
> > > swap_avail_list: /* one global available swap device list */
> > > zram0   -> zram1   -> zram2   -> zram3
> > > prio:1     prio:1     prio:1     prio:1
> > >
> > > This is the swapon output during the process high pressure vm-scability
> > > being taken, all is selected round robin:
> > > =======================================
> > > [root@hp-dl385g10-03 linux]# swapon
> > > NAME       TYPE      SIZE  USED PRIO
> > > /dev/zram0 partition  16G 12.6G   -1
> > > /dev/zram1 partition  16G 12.6G   -1
> > > /dev/zram2 partition  16G 12.6G   -1
> > > /dev/zram3 partition  16G 12.6G   -1
> > >
> > > With the change, we can see about 18% efficiency promotion as below:
> > >
> > > vm-scability test:
> > > ==================
> > > Test with:
> > > usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
> > >                            Before:          After:
> > > System time:               637.92 s         526.74 s
> > You can clarify here that lower is better.
> > > Sum Throughput:            3546.56 MB/s     4207.56 MB/s
> >
> > Higher is better. Also, a percentage number can be useful here, e.g.
> > that is a +18.6% improvement since reverting to round robin. A
> > huge difference!
> >
> > > Single process Throughput: 114.40 MB/s      135.72 MB/s
> > Higher is better.
> > > free latency:              10138455.99 us   6810119.01 us
>
> OK, will improve this.

Ack

> > >
> > > Suggested-by: Chris Li <chrisl@kernel.org>
> > > Signed-off-by: Baoquan He <bhe@redhat.com>
> > > ---
> > >  include/linux/swap.h | 11 +-----
> > >  mm/swapfile.c        | 94 +++++++-------------------------------------
> >
> > Very nice patch stats! Less code, and it runs faster. What more can we ask for?
> >
> >
> > >  2 files changed, 16 insertions(+), 89 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 3473e4247ca3..f72c8e5e0635 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -337,16 +337,7 @@ struct swap_info_struct {
> > >         struct work_struct discard_work; /* discard worker */
> > >         struct work_struct reclaim_work; /* reclaim worker */
> > >         struct list_head discard_clusters; /* discard clusters list */
> > > -       struct plist_node avail_lists[]; /*
> > > -                                          * entries in swap_avail_heads, one
> > > -                                          * entry per node.
> > > -                                          * Must be last as the number of the
> > > -                                          * array is nr_node_ids, which is not
> > > -                                          * a fixed value so have to allocate
> > > -                                          * dynamically.
> > > -                                          * And it has to be an array so that
> > > -                                          * plist_for_each_* can work.
> > > -                                          */
> > > +       struct plist_node avail_list;   /* entry in swap_avail_head */
> > >  };
> > >
> > >  static inline swp_entry_t page_swap_entry(struct page *page)
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index b4f3cc712580..d8a54e5af16d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
> > >  EXPORT_SYMBOL_GPL(nr_swap_pages);
> > >  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> > >  long total_swap_pages;
> > > -static int least_priority = -1;
> > > +#define DEF_SWAP_PRIO  -1
> > >  unsigned long swapfile_maximum_size;
> > >  #ifdef CONFIG_MIGRATION
> > >  bool swap_migration_ad_supported;
> > > @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
> > >   * is held and the locking order requires swap_lock to be taken
> > >   * before any swap_info_struct->lock.
> > >   */
> > > -static struct plist_head *swap_avail_heads;
> > > +static PLIST_HEAD(swap_avail_head);
> > >  static DEFINE_SPINLOCK(swap_avail_lock);
> > >
> > >  static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> > > @@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >  /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
> > >  static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> > >  {
> > > -       int nid;
> > >         unsigned long pages;
> > >
> > >         spin_lock(&swap_avail_lock);
> > > @@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> > >                  * swap_avail_lock, to ensure the result can be seen by
> > >                  * add_to_avail_list.
> > >                  */
> > > -               lockdep_assert_held(&si->lock);
> > > +               //lockdep_assert_held(&si->lock);
> >
> > That seems like some debug stuff left over. If you need to remove it, remove it.
>
> I had tried adjusting the lock usage; later I found I need to keep it
> as is, but forgot to change this line back. Will change it in v2.

Ack.

>
> >
> > The rest of the patch looks fine to me. Thanks for working on it. That
> > is a very nice cleanup.
> >
> > Agree with YoungJun Park on removing the numa swap document as well.
>
> Exactly, will do.

Ack.

>
> >
> > Looking forward to your refresh version. I should be able to Ack-by on
> > your next version.
>
> Thank you very much for your great suggestions from a big-picture
> view, and for your help during the process. Really appreciate it.

You are welcome. Thanks for the great work.

I am glad we don't need to carry that complexity any more.

Chris



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-25  2:15     ` Baoquan He
@ 2025-09-25 18:31       ` Chris Li
  0 siblings, 0 replies; 20+ messages in thread
From: Chris Li @ 2025-09-25 18:31 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs,
	YoungJun Park, Aaron Lu

On Wed, Sep 24, 2025 at 7:15 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 09/24/25 at 09:06am, Chris Li wrote:
> > On Wed, Sep 24, 2025 at 8:54 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > BTW, I noticed you did not CC Aaron Lu, who introduced a2468cc9bfdf.
> > I CC him here to give him a chance to test it in his environment and
> > defend his patch. Please CC him on the next version if his email does
> > not bounce.
>
> Sure, definitely we should CC Aaron. Thanks a lot for noticing.

Hi Aaron,

If you are reading this, please let us know, e.g. if you need some
extra time to test it in your setup, that is totally understandable.
The patch might have made sense back then, but the new swap allocator
completely changed a lot of swap allocation behavior, so it is possible
the situation has changed and we don't need to carry it any more.

Thanks

Chris



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-25 18:25     ` Chris Li
@ 2025-09-26 15:31       ` Baoquan He
  2025-09-27  4:46         ` Chris Li
  0 siblings, 1 reply; 20+ messages in thread
From: Baoquan He @ 2025-09-26 15:31 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs, YoungJun Park

On 09/25/25 at 11:25am, Chris Li wrote:
> On Wed, Sep 24, 2025 at 6:55 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 09/24/25 at 08:54am, Chris Li wrote:
> > > Hi Baoquan,
> > >
> > > Very exciting numbers. I have always suspected the per node priority
> > > is not doing much contribution in the new swap allocator world. I did
> > > not expect it to have negative contributions.
> >
> > Yes.
> >
> > While compared with the very beginning, there has been some progress.
> > At the very beginning, there was one plist and swap device will get
> > priority from -1 then downwards by default, then all cpus will exhaust
> > the swap device of pirority '-1', then select swap device of priority
> > '-2' to exhaust, then -3, .... I think node-based adjustment distribute
> > the pressure of contending lock on one swap device a little bit.
> > However, in node they still try to exhaust one swap device by node's
> > CPUs; and nodes w/o swap device attached still try to exhaust swap
> > device one by one in the order of priority.
> >
> > >
> > > On Wed, Sep 24, 2025 at 2:18 AM Baoquan He <bhe@redhat.com> wrote:
> > > >
> > > > Currently, on system with multiple swap devices, swap allocation will
> > > > select one swap device according to priority. The swap device with the
> > > > highest priority will be chosen to allocate firstly.
> > > >
> > > > People can specify a priority from 0 to 32767 when swapon a swap device,
> > > > or the system will set it from -2 then downwards by default. Meanwhile,
> > > > on NUMA system, the swap device with node_id will be considered first
> > > > on that NUMA node of the node_id.
> > >
> > > That behavior was introduced by: a2468cc9bfdf ("swap: choose swap
> > > device according to numa node")
> > > You are effectively reverting that patch and the following fix up
> > > patches on top of that.
> > > The commit message or maybe the title should reflect the reversion nature.
> > >
> > > If you did more than the simple revert plus fix up, please document
> > > what additional change you make in this patch.
> >
> > Sure, I will mention commit a2468cc9bfdf and my patch reverts it, on top
> > of that default priority of swap device will be set to '-1' so that all
> > swap devices with default priority will be chosen round robin. Like
> > this, the si->lock contention can be greatly reduced.
> 
> Just curious: does setting it to "-1" match the kernel behavior before
> a2468cc9bfdf? If not, what was the behavior before a2468cc9bfdf?

It should be like below. It's not real output; I made up the data to
show what it would look like.

# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 15.8G   -1
/dev/zram1 partition  16G    0B   -2
/dev/zram2 partition  16G    0B   -3
/dev/zram3 partition  16G    0B   -4

I just applied this patch and set the priorities to emulate the kernel
behaviour before a2468cc9bfdf. Kernels before a2468cc9bfdf set swap
device priorities from -1 downwards, and there was only one
swap_avail_head plist for all CPUs. The behaviour is very much like below:

[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G    0B    0
/dev/zram1 partition  16G    0B    1
/dev/zram2 partition  16G    0B    2
/dev/zram3 partition  16G 14.3G    3




* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-26 15:31       ` Baoquan He
@ 2025-09-27  4:46         ` Chris Li
  2025-09-28  2:14           ` Baoquan He
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Li @ 2025-09-27  4:46 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs, YoungJun Park

On Fri, Sep 26, 2025 at 8:31 AM Baoquan He <bhe@redhat.com> wrote:
> On 09/25/25 at 11:25am, Chris Li wrote:
> > Just curious: does setting it to "-1" match the kernel behavior before
> > a2468cc9bfdf? If not, what was the behavior before a2468cc9bfdf?
>
> It should be like below. It's not a real output, I made the data to show
> what it looks like.
>
> # swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 15.8G   -1
> /dev/zram1 partition  16G    0B   -2
> /dev/zram2 partition  16G    0B   -3
> /dev/zram3 partition  16G    0B   -4
>
> I just applied this patch and set the priorities to emulate the kernel
> behaviour before a2468cc9bfdf. Kernels before a2468cc9bfdf set swap
> device priorities from -1 downwards, and there was only one
> swap_avail_head plist for all CPUs. The behaviour is very much like below:
>
> [root@hp-dl385g10-03 ~]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G    0B    0
> /dev/zram1 partition  16G    0B    1
> /dev/zram2 partition  16G    0B    2
> /dev/zram3 partition  16G 14.3G    3

I see, in that case it is not a simple revert. It is a revert plus a
change to make the default use round robin. I suggest you split the
patch into two parts: the first is the conceptually clean revert back
to the pre-a2468cc9bfdf state, with no behavior change. The second is
the change to round robin. That will make your behavior change more
obvious.

I suspect that even pre a2468cc9bfdf, round robin might already
outperform per-node priority, which means a much simpler solution has
existed all along. Even though I am curious, I am not demanding the
answer from you. Your test data against the latest kernel, which shows
a great performance improvement, is good enough. We don't have to test
that really old kernel.

Chris



* Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
  2025-09-27  4:46         ` Chris Li
@ 2025-09-28  2:14           ` Baoquan He
  0 siblings, 0 replies; 20+ messages in thread
From: Baoquan He @ 2025-09-28  2:14 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, akpm, kasong, baohua, shikemeng, nphamcs, YoungJun Park

On 09/26/25 at 09:46pm, Chris Li wrote:
> On Fri, Sep 26, 2025 at 8:31 AM Baoquan He <bhe@redhat.com> wrote:
> > On 09/25/25 at 11:25am, Chris Li wrote:
> > > Just curious, does setting it to "-1" match the kernel behavior before
> > > a2468cc9bfdf? If not, what is the behavior before a2468cc9bfdf?
> >
> > It should be like below. It's not real output; I made up the data to show
> > what it looks like.
> >
> > # swapon
> > NAME       TYPE      SIZE  USED PRIO
> > /dev/zram0 partition  16G 15.8G   -1
> > /dev/zram1 partition  16G    0B   -2
> > /dev/zram2 partition  16G    0B   -3
> > /dev/zram3 partition  16G    0B   -4
> >
> > I just applied this patch and set the priorities to emulate the kernel
> > behaviour before a2468cc9bfdf. In kernels before a2468cc9bfdf, priorities
> > were assigned to swap devices from -1 downwards, and there was only one
> > swap_avail_head plist shared by all CPUs. The behaviour is very much like below:
> >
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME       TYPE      SIZE  USED PRIO
> > /dev/zram0 partition  16G    0B    0
> > /dev/zram1 partition  16G    0B    1
> > /dev/zram2 partition  16G    0B    2
> > /dev/zram3 partition  16G 14.3G    3
> 
> I see, in that case it is not a simple revert. It is a revert plus a
> change to make the default use round robin. I suggest you split the
> patch into two parts: the first is the conceptually clean revert back
> to the pre-a2468cc9bfdf state, with no behavior change. The second is
> the change to round robin. That will make your behavior change more
> obvious.

Makes sense, I will make a v3 to split it.

> 
> I suspect that even pre a2468cc9bfdf, round robin might already
> outperform per-node priority, which means a much simpler solution has
> existed all along. Even though I am curious, I am not demanding the
> answer from you. Your test data against the latest kernel, which shows
> a great performance improvement, is good enough. We don't have to test
> that really old kernel.

Yeah, I agree. Just nobody realized it then.




end of thread, other threads:[~2025-09-28  2:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-24  9:17 [PATCH] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
2025-09-24 10:23 ` Baoquan He
2025-09-24 15:41 ` YoungJun Park
2025-09-25  2:24   ` Baoquan He
2025-09-24 15:52 ` YoungJun Park
2025-09-25  4:10   ` Baoquan He
2025-09-25  4:23     ` Baoquan He
2025-09-24 15:54 ` Chris Li
2025-09-24 16:06   ` Chris Li
2025-09-25  2:15     ` Baoquan He
2025-09-25 18:31       ` Chris Li
2025-09-25  1:55   ` Baoquan He
2025-09-25 18:25     ` Chris Li
2025-09-26 15:31       ` Baoquan He
2025-09-27  4:46         ` Chris Li
2025-09-28  2:14           ` Baoquan He
2025-09-24 16:34 ` YoungJun Park
2025-09-25  0:24   ` Baoquan He
2025-09-25  4:36 ` Kairui Song
2025-09-25  6:18   ` Baoquan He
