* [PATCH v4 mm-new 0/2] mm/swapfile.c: select the swap device with default priority round robin
@ 2025-10-11 8:16 Baoquan He
2025-10-11 8:16 ` [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node Baoquan He
2025-10-11 8:16 ` [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin Baoquan He
0 siblings, 2 replies; 22+ messages in thread
From: Baoquan He @ 2025-10-11 8:16 UTC (permalink / raw)
To: linux-mm
Cc: akpm, chrisl, kasong, youngjun.park, aaron.lu, baohua, shikemeng,
nphamcs, Baoquan He
Currently, on system with multiple swap devices, swap allocation will
select one swap device according to priority. The swap device with the
highest priority will be chosen to allocate firstly.
People can specify a priority from 0 to 32767 when swapon a swap device,
or the system will set it from -2 then downwards by default. Meanwhile,
on NUMA system, the swap device with node_id will be considered first
on that NUMA node of the node_id.
In the current code, an array of plist, swap_avail_heads[nid], is used
to organize swap devices on each NUMA node. For each NUMA node, there
is a plist organizing all swap devices. The 'prio' value in the plist
is the negated value of the device's priority due to plist being sorted
from low to high. The swap device owning one node_id will be promoted to
the front position on that NUMA node, then other swap devices are put in
order of their default priority.
E.g I got a system with 8 NUMA nodes, and I setup 4 zram partition as
swap devices.
Current behaviour:
their priorities will be(note that -1 is skipped):
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 0B -2
/dev/zram1 partition 16G 0B -3
/dev/zram2 partition 16G 0B -4
/dev/zram3 partition 16G 0B -5
And their positions in the 8 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
zram0 -> zram1 -> zram2 -> zram3
prio:1 prio:3 prio:4 prio:5
swap_avali_lists[1]: /* node 1's available swap device list */
zram1 -> zram0 -> zram2 -> zram3
prio:1 prio:2 prio:4 prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
zram2 -> zram0 -> zram1 -> zram3
prio:1 prio:2 prio:3 prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
zram3 -> zram0 -> zram1 -> zram2
prio:1 prio:2 prio:3 prio:4
swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
zram0 -> zram1 -> zram2 -> zram3
prio:2 prio:3 prio:4 prio:5
The adjustment for swap device with node_id intended to decrease the
pressure of lock contention for one swap device by taking different
swap device on different node. The adjustment was introduced in commit
a2468cc9bfdf ("swap: choose swap device according to numa node").
However, the adjustment is a little coarse-grained. On the node, the swap
device sharing the node's id will always be selected firstly by node's CPUs
until exhausted, then next one. And on other nodes where no swap device
shares its node id, swap device with priority '-2' will be selected firstly
until exhausted, then next with priority '-3'.
This is the swapon output during the process high pressure vm-scability
test is being taken. It's clearly showing zram0 is heavily exploited until
exhausted.
===================================
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 15.7G -2
/dev/zram1 partition 16G 3.4G -3
/dev/zram2 partition 16G 3.4G -4
/dev/zram3 partition 16G 2.6G -5
The node based strategy on selecting swap device is much better then the
old way one by one selecting swap device. However it is still unreasonable
because swap devices are assumed to have similar accessing speed if no
priority is specified when swapon. It's unfair and doesn't make sense just
because one swap device is swapped on firstly, its priority will be higher
than the one swapped on later.
So in this patchset, change is made to select the swap device round robin
if default priority. In code, the plist array swap_avail_heads[nid] is replaced
with a plist swap_avail_head which reverts commit a2468cc9bfdf. Meanwhile,
on top of the revert, further change is taken to make any device w/o
specified priority get the same default priority '-1'. Surely, swap device
with specified priority are always put foremost, this is not impacted. If
you care about their different accessing speed, then use 'swapon -p xx' to
deploy priority for your swap devices.
New behaviour:
swap_avail_list: /* one global available swap device list */
zram0 -> zram1 -> zram2 -> zram3
prio:1 prio:1 prio:1 prio:1
This is the swapon output during the process high pressure vm-scability
being taken, all is selected round robin:
=======================================
[root@hp-dl385g10-03 linux]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 12.6G -1
/dev/zram1 partition 16G 12.6G -1
/dev/zram2 partition 16G 12.6G -1
/dev/zram3 partition 16G 12.6G -1
With the change, we can see about 18% efficiency promotion as below:
vm-scability test:
==================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
Before: After:
System time: 637.92 s 526.74 s (lower is better)
Sum Throughput: 3546.56 MB/s 4207.56 MB/s (higher is better)
Single process Throughput: 114.40 MB/s 135.72 MB/s (higher is better)
free latency: 10138455.99 us 6810119.01 us (low is better)
Changelog:
==========
v3->v4:
------
- Rebase on the latest mm-new;
- Add Chris's Suggested-by and Acked-by.
v2->v3:
-------
- Split the v2 patch into two parts, one is reverting commit
a2468cc9bfdf, the 2nd is making change to set default priority as -1
for all swap devices which makes swapping out select swap device round
robin. This eases patch reviewing which is suggested by Chris, thanks.
- Fix a LKP reported issue I mistakenly added other debugging code into
v2 patch. clean that up.
v1->v2:
-------
- Remove Documentation/admin-guide/mm/swap_numa.rst;
- Add back mistakenly removed lockdep_assert_held() line;
- Remove the unneeded code comment in _enable_swap_info().
Thanks a lot for careful reviewing from Chris, YoungJun and Kairui.
Baoquan He (2):
mm/swap: do not choose swap device according to numa node
mm/swap: select swap device with default priority round robin
Documentation/admin-guide/mm/swap_numa.rst | 78 ----------------
include/linux/swap.h | 11 +--
mm/swapfile.c | 103 +++------------------
3 files changed, 16 insertions(+), 176 deletions(-)
delete mode 100644 Documentation/admin-guide/mm/swap_numa.rst
--
2.41.0
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-11 8:16 [PATCH v4 mm-new 0/2] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
@ 2025-10-11 8:16 ` Baoquan He
2025-10-11 20:45 ` kernel test robot
2025-10-13 6:09 ` Barry Song
2025-10-11 8:16 ` [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin Baoquan He
1 sibling, 2 replies; 22+ messages in thread
From: Baoquan He @ 2025-10-11 8:16 UTC (permalink / raw)
To: linux-mm
Cc: akpm, chrisl, kasong, youngjun.park, aaron.lu, baohua, shikemeng,
nphamcs, Baoquan He
This reverts commit a2468cc9bfdf ("swap: choose swap device according
to numa node").
After this patch, the behaviour will change back to pre-commit
a2468cc9bfdf. Means the priority will be set from -1 then downwards by
default, and when swapping, it will exhault swap device one by one
according to priority from high to low. This is preparation work for
later change.
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 16G -1
/dev/zram1 partition 16G 966.2M -2
/dev/zram2 partition 16G 0B -3
/dev/zram3 partition 16G 0B -4
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
---
Documentation/admin-guide/mm/swap_numa.rst | 78 ----------------------
include/linux/swap.h | 11 +--
mm/swapfile.c | 76 ++++-----------------
3 files changed, 14 insertions(+), 151 deletions(-)
delete mode 100644 Documentation/admin-guide/mm/swap_numa.rst
diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst
deleted file mode 100644
index 2e630627bcee..000000000000
--- a/Documentation/admin-guide/mm/swap_numa.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-===========================================
-Automatically bind swap device to numa node
-===========================================
-
-If the system has more than one swap device and swap device has the node
-information, we can make use of this information to decide which swap
-device to use in get_swap_pages() to get better performance.
-
-
-How to use this feature
-=======================
-
-Swap device has priority and that decides the order of it to be used. To make
-use of automatically binding, there is no need to manipulate priority settings
-for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
-swapB, with swapA attached to node 0 and swapB attached to node 1, are going
-to be swapped on. Simply swapping them on by doing::
-
- # swapon /dev/swapA
- # swapon /dev/swapB
-
-Then node 0 will use the two swap devices in the order of swapA then swapB and
-node 1 will use the two swap devices in the order of swapB then swapA. Note
-that the order of them being swapped on doesn't matter.
-
-A more complex example on a 4 node machine. Assume 6 swap devices are going to
-be swapped on: swapA and swapB are attached to node 0, swapC is attached to
-node 1, swapD and swapE are attached to node 2 and swapF is attached to node3.
-The way to swap them on is the same as above::
-
- # swapon /dev/swapA
- # swapon /dev/swapB
- # swapon /dev/swapC
- # swapon /dev/swapD
- # swapon /dev/swapE
- # swapon /dev/swapF
-
-Then node 0 will use them in the order of::
-
- swapA/swapB -> swapC -> swapD -> swapE -> swapF
-
-swapA and swapB will be used in a round robin mode before any other swap device.
-
-node 1 will use them in the order of::
-
- swapC -> swapA -> swapB -> swapD -> swapE -> swapF
-
-node 2 will use them in the order of::
-
- swapD/swapE -> swapA -> swapB -> swapC -> swapF
-
-Similaly, swapD and swapE will be used in a round robin mode before any
-other swap devices.
-
-node 3 will use them in the order of::
-
- swapF -> swapA -> swapB -> swapC -> swapD -> swapE
-
-
-Implementation details
-======================
-
-The current code uses a priority based list, swap_avail_list, to decide
-which swap device to use and if multiple swap devices share the same
-priority, they are used round robin. This change here replaces the single
-global swap_avail_list with a per-numa-node list, i.e. for each numa node,
-it sees its own priority based list of available swap devices. Swap
-device's priority can be promoted on its matching node's swap_avail_list.
-
-The current swap device's priority is set as: user can set a >=0 value,
-or the system will pick one starting from -1 then downwards. The priority
-value in the swap_avail_list is the negated value of the swap device's
-due to plist being sorted from low to high. The new policy doesn't change
-the semantics for priority >=0 cases, the previous starting from -1 then
-downwards now becomes starting from -2 then downwards and -1 is reserved
-as the promoted value. So if multiple swap devices are attached to the same
-node, they will all be promoted to priority -1 on that node's plist and will
-be used round robin before any other swap devices.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a4b264817735..38ca3df68716 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -301,16 +301,7 @@ struct swap_info_struct {
struct work_struct discard_work; /* discard worker */
struct work_struct reclaim_work; /* reclaim worker */
struct list_head discard_clusters; /* discard clusters list */
- struct plist_node avail_lists[]; /*
- * entries in swap_avail_heads, one
- * entry per node.
- * Must be last as the number of the
- * array is nr_node_ids, which is not
- * a fixed value so have to allocate
- * dynamically.
- * And it has to be an array so that
- * plist_for_each_* can work.
- */
+ struct plist_node avail_list; /* entry in swap_avail_head */
};
static inline swp_entry_t page_swap_entry(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0c2174d6b924..4a36ea15de2b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -74,7 +74,7 @@ atomic_long_t nr_swap_pages;
EXPORT_SYMBOL_GPL(nr_swap_pages);
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
-static int least_priority = -1;
+static int least_priority;
unsigned long swapfile_maximum_size;
#ifdef CONFIG_MIGRATION
bool swap_migration_ad_supported;
@@ -103,7 +103,7 @@ static PLIST_HEAD(swap_active_head);
* is held and the locking order requires swap_lock to be taken
* before any swap_info_struct->lock.
*/
-static struct plist_head *swap_avail_heads;
+static PLIST_HEAD(swap_avail_head);
static DEFINE_SPINLOCK(swap_avail_lock);
struct swap_info_struct *swap_info[MAX_SWAPFILES];
@@ -1130,7 +1130,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
/* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
{
- int nid;
unsigned long pages;
spin_lock(&swap_avail_lock);
@@ -1159,8 +1158,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
goto skip;
}
- for_each_node(nid)
- plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
+ plist_del(&si->avail_list, &swap_avail_head);
skip:
spin_unlock(&swap_avail_lock);
@@ -1169,7 +1167,6 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
/* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
{
- int nid;
long val;
unsigned long pages;
@@ -1202,8 +1199,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
goto skip;
}
- for_each_node(nid)
- plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
+ plist_add(&si->avail_list, &swap_avail_head);
skip:
spin_unlock(&swap_avail_lock);
@@ -1346,16 +1342,14 @@ static bool swap_alloc_fast(swp_entry_t *entry,
static bool swap_alloc_slow(swp_entry_t *entry,
int order)
{
- int node;
unsigned long offset;
struct swap_info_struct *si, *next;
- node = numa_node_id();
spin_lock(&swap_avail_lock);
start_over:
- plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
+ plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
/* Rotate the device and switch to a new cluster */
- plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
+ plist_requeue(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
@@ -1380,7 +1374,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
* still in the swap_avail_head list then try it, otherwise
* start over if we have not gotten any slots.
*/
- if (plist_node_empty(&next->avail_lists[node]))
+ if (plist_node_empty(&si->avail_list))
goto start_over;
}
spin_unlock(&swap_avail_lock);
@@ -2709,25 +2703,11 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
return generic_swapfile_activate(sis, swap_file, span);
}
-static int swap_node(struct swap_info_struct *si)
-{
- struct block_device *bdev;
-
- if (si->bdev)
- bdev = si->bdev;
- else
- bdev = si->swap_file->f_inode->i_sb->s_bdev;
-
- return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
-}
-
static void setup_swap_info(struct swap_info_struct *si, int prio,
unsigned char *swap_map,
struct swap_cluster_info *cluster_info,
unsigned long *zeromap)
{
- int i;
-
if (prio >= 0)
si->prio = prio;
else
@@ -2737,16 +2717,7 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
* low-to-high, while swap ordering is high-to-low
*/
si->list.prio = -si->prio;
- for_each_node(i) {
- if (si->prio >= 0)
- si->avail_lists[i].prio = -si->prio;
- else {
- if (swap_node(si) == i)
- si->avail_lists[i].prio = 1;
- else
- si->avail_lists[i].prio = -si->prio;
- }
- }
+ si->avail_list.prio = -si->prio;
si->swap_map = swap_map;
si->cluster_info = cluster_info;
si->zeromap = zeromap;
@@ -2924,10 +2895,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
plist_for_each_entry_continue(si, &swap_active_head, list) {
si->prio++;
si->list.prio--;
- for_each_node(nid) {
- if (si->avail_lists[nid].prio != 1)
- si->avail_lists[nid].prio--;
- }
+ si->avail_list.prio--;
}
least_priority++;
}
@@ -3168,9 +3136,8 @@ static struct swap_info_struct *alloc_swap_info(void)
struct swap_info_struct *p;
struct swap_info_struct *defer = NULL;
unsigned int type;
- int i;
- p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
+ p = kvzalloc(sizeof(struct swap_info_struct), GFP_KERNEL);
if (!p)
return ERR_PTR(-ENOMEM);
@@ -3209,8 +3176,7 @@ static struct swap_info_struct *alloc_swap_info(void)
}
p->swap_extent_root = RB_ROOT;
plist_node_init(&p->list, 0);
- for_each_node(i)
- plist_node_init(&p->avail_lists[i], 0);
+ plist_node_init(&p->avail_list, 0);
p->flags = SWP_USED;
spin_unlock(&swap_lock);
if (defer) {
@@ -3467,9 +3433,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- if (!swap_avail_heads)
- return -ENOMEM;
-
si = alloc_swap_info();
if (IS_ERR(si))
return PTR_ERR(si);
@@ -4079,7 +4042,6 @@ static bool __has_usable_swap(void)
void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
{
struct swap_info_struct *si, *next;
- int nid = folio_nid(folio);
if (!(gfp & __GFP_IO))
return;
@@ -4098,8 +4060,8 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
return;
spin_lock(&swap_avail_lock);
- plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
- avail_lists[nid]) {
+ plist_for_each_entry_safe(si, next, &swap_avail_head,
+ avail_list) {
if (si->bdev) {
blkcg_schedule_throttle(si->bdev->bd_disk, true);
break;
@@ -4111,18 +4073,6 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
static int __init swapfile_init(void)
{
- int nid;
-
- swap_avail_heads = kmalloc_array(nr_node_ids, sizeof(struct plist_head),
- GFP_KERNEL);
- if (!swap_avail_heads) {
- pr_emerg("Not enough memory for swap heads, swap is disabled\n");
- return -ENOMEM;
- }
-
- for_each_node(nid)
- plist_head_init(&swap_avail_heads[nid]);
-
swapfile_maximum_size = arch_max_swapfile_size();
/*
--
2.41.0
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-11 8:16 [PATCH v4 mm-new 0/2] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
2025-10-11 8:16 ` [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node Baoquan He
@ 2025-10-11 8:16 ` Baoquan He
2025-10-12 20:40 ` Barry Song
1 sibling, 1 reply; 22+ messages in thread
From: Baoquan He @ 2025-10-11 8:16 UTC (permalink / raw)
To: linux-mm
Cc: akpm, chrisl, kasong, youngjun.park, aaron.lu, baohua, shikemeng,
nphamcs, Baoquan He
Swap devices are assumed to have similar accessing speed if no priority
is specified when swapon. It's unfair and doesn't make sense just because
one swap device is swapped on firstly, its priority will be higher than
the one swapped on later.
Here, set all swap devicess to have priority '-1' by default. With this
change, swap device with default priority will be selected round robin
when swapping out. This can improve the swapping efficiency a lot among
multiple swap devices with default priority.
Below are swapon output during processes high pressure vm-scability test
is being taken:
1) This is pre-commit a2468cc9bfdf, swap device is selectd one by one by
priority from high to low when one swap device is exhausted:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 16G -1
/dev/zram1 partition 16G 966.2M -2
/dev/zram2 partition 16G 0B -3
/dev/zram3 partition 16G 0B -4
2) This is behaviour with commit a2468cc9bfdf, on node, swap device
sharing the same node id is selected firstly until exhausted; while
on node no swap device sharing the node id it selects the one with
highest priority until exhaustd:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 15.7G -2
/dev/zram1 partition 16G 3.4G -3
/dev/zram2 partition 16G 3.4G -4
/dev/zram3 partition 16G 2.6G -5
3) After this patch applied, swap devices with default priority are selectd
round robin:
------------------------------------
[root@hp-dl385g10-03 block]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 6.6G -1
/dev/zram1 partition 16G 6.6G -1
/dev/zram2 partition 16G 6.6G -1
/dev/zram3 partition 16G 6.6G -1
With the change, we can see about 18% efficiency promotion relative to
node based way as below. (Surely, the pre-commit a2468cc9bfdf way is
the worst.)
vm-scability test:
==================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
one by one: node based: round robin:
System time: 1087.38 s 637.92 s 526.74 s (lower is better)
Sum Throughput: 2036.55 MB/s 3546.56 MB/s 4207.56 MB/s (higher is better)
Single process Throughput: 65.69 MB/s 114.40 MB/s 135.72 MB/s (high is better)
free latency: 15769409.48 us 10138455.99 us 6810119.01 us(lower is better)
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
---
mm/swapfile.c | 31 ++++---------------------------
1 file changed, 4 insertions(+), 27 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4a36ea15de2b..5bd65cb56a77 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -74,7 +74,7 @@ atomic_long_t nr_swap_pages;
EXPORT_SYMBOL_GPL(nr_swap_pages);
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
-static int least_priority;
+#define DEF_SWAP_PRIO -1
unsigned long swapfile_maximum_size;
#ifdef CONFIG_MIGRATION
bool swap_migration_ad_supported;
@@ -2708,10 +2708,7 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
struct swap_cluster_info *cluster_info,
unsigned long *zeromap)
{
- if (prio >= 0)
- si->prio = prio;
- else
- si->prio = --least_priority;
+ si->prio = prio;
/*
* the plist prio is negated because plist ordering is
* low-to-high, while swap ordering is high-to-low
@@ -2729,16 +2726,7 @@ static void _enable_swap_info(struct swap_info_struct *si)
total_swap_pages += si->pages;
assert_spin_locked(&swap_lock);
- /*
- * both lists are plists, and thus priority ordered.
- * swap_active_head needs to be priority ordered for swapoff(),
- * which on removal of any swap_info_struct with an auto-assigned
- * (i.e. negative) priority increments the auto-assigned priority
- * of any lower-priority swap_info_structs.
- * swap_avail_head needs to be priority ordered for folio_alloc_swap(),
- * which allocates swap pages from the highest available priority
- * swap_info_struct.
- */
+
plist_add(&si->list, &swap_active_head);
/* Add back to available list */
@@ -2888,17 +2876,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
}
spin_lock(&p->lock);
del_from_avail_list(p, true);
- if (p->prio < 0) {
- struct swap_info_struct *si = p;
- int nid;
-
- plist_for_each_entry_continue(si, &swap_active_head, list) {
- si->prio++;
- si->list.prio--;
- si->avail_list.prio--;
- }
- least_priority++;
- }
plist_del(&p->list, &swap_active_head);
atomic_long_sub(p->pages, &nr_swap_pages);
total_swap_pages -= p->pages;
@@ -3609,7 +3586,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
}
mutex_lock(&swapon_mutex);
- prio = -1;
+ prio = DEF_SWAP_PRIO;
if (swap_flags & SWAP_FLAG_PREFER)
prio = swap_flags & SWAP_FLAG_PRIO_MASK;
enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
--
2.41.0
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-11 8:16 ` [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node Baoquan He
@ 2025-10-11 20:45 ` kernel test robot
2025-10-11 22:04 ` Andrew Morton
2025-10-13 6:09 ` Barry Song
1 sibling, 1 reply; 22+ messages in thread
From: kernel test robot @ 2025-10-11 20:45 UTC (permalink / raw)
To: Baoquan He, linux-mm
Cc: oe-kbuild-all, akpm, chrisl, kasong, youngjun.park, baohua,
shikemeng, nphamcs, Baoquan He
Hi Baoquan,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Baoquan-He/mm-swap-do-not-choose-swap-device-according-to-numa-node/20251011-161743
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251011081624.224202-2-bhe%40redhat.com
patch subject: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
config: i386-buildonly-randconfig-003-20251012 (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510120456.vTSygUXV-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
In file included from include/linux/list.h:5,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:7,
from include/linux/highmem.h:5,
from include/linux/bvec.h:10,
from include/linux/blk_types.h:10,
from include/linux/blkdev.h:9,
from mm/swapfile.c:9:
mm/swapfile.c: In function 'swap_sync_discard':
>> mm/swapfile.c:1395:46: error: 'swap_avail_heads' undeclared (first use in this function); did you mean 'swap_avail_head'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~
include/linux/container_of.h:20:33: note: in definition of macro 'container_of'
20 | void *__mptr = (void *)(ptr); \
| ^~~
include/linux/list.h:620:9: note: in expansion of macro 'list_entry'
620 | list_entry((ptr)->next, type, member)
| ^~~~~~~~~~
include/linux/list.h:869:20: note: in expansion of macro 'list_first_entry'
869 | for (pos = list_first_entry(head, typeof(*pos), member), \
| ^~~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:46: note: each undeclared identifier is reported only once for each function it appears in
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~
include/linux/container_of.h:20:33: note: in definition of macro 'container_of'
20 | void *__mptr = (void *)(ptr); \
| ^~~
include/linux/list.h:620:9: note: in expansion of macro 'list_entry'
620 | list_entry((ptr)->next, type, member)
| ^~~~~~~~~~
include/linux/list.h:869:20: note: in expansion of macro 'list_first_entry'
869 | for (pos = list_first_entry(head, typeof(*pos), member), \
| ^~~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/linux/init.h:5,
from include/linux/printk.h:6,
from include/asm-generic/bug.h:22,
from arch/x86/include/asm/bug.h:103,
from include/linux/bug.h:5,
from include/linux/vfsdebug.h:5,
from include/linux/fs.h:5:
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:620:9: note: in expansion of macro 'list_entry'
620 | list_entry((ptr)->next, type, member)
| ^~~~~~~~~~
include/linux/list.h:869:20: note: in expansion of macro 'list_first_entry'
869 | for (pos = list_first_entry(head, typeof(*pos), member), \
| ^~~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/compiler_types.h:530:27: error: expression in static assertion is not an integer
530 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:620:9: note: in expansion of macro 'list_entry'
620 | list_entry((ptr)->next, type, member)
| ^~~~~~~~~~
include/linux/list.h:869:20: note: in expansion of macro 'list_first_entry'
869 | for (pos = list_first_entry(head, typeof(*pos), member), \
| ^~~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/uapi/linux/posix_types.h:5,
from include/uapi/linux/types.h:14,
from include/linux/types.h:6,
from include/linux/blkdev.h:8:
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/stddef.h:16:58: note: in definition of macro 'offsetof'
16 | #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
| ^~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:620:9: note: in expansion of macro 'list_entry'
620 | list_entry((ptr)->next, type, member)
| ^~~~~~~~~~
include/linux/list.h:869:20: note: in expansion of macro 'list_first_entry'
869 | for (pos = list_first_entry(head, typeof(*pos), member), \
| ^~~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/container_of.h:20:33: note: in definition of macro 'container_of'
20 | void *__mptr = (void *)(ptr); \
| ^~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:870:21: note: in expansion of macro 'list_next_entry'
870 | n = list_next_entry(pos, member); \
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:870:21: note: in expansion of macro 'list_next_entry'
870 | n = list_next_entry(pos, member); \
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:870:21: note: in expansion of macro 'list_next_entry'
870 | n = list_next_entry(pos, member); \
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:22:23: note: in expansion of macro '__same_type'
22 | __same_type(*(ptr), void), \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:870:21: note: in expansion of macro 'list_next_entry'
870 | n = list_next_entry(pos, member); \
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/compiler_types.h:530:27: error: expression in static assertion is not an integer
530 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:870:21: note: in expansion of macro 'list_next_entry'
870 | n = list_next_entry(pos, member); \
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/stddef.h:16:58: note: in definition of macro 'offsetof'
16 | #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
| ^~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:870:21: note: in expansion of macro 'list_next_entry'
870 | n = list_next_entry(pos, member); \
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/list.h:869:64: warning: left-hand operand of comma expression has no effect [-Wunused-value]
869 | for (pos = list_first_entry(head, typeof(*pos), member), \
| ^
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/list.h:773:28: note: in definition of macro 'list_entry_is_head'
773 | list_is_head(&pos->member, (head))
| ^~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/container_of.h:20:33: note: in definition of macro 'container_of'
20 | void *__mptr = (void *)(ptr); \
| ^~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:872:27: note: in expansion of macro 'list_next_entry'
872 | pos = n, n = list_next_entry(n, member))
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:872:27: note: in expansion of macro 'list_next_entry'
872 | pos = n, n = list_next_entry(n, member))
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:872:27: note: in expansion of macro 'list_next_entry'
872 | pos = n, n = list_next_entry(n, member))
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:22:23: note: in expansion of macro '__same_type'
22 | __same_type(*(ptr), void), \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:872:27: note: in expansion of macro 'list_next_entry'
872 | pos = n, n = list_next_entry(n, member))
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/compiler_types.h:530:27: error: expression in static assertion is not an integer
530 | #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
include/linux/container_of.h:21:9: note: in expansion of macro 'static_assert'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~~~
include/linux/container_of.h:21:23: note: in expansion of macro '__same_type'
21 | static_assert(__same_type(*(ptr), ((type *)0)->member) || \
| ^~~~~~~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:872:27: note: in expansion of macro 'list_next_entry'
872 | pos = n, n = list_next_entry(n, member))
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1395:69: error: 'struct swap_info_struct' has no member named 'avail_lists'; did you mean 'avail_list'?
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~
include/linux/stddef.h:16:58: note: in definition of macro 'offsetof'
16 | #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
| ^~~~~~
include/linux/list.h:609:9: note: in expansion of macro 'container_of'
609 | container_of(ptr, type, member)
| ^~~~~~~~~~~~
include/linux/list.h:667:9: note: in expansion of macro 'list_entry'
667 | list_entry((pos)->member.next, typeof(*(pos)), member)
| ^~~~~~~~~~
include/linux/list.h:872:27: note: in expansion of macro 'list_next_entry'
872 | pos = n, n = list_next_entry(n, member))
| ^~~~~~~~~~~~~~~
include/linux/plist.h:197:9: note: in expansion of macro 'list_for_each_entry_safe'
197 | list_for_each_entry_safe(pos, n, &(head)->node_list, m.node_list)
| ^~~~~~~~~~~~~~~~~~~~~~~~
mm/swapfile.c:1395:9: note: in expansion of macro 'plist_for_each_entry_safe'
1395 | plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/swapfile.c:1391:13: warning: variable 'nid' set but not used [-Wunused-but-set-variable]
1391 | int nid = numa_node_id();
| ^~~
mm/swapfile.c: In function '__do_sys_swapoff':
mm/swapfile.c:2893:21: warning: unused variable 'nid' [-Wunused-variable]
2893 | int nid;
| ^~~
vim +1395 mm/swapfile.c
b487a2da3575b6 Kairui Song 2025-03-14 1383
74cdc690138b65 Kairui Song 2025-10-07 1384 /*
74cdc690138b65 Kairui Song 2025-10-07 1385 * Discard pending clusters in a synchronized way when under high pressure.
74cdc690138b65 Kairui Song 2025-10-07 1386 * Return: true if any cluster is discarded.
74cdc690138b65 Kairui Song 2025-10-07 1387 */
74cdc690138b65 Kairui Song 2025-10-07 1388 static bool swap_sync_discard(void)
74cdc690138b65 Kairui Song 2025-10-07 1389 {
74cdc690138b65 Kairui Song 2025-10-07 1390 bool ret = false;
74cdc690138b65 Kairui Song 2025-10-07 @1391 int nid = numa_node_id();
74cdc690138b65 Kairui Song 2025-10-07 1392 struct swap_info_struct *si, *next;
74cdc690138b65 Kairui Song 2025-10-07 1393
74cdc690138b65 Kairui Song 2025-10-07 1394 spin_lock(&swap_avail_lock);
74cdc690138b65 Kairui Song 2025-10-07 @1395 plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
74cdc690138b65 Kairui Song 2025-10-07 1396 spin_unlock(&swap_avail_lock);
74cdc690138b65 Kairui Song 2025-10-07 1397 if (get_swap_device_info(si)) {
74cdc690138b65 Kairui Song 2025-10-07 1398 if (si->flags & SWP_PAGE_DISCARD)
74cdc690138b65 Kairui Song 2025-10-07 1399 ret = swap_do_scheduled_discard(si);
74cdc690138b65 Kairui Song 2025-10-07 1400 put_swap_device(si);
74cdc690138b65 Kairui Song 2025-10-07 1401 }
74cdc690138b65 Kairui Song 2025-10-07 1402 if (ret)
74cdc690138b65 Kairui Song 2025-10-07 1403 break;
74cdc690138b65 Kairui Song 2025-10-07 1404 spin_lock(&swap_avail_lock);
74cdc690138b65 Kairui Song 2025-10-07 1405 }
74cdc690138b65 Kairui Song 2025-10-07 1406 spin_unlock(&swap_avail_lock);
74cdc690138b65 Kairui Song 2025-10-07 1407
74cdc690138b65 Kairui Song 2025-10-07 1408 return ret;
74cdc690138b65 Kairui Song 2025-10-07 1409 }
74cdc690138b65 Kairui Song 2025-10-07 1410
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-11 20:45 ` kernel test robot
@ 2025-10-11 22:04 ` Andrew Morton
2025-10-12 2:08 ` Baoquan He
2025-10-14 11:56 ` Baoquan He
0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2025-10-11 22:04 UTC (permalink / raw)
To: kernel test robot
Cc: Baoquan He, linux-mm, oe-kbuild-all, chrisl, kasong,
youngjun.park, baohua, shikemeng, nphamcs
On Sun, 12 Oct 2025 04:45:07 +0800 kernel test robot <lkp@intel.com> wrote:
> Hi Baoquan,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Baoquan-He/mm-swap-do-not-choose-swap-device-according-to-numa-node/20251011-161743
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251011081624.224202-2-bhe%40redhat.com
> patch subject: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
> config: i386-buildonly-randconfig-003-20251012 (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/config)
> compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202510120456.vTSygUXV-lkp@intel.com/
>
> All error/warnings (new ones prefixed by >>):
>
> In file included from include/linux/list.h:5,
> from include/linux/wait.h:7,
> from include/linux/wait_bit.h:8,
> from include/linux/fs.h:7,
> from include/linux/highmem.h:5,
> from include/linux/bvec.h:10,
> from include/linux/blk_types.h:10,
> from include/linux/blkdev.h:9,
> from mm/swapfile.c:9:
Thanks. I did this. Baoquan, please review and retest?
--- a/mm/swapfile.c~mm-swap-do-not-choose-swap-device-according-to-numa-node-fix
+++ a/mm/swapfile.c
@@ -1388,11 +1388,10 @@ start_over:
static bool swap_sync_discard(void)
{
bool ret = false;
- int nid = numa_node_id();
struct swap_info_struct *si, *next;
spin_lock(&swap_avail_lock);
- plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+ plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
if (si->flags & SWP_PAGE_DISCARD)
@@ -2890,7 +2889,6 @@ SYSCALL_DEFINE1(swapoff, const char __us
del_from_avail_list(p, true);
if (p->prio < 0) {
struct swap_info_struct *si = p;
- int nid;
plist_for_each_entry_continue(si, &swap_active_head, list) {
si->prio++;
_
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-11 22:04 ` Andrew Morton
@ 2025-10-12 2:08 ` Baoquan He
2025-10-14 11:56 ` Baoquan He
1 sibling, 0 replies; 22+ messages in thread
From: Baoquan He @ 2025-10-12 2:08 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel test robot, linux-mm, oe-kbuild-all, chrisl, kasong,
youngjun.park, baohua, shikemeng, nphamcs
On 10/11/25 at 03:04pm, Andrew Morton wrote:
> On Sun, 12 Oct 2025 04:45:07 +0800 kernel test robot <lkp@intel.com> wrote:
>
> > Hi Baoquan,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on akpm-mm/mm-everything]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Baoquan-He/mm-swap-do-not-choose-swap-device-according-to-numa-node/20251011-161743
> > base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > patch link: https://lore.kernel.org/r/20251011081624.224202-2-bhe%40redhat.com
> > patch subject: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
> > config: i386-buildonly-randconfig-003-20251012 (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/config)
> > compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202510120456.vTSygUXV-lkp@intel.com/
> >
> > All error/warnings (new ones prefixed by >>):
> >
> > In file included from include/linux/list.h:5,
> > from include/linux/wait.h:7,
> > from include/linux/wait_bit.h:8,
> > from include/linux/fs.h:7,
> > from include/linux/highmem.h:5,
> > from include/linux/bvec.h:10,
> > from include/linux/blk_types.h:10,
> > from include/linux/blkdev.h:9,
> > from mm/swapfile.c:9:
>
> Thanks. I did this. Baoquan, please review and retest?
>
>
> --- a/mm/swapfile.c~mm-swap-do-not-choose-swap-device-according-to-numa-node-fix
> +++ a/mm/swapfile.c
> @@ -1388,11 +1388,10 @@ start_over:
> static bool swap_sync_discard(void)
> {
> bool ret = false;
> - int nid = numa_node_id();
> struct swap_info_struct *si, *next;
>
> spin_lock(&swap_avail_lock);
> - plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
> + plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
> spin_unlock(&swap_avail_lock);
> if (get_swap_device_info(si)) {
> if (si->flags & SWP_PAGE_DISCARD)
> @@ -2890,7 +2889,6 @@ SYSCALL_DEFINE1(swapoff, const char __us
> del_from_avail_list(p, true);
> if (p->prio < 0) {
> struct swap_info_struct *si = p;
> - int nid;
>
> plist_for_each_entry_continue(si, &swap_active_head, list) {
> si->prio++;
The code change looks good to me, and I applied this on top of my
patches and build, the building passed, while the link failed. I haven't
got the reason. I will go outside now and will be back in the evening to
fix the linking error and test.
SORTTAB vmlinux.unstripped
incomplete ORC unwind tables in file: vmlinux.unstripped
Failed to sort kernel tables
make[2]: *** [scripts/Makefile.vmlinux:72: vmlinux.unstripped] Error 1
Thanks
Baoquan
> _
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-11 8:16 ` [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin Baoquan He
@ 2025-10-12 20:40 ` Barry Song
2025-10-13 3:58 ` Baoquan He
2025-10-14 22:01 ` Chris Li
0 siblings, 2 replies; 22+ messages in thread
From: Barry Song @ 2025-10-12 20:40 UTC (permalink / raw)
To: Baoquan He
Cc: linux-mm, akpm, chrisl, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Sun, Oct 12, 2025 at 5:14 AM Baoquan He <bhe@redhat.com> wrote:
>
> Swap devices are assumed to have similar accessing speed if no priority
> is specified when swapon. It's unfair and doesn't make sense just because
> one swap device is swapped on firstly, its priority will be higher than
> the one swapped on later.
>
> Here, set all swap devicess to have priority '-1' by default. With this
> change, swap device with default priority will be selected round robin
> when swapping out. This can improve the swapping efficiency a lot among
> multiple swap devices with default priority.
>
> Below are swapon output during processes high pressure vm-scability test
> is being taken:
>
> 1) This is pre-commit a2468cc9bfdf, swap device is selectd one by one by
> priority from high to low when one swap device is exhausted:
> ------------------------------------
> [root@hp-dl385g10-03 ~]# swapon
> NAME TYPE SIZE USED PRIO
> /dev/zram0 partition 16G 16G -1
> /dev/zram1 partition 16G 966.2M -2
> /dev/zram2 partition 16G 0B -3
> /dev/zram3 partition 16G 0B -4
>
> 2) This is behaviour with commit a2468cc9bfdf, on node, swap device
> sharing the same node id is selected firstly until exhausted; while
> on node no swap device sharing the node id it selects the one with
> highest priority until exhaustd:
> ------------------------------------
> [root@hp-dl385g10-03 ~]# swapon
> NAME TYPE SIZE USED PRIO
> /dev/zram0 partition 16G 15.7G -2
> /dev/zram1 partition 16G 3.4G -3
> /dev/zram2 partition 16G 3.4G -4
> /dev/zram3 partition 16G 2.6G -5
>
> 3) After this patch applied, swap devices with default priority are selectd
> round robin:
> ------------------------------------
> [root@hp-dl385g10-03 block]# swapon
> NAME TYPE SIZE USED PRIO
> /dev/zram0 partition 16G 6.6G -1
> /dev/zram1 partition 16G 6.6G -1
> /dev/zram2 partition 16G 6.6G -1
> /dev/zram3 partition 16G 6.6G -1
>
> With the change, we can see about 18% efficiency promotion relative to
> node based way as below. (Surely, the pre-commit a2468cc9bfdf way is
> the worst.)
>
I’m not against the behavior change; but the swapon man page says:
"
Each swap area has a priority, either high or low. The default
priority is low. Within the low-priority areas, newer areas are
even lower priority than older areas.
"
So my question is whether users still assume that newly added swap areas
get a lower priority than the older ones?
I assume the priority decrement isn’t a stable ABI, so this change won’t
break userspace?
Or if someone sets up Linux assuming that a newer swap file will only be
used after the older one is full, then this change would break those cases?
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-12 20:40 ` Barry Song
@ 2025-10-13 3:58 ` Baoquan He
2025-10-13 6:17 ` Barry Song
2025-10-14 22:01 ` Chris Li
1 sibling, 1 reply; 22+ messages in thread
From: Baoquan He @ 2025-10-13 3:58 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, akpm, chrisl, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On 10/13/25 at 04:40am, Barry Song wrote:
> On Sun, Oct 12, 2025 at 5:14 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > Swap devices are assumed to have similar accessing speed if no priority
> > is specified when swapon. It's unfair and doesn't make sense just because
> > one swap device is swapped on firstly, its priority will be higher than
> > the one swapped on later.
> >
> > Here, set all swap devicess to have priority '-1' by default. With this
> > change, swap device with default priority will be selected round robin
> > when swapping out. This can improve the swapping efficiency a lot among
> > multiple swap devices with default priority.
> >
> > Below are swapon output during processes high pressure vm-scability test
> > is being taken:
> >
> > 1) This is pre-commit a2468cc9bfdf, swap device is selectd one by one by
> > priority from high to low when one swap device is exhausted:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 16G -1
> > /dev/zram1 partition 16G 966.2M -2
> > /dev/zram2 partition 16G 0B -3
> > /dev/zram3 partition 16G 0B -4
> >
> > 2) This is behaviour with commit a2468cc9bfdf, on node, swap device
> > sharing the same node id is selected firstly until exhausted; while
> > on node no swap device sharing the node id it selects the one with
> > highest priority until exhaustd:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 15.7G -2
> > /dev/zram1 partition 16G 3.4G -3
> > /dev/zram2 partition 16G 3.4G -4
> > /dev/zram3 partition 16G 2.6G -5
> >
> > 3) After this patch applied, swap devices with default priority are selectd
> > round robin:
> > ------------------------------------
> > [root@hp-dl385g10-03 block]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 6.6G -1
> > /dev/zram1 partition 16G 6.6G -1
> > /dev/zram2 partition 16G 6.6G -1
> > /dev/zram3 partition 16G 6.6G -1
> >
> > With the change, we can see about 18% efficiency promotion relative to
> > node based way as below. (Surely, the pre-commit a2468cc9bfdf way is
> > the worst.)
> >
Thanks a lot for reviewing, Barry.
>
> I’m not against the behavior change; but the swapon man page says:
> "
> Each swap area has a priority, either high or low. The default
> priority is low. Within the low-priority areas, newer areas are
> even lower priority than older areas.
I didn't see this in man 8 page of swapon, while see it in man 2 page.
Means people may feel that change when they call the call swapon()
syscall, but people may not cares about in script or something like that?
> "
> So my question is whether users still assume that newly added swap areas
> get a lower priority than the older ones?
>
> I assume the priority decrement isn’t a stable ABI, so this change won’t
> break userspace?
Hmm, I would say that this will change the assumption, BUT I don't start
it. That assumption has been broken since the numa based swap device
choosing at below commit:
commit a2468cc9bfdf ("swap: choose swap device according to numa node").
Before commit a2468cc9bfdf, swapon behaviour is taken strictly as the
man page states. The earlier the swap device is added, the higher its
default priority is. And the highest priority device is used up, then
the 2nd highest priority swap device, and so on in sequence. Below
swapon output demonstrate.
===============================
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 16G -1
/dev/zram1 partition 16G 966.2M -2
/dev/zram2 partition 16G 0B -3
/dev/zram3 partition 16G 0B -4
However, after commit a2468cc9bfdf applied, above behaviour had been
changed. I can give an extreme example, imagine on a system with one
NUMA Node, node_id is 0. Then I swapon several swap devices w/o node_id
value (namely node_id is -1), at last I swapon one device with node_id
0. You can see the last one will have the highest priority to be chosen,
then other swap devices.
So I would argue that if people realy care about the default priority,
it has been broken since 2017 when commit a2468cc9bfdf was introduce,
and complaint would be heard since long before. While we didn't hear
complaint, means the default priority doesn't really matter?
>
> Or if someone sets up Linux assuming that a newer swap file will only be
> used after the older one is full, then this change would break those cases?
Hmm, it could happen, but I doubt people really count on that. I would use
'swapon -p xx' to specify explicit priority to make sure it. In the case you
said, swapped out pages will be swapped in, it's either not guaranteed.
Thanks
Baoquan
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-11 8:16 ` [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node Baoquan He
2025-10-11 20:45 ` kernel test robot
@ 2025-10-13 6:09 ` Barry Song
2025-10-14 21:50 ` Chris Li
2025-10-15 3:06 ` Baoquan He
1 sibling, 2 replies; 22+ messages in thread
From: Barry Song @ 2025-10-13 6:09 UTC (permalink / raw)
To: Baoquan He
Cc: linux-mm, akpm, chrisl, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
> -static int swap_node(struct swap_info_struct *si)
> -{
> - struct block_device *bdev;
> -
> - if (si->bdev)
> - bdev = si->bdev;
> - else
> - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> -
> - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> -}
> -
Looking at the code, it seems to have some hardware affinity awareness,
as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
each node has a closer block device?
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-13 3:58 ` Baoquan He
@ 2025-10-13 6:17 ` Barry Song
2025-10-13 23:07 ` Baoquan He
2025-10-14 22:11 ` Chris Li
0 siblings, 2 replies; 22+ messages in thread
From: Barry Song @ 2025-10-13 6:17 UTC (permalink / raw)
To: Baoquan He
Cc: linux-mm, akpm, chrisl, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Mon, Oct 13, 2025 at 11:58 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 10/13/25 at 04:40am, Barry Song wrote:
> > On Sun, Oct 12, 2025 at 5:14 AM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > Swap devices are assumed to have similar accessing speed if no priority
> > > is specified when swapon. It's unfair and doesn't make sense just because
> > > one swap device is swapped on firstly, its priority will be higher than
> > > the one swapped on later.
> > >
> > > Here, set all swap devicess to have priority '-1' by default. With this
> > > change, swap device with default priority will be selected round robin
> > > when swapping out. This can improve the swapping efficiency a lot among
> > > multiple swap devices with default priority.
> > >
> > > Below are swapon output during processes high pressure vm-scability test
> > > is being taken:
> > >
> > > 1) This is pre-commit a2468cc9bfdf, swap device is selectd one by one by
> > > priority from high to low when one swap device is exhausted:
> > > ------------------------------------
> > > [root@hp-dl385g10-03 ~]# swapon
> > > NAME TYPE SIZE USED PRIO
> > > /dev/zram0 partition 16G 16G -1
> > > /dev/zram1 partition 16G 966.2M -2
> > > /dev/zram2 partition 16G 0B -3
> > > /dev/zram3 partition 16G 0B -4
> > >
> > > 2) This is behaviour with commit a2468cc9bfdf, on node, swap device
> > > sharing the same node id is selected firstly until exhausted; while
> > > on node no swap device sharing the node id it selects the one with
> > > highest priority until exhaustd:
> > > ------------------------------------
> > > [root@hp-dl385g10-03 ~]# swapon
> > > NAME TYPE SIZE USED PRIO
> > > /dev/zram0 partition 16G 15.7G -2
> > > /dev/zram1 partition 16G 3.4G -3
> > > /dev/zram2 partition 16G 3.4G -4
> > > /dev/zram3 partition 16G 2.6G -5
> > >
> > > 3) After this patch applied, swap devices with default priority are selectd
> > > round robin:
> > > ------------------------------------
> > > [root@hp-dl385g10-03 block]# swapon
> > > NAME TYPE SIZE USED PRIO
> > > /dev/zram0 partition 16G 6.6G -1
> > > /dev/zram1 partition 16G 6.6G -1
> > > /dev/zram2 partition 16G 6.6G -1
> > > /dev/zram3 partition 16G 6.6G -1
> > >
> > > With the change, we can see about 18% efficiency promotion relative to
> > > node based way as below. (Surely, the pre-commit a2468cc9bfdf way is
> > > the worst.)
> > >
>
> Thanks a lot for reviewing, Barry.
>
> >
> > I’m not against the behavior change; but the swapon man page says:
> > "
> > Each swap area has a priority, either high or low. The default
> > priority is low. Within the low-priority areas, newer areas are
> > even lower priority than older areas.
>
> I didn't see this in man 8 page of swapon, while see it in man 2 page.
> Means people may feel that change when they call the call swapon()
> syscall, but people may not cares about in script or something like that?
>
> > "
> > So my question is whether users still assume that newly added swap areas
> > get a lower priority than the older ones?
> >
> > I assume the priority decrement isn’t a stable ABI, so this change won’t
> > break userspace?
>
> Hmm, I would say that this will change the assumption, BUT I don't start
> it. That assumption has been broken since the numa based swap device
> choosing at below commit:
>
> commit a2468cc9bfdf ("swap: choose swap device according to numa node").
>
> Before commit a2468cc9bfdf, swapon behaviour is taken strictly as the
> man page states. The earlier the swap device is added, the higher its
> default priority is. And the highest priority device is used up, then
> the 2nd highest priority swap device, and so on in sequence. Below
> swapon output demonstrate.
> ===============================
> [root@hp-dl385g10-03 ~]# swapon
> NAME TYPE SIZE USED PRIO
> /dev/zram0 partition 16G 16G -1
> /dev/zram1 partition 16G 966.2M -2
> /dev/zram2 partition 16G 0B -3
> /dev/zram3 partition 16G 0B -4
>
> However, after commit a2468cc9bfdf applied, above behaviour had been
> changed. I can give an extreme example, imagine on a system with one
> NUMA Node, node_id is 0. Then I swapon several swap devices w/o node_id
> value (namely node_id is -1), at last I swapon one device with node_id
> 0. You can see the last one will have the highest priority to be chosen,
> then other swap devices.
I assume this adds logic to prefer swapping to the closer swapfile first,
while still maintaining the old behavior for non-NUMA cases.
>
> So I would argue that if people realy care about the default priority,
> it has been broken since 2017 when commit a2468cc9bfdf was introduce,
> and complaint would be heard since long before. While we didn't hear
> complaint, means the default priority doesn't really matter?
> >
> > Or if someone sets up Linux assuming that a newer swap file will only be
> > used after the older one is full, then this change would break those cases?
>
> Hmm, it could happen, but I doubt people really count on that. I would use
> 'swapon -p xx' to specify explicit priority to make sure it. In the case you
> said, swapped out pages will be swapped in, it's either not guaranteed.
Personally, I also dislike the behavior where a newer swap file
automatically gets a lower priority than an older one. However, since
we have a rule to never break userspace, is this considered such a
case? Or at least, do we need to update the man page as well?
BTW, we can achieve all the benefits of the round-robin “18%
efficiency boost” once users set an explicit priority in userspace for
the four zRAMs you’re using?
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-13 6:17 ` Barry Song
@ 2025-10-13 23:07 ` Baoquan He
2025-10-14 22:11 ` Chris Li
1 sibling, 0 replies; 22+ messages in thread
From: Baoquan He @ 2025-10-13 23:07 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, akpm, chrisl, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On 10/13/25 at 02:17pm, Barry Song wrote:
> On Mon, Oct 13, 2025 at 11:58 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 10/13/25 at 04:40am, Barry Song wrote:
> > > On Sun, Oct 12, 2025 at 5:14 AM Baoquan He <bhe@redhat.com> wrote:
> > > >
> > > > Swap devices are assumed to have similar accessing speed if no priority
> > > > is specified when swapon. It's unfair and doesn't make sense just because
> > > > one swap device is swapped on firstly, its priority will be higher than
> > > > the one swapped on later.
> > > >
> > > > Here, set all swap devicess to have priority '-1' by default. With this
> > > > change, swap device with default priority will be selected round robin
> > > > when swapping out. This can improve the swapping efficiency a lot among
> > > > multiple swap devices with default priority.
> > > >
> > > > Below are swapon output during processes high pressure vm-scability test
> > > > is being taken:
> > > >
> > > > 1) This is pre-commit a2468cc9bfdf, swap device is selectd one by one by
> > > > priority from high to low when one swap device is exhausted:
> > > > ------------------------------------
> > > > [root@hp-dl385g10-03 ~]# swapon
> > > > NAME TYPE SIZE USED PRIO
> > > > /dev/zram0 partition 16G 16G -1
> > > > /dev/zram1 partition 16G 966.2M -2
> > > > /dev/zram2 partition 16G 0B -3
> > > > /dev/zram3 partition 16G 0B -4
> > > >
> > > > 2) This is behaviour with commit a2468cc9bfdf, on node, swap device
> > > > sharing the same node id is selected firstly until exhausted; while
> > > > on node no swap device sharing the node id it selects the one with
> > > > highest priority until exhaustd:
> > > > ------------------------------------
> > > > [root@hp-dl385g10-03 ~]# swapon
> > > > NAME TYPE SIZE USED PRIO
> > > > /dev/zram0 partition 16G 15.7G -2
> > > > /dev/zram1 partition 16G 3.4G -3
> > > > /dev/zram2 partition 16G 3.4G -4
> > > > /dev/zram3 partition 16G 2.6G -5
> > > >
> > > > 3) After this patch applied, swap devices with default priority are selectd
> > > > round robin:
> > > > ------------------------------------
> > > > [root@hp-dl385g10-03 block]# swapon
> > > > NAME TYPE SIZE USED PRIO
> > > > /dev/zram0 partition 16G 6.6G -1
> > > > /dev/zram1 partition 16G 6.6G -1
> > > > /dev/zram2 partition 16G 6.6G -1
> > > > /dev/zram3 partition 16G 6.6G -1
> > > >
> > > > With the change, we can see about 18% efficiency promotion relative to
> > > > node based way as below. (Surely, the pre-commit a2468cc9bfdf way is
> > > > the worst.)
> > > >
> >
> > Thanks a lot for reviewing, Barry.
> >
> > >
> > > I’m not against the behavior change; but the swapon man page says:
> > > "
> > > Each swap area has a priority, either high or low. The default
> > > priority is low. Within the low-priority areas, newer areas are
> > > even lower priority than older areas.
> >
> > I didn't see this in man 8 page of swapon, while see it in man 2 page.
> > Means people may feel that change when they call the call swapon()
> > syscall, but people may not cares about in script or something like that?
> >
> > > "
> > > So my question is whether users still assume that newly added swap areas
> > > get a lower priority than the older ones?
> > >
> > > I assume the priority decrement isn’t a stable ABI, so this change won’t
> > > break userspace?
> >
> > Hmm, I would say that this will change the assumption, BUT I don't start
> > it. That assumption has been broken since the numa based swap device
> > choosing at below commit:
> >
> > commit a2468cc9bfdf ("swap: choose swap device according to numa node").
> >
> > Before commit a2468cc9bfdf, swapon behaviour is taken strictly as the
> > man page states. The earlier the swap device is added, the higher its
> > default priority is. And the highest priority device is used up, then
> > the 2nd highest priority swap device, and so on in sequence. Below
> > swapon output demonstrate.
> > ===============================
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 16G -1
> > /dev/zram1 partition 16G 966.2M -2
> > /dev/zram2 partition 16G 0B -3
> > /dev/zram3 partition 16G 0B -4
> >
> > However, after commit a2468cc9bfdf applied, above behaviour had been
> > changed. I can give an extreme example, imagine on a system with one
> > NUMA Node, node_id is 0. Then I swapon several swap devices w/o node_id
> > value (namely node_id is -1), at last I swapon one device with node_id
> > 0. You can see the last one will have the highest priority to be chosen,
> > then other swap devices.
>
> I assume this adds logic to prefer swapping to the closer swapfile first,
> while still maintaining the old behavior for non-NUMA cases.
But it still change the traditional behaviour, right?
The old man 2 page of swapon obviously doesn't state the prefer swapping
to the closer swapfile first on NUMA, while still maintain the old
behaviour for non-NUMA cases.
===
Each swap area has a priority, either high or low. The default
priority is low. Within the low-priority areas, newer areas are
even lower priority than older areas.
===
>
> >
> > So I would argue that if people realy care about the default priority,
> > it has been broken since 2017 when commit a2468cc9bfdf was introduce,
> > and complaint would be heard since long before. While we didn't hear
> > complaint, means the default priority doesn't really matter?
> > >
> > > Or if someone sets up Linux assuming that a newer swap file will only be
> > > used after the older one is full, then this change would break those cases?
> >
> > Hmm, it could happen, but I doubt people really count on that. I would use
> > 'swapon -p xx' to specify explicit priority to make sure it. In the case you
> > said, swapped out pages will be swapped in, it's either not guaranteed.
>
> Personally, I also dislike the behavior where a newer swap file
> automatically gets a lower priority than an older one. However, since
> we have a rule to never break userspace, is this considered such a
> case? Or at least, do we need to update the man page as well?
As discussed above, the rule on swapon had been broken. Of course, I can
update the man 2 page of swapon. There's no change to man 8 page of
swapon, because it's not mentioning the default priority thing.
>
> BTW, we can achieve all the benefits of the round-robin “18%
> efficiency boost” once users set an explicit priority in userspace for
> the four zRAMs you’re using?
Not sure if I got you correctly. The 18% boost is only related to
default priority. If user sets explicit priority via 'swapon -p xx',
nothing changed.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-11 22:04 ` Andrew Morton
2025-10-12 2:08 ` Baoquan He
@ 2025-10-14 11:56 ` Baoquan He
1 sibling, 0 replies; 22+ messages in thread
From: Baoquan He @ 2025-10-14 11:56 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel test robot, linux-mm, oe-kbuild-all, chrisl, kasong,
youngjun.park, baohua, shikemeng, nphamcs
On 10/11/25 at 03:04pm, Andrew Morton wrote:
> On Sun, 12 Oct 2025 04:45:07 +0800 kernel test robot <lkp@intel.com> wrote:
>
> > Hi Baoquan,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on akpm-mm/mm-everything]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Baoquan-He/mm-swap-do-not-choose-swap-device-according-to-numa-node/20251011-161743
> > base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > patch link: https://lore.kernel.org/r/20251011081624.224202-2-bhe%40redhat.com
> > patch subject: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
> > config: i386-buildonly-randconfig-003-20251012 (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/config)
> > compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251012/202510120456.vTSygUXV-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202510120456.vTSygUXV-lkp@intel.com/
> >
> > All error/warnings (new ones prefixed by >>):
> >
> > In file included from include/linux/list.h:5,
> > from include/linux/wait.h:7,
> > from include/linux/wait_bit.h:8,
> > from include/linux/fs.h:7,
> > from include/linux/highmem.h:5,
> > from include/linux/bvec.h:10,
> > from include/linux/blk_types.h:10,
> > from include/linux/blkdev.h:9,
> > from mm/swapfile.c:9:
>
> Thanks. I did this. Baoquan, please review and retest?
I finally got a system, retest this patchset + the appended fix, all
works well. Sorry for my mistake, and thanks a lot for the fix.
>
>
> --- a/mm/swapfile.c~mm-swap-do-not-choose-swap-device-according-to-numa-node-fix
> +++ a/mm/swapfile.c
> @@ -1388,11 +1388,10 @@ start_over:
> static bool swap_sync_discard(void)
> {
> bool ret = false;
> - int nid = numa_node_id();
> struct swap_info_struct *si, *next;
>
> spin_lock(&swap_avail_lock);
> - plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
> + plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
> spin_unlock(&swap_avail_lock);
> if (get_swap_device_info(si)) {
> if (si->flags & SWP_PAGE_DISCARD)
> @@ -2890,7 +2889,6 @@ SYSCALL_DEFINE1(swapoff, const char __us
> del_from_avail_list(p, true);
> if (p->prio < 0) {
> struct swap_info_struct *si = p;
> - int nid;
>
> plist_for_each_entry_continue(si, &swap_active_head, list) {
> si->prio++;
> _
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-13 6:09 ` Barry Song
@ 2025-10-14 21:50 ` Chris Li
2025-10-15 3:06 ` Baoquan He
1 sibling, 0 replies; 22+ messages in thread
From: Chris Li @ 2025-10-14 21:50 UTC (permalink / raw)
To: Barry Song
Cc: Baoquan He, linux-mm, akpm, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Sun, Oct 12, 2025 at 11:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > -static int swap_node(struct swap_info_struct *si)
> > -{
> > - struct block_device *bdev;
> > -
> > - if (si->bdev)
> > - bdev = si->bdev;
> > - else
> > - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > -
> > - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > -}
> > -
>
> Looking at the code, it seems to have some hardware affinity awareness,
> as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> each node has a closer block device?
Hi Barry,
I don't believe there is a valid usage case of the hardware affinity
block device driver using node_id inside the swapfile usage case in
the current upstream kernel. If you know any, please point it out and
I am very interested in knowing. As such, it is basically a moot point
to have the default swapfile using those negative priority by default.
I am happy to be proven wrong, anyone who wants to make that claim
that the node_id swapfile usage case is doing something useful in the
current upstream kernel please provide a repeatable test case to
justify it. I am curious to learn about it. That is why Aaron is CC on
the list and his email did not bounce for me.
If the hardware exists, just the hardware doesn't have an upstream
kernel driver. There are some vendor specific drivers maintaining
those node_id awarded swap devices in the vendor specific kernel.
Then maybe the node_id swapfile logic can move to those vendor
specific kernels as well. The upstream kernel shouldn't be burdened
with this extra complexity on the node_id in swap core without a valid
upstream driver to test and validate its usage case.
It is part of the spring cleaning effort of the swap core. It happens
with the other aspect of the swap code cleaning effort right now.
Chris
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-12 20:40 ` Barry Song
2025-10-13 3:58 ` Baoquan He
@ 2025-10-14 22:01 ` Chris Li
1 sibling, 0 replies; 22+ messages in thread
From: Chris Li @ 2025-10-14 22:01 UTC (permalink / raw)
To: Barry Song
Cc: Baoquan He, linux-mm, akpm, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Sun, Oct 12, 2025 at 1:41 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Oct 12, 2025 at 5:14 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > Swap devices are assumed to have similar accessing speed if no priority
> > is specified when swapon. It's unfair and doesn't make sense just because
> > one swap device is swapped on firstly, its priority will be higher than
> > the one swapped on later.
> >
> > Here, set all swap devicess to have priority '-1' by default. With this
> > change, swap device with default priority will be selected round robin
> > when swapping out. This can improve the swapping efficiency a lot among
> > multiple swap devices with default priority.
> >
> > Below are swapon output during processes high pressure vm-scability test
> > is being taken:
> >
> > 1) This is pre-commit a2468cc9bfdf, swap device is selectd one by one by
> > priority from high to low when one swap device is exhausted:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 16G -1
> > /dev/zram1 partition 16G 966.2M -2
> > /dev/zram2 partition 16G 0B -3
> > /dev/zram3 partition 16G 0B -4
> >
> > 2) This is behaviour with commit a2468cc9bfdf, on node, swap device
> > sharing the same node id is selected firstly until exhausted; while
> > on node no swap device sharing the node id it selects the one with
> > highest priority until exhaustd:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 15.7G -2
> > /dev/zram1 partition 16G 3.4G -3
> > /dev/zram2 partition 16G 3.4G -4
> > /dev/zram3 partition 16G 2.6G -5
> >
> > 3) After this patch applied, swap devices with default priority are selectd
> > round robin:
> > ------------------------------------
> > [root@hp-dl385g10-03 block]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 6.6G -1
> > /dev/zram1 partition 16G 6.6G -1
> > /dev/zram2 partition 16G 6.6G -1
> > /dev/zram3 partition 16G 6.6G -1
> >
> > With the change, we can see about 18% efficiency promotion relative to
> > node based way as below. (Surely, the pre-commit a2468cc9bfdf way is
> > the worst.)
> >
>
> I’m not against the behavior change; but the swapon man page says:
> "
> Each swap area has a priority, either high or low. The default
> priority is low. Within the low-priority areas, newer areas are
> even lower priority than older areas.
> "
> So my question is whether users still assume that newly added swap areas
> get a lower priority than the older ones?
That is a good catch, if the per node_id swapfile logic reverted, the
man page should be updated to match the kernel behavior as well.
It is a good place to describe the default round robin behavior.
> I assume the priority decrement isn’t a stable ABI, so this change won’t
> break userspace?
There is no ABI change as far as I can tell. The swapon has an option
to specify the priority. The default swap_on does not specify the
priority. It is a kernel internal tuning how we arrange the default
swapfile for the better performance by default. If the user don't
happy with that arrangement, they can always specify a priority with
the existing ABI, there is no ABI change.
> Or if someone sets up Linux assuming that a newer swap file will only be
> used after the older one is full, then this change would break those cases?
The existing kernel implementation always fills up the high priority
swapfile before the low priority one, which hasn't changed.
The negative node_id has been removed/reverted, that is a behavior
change yet. But I fail to see how it breaks the user. If you have a
test case that breaks the user, please specify.
Chris
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-13 6:17 ` Barry Song
2025-10-13 23:07 ` Baoquan He
@ 2025-10-14 22:11 ` Chris Li
2025-10-15 4:29 ` Barry Song
1 sibling, 1 reply; 22+ messages in thread
From: Chris Li @ 2025-10-14 22:11 UTC (permalink / raw)
To: Barry Song
Cc: Baoquan He, linux-mm, akpm, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Sun, Oct 12, 2025 at 11:17 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > However, after commit a2468cc9bfdf applied, above behaviour had been
> > changed. I can give an extreme example, imagine on a system with one
> > NUMA Node, node_id is 0. Then I swapon several swap devices w/o node_id
> > value (namely node_id is -1), at last I swapon one device with node_id
> > 0. You can see the last one will have the highest priority to be chosen,
> > then other swap devices.
>
> I assume this adds logic to prefer swapping to the closer swapfile first,
> while still maintaining the old behavior for non-NUMA cases.
That commit a2468cc9bfdf changes the default behavior of the swapon
which does not specify a priority.
If you claim the revert breaks the user behavior, that commit also
breaks the user behavior as well.
> > So I would argue that if people realy care about the default priority,
> > it has been broken since 2017 when commit a2468cc9bfdf was introduce,
> > and complaint would be heard since long before. While we didn't hear
> > complaint, means the default priority doesn't really matter?
> > >
> > > Or if someone sets up Linux assuming that a newer swap file will only be
> > > used after the older one is full, then this change would break those cases?
> >
> > Hmm, it could happen, but I doubt people really count on that. I would use
> > 'swapon -p xx' to specify explicit priority to make sure it. In the case you
> > said, swapped out pages will be swapped in, it's either not guaranteed.
>
> Personally, I also dislike the behavior where a newer swap file
> automatically gets a lower priority than an older one. However, since
> we have a rule to never break userspace, is this considered such a
> case? Or at least, do we need to update the man page as well?
See above, we should update the man page as well.
If the a2468cc9bfdf can break user default swapon behavior by
introducing node_id in the first place. It is totally justifiable to
break it again to revert it. I fail to see the logic why the breaking
rule only applies to the revert but not the commit introducing it in
the first place.
>
> BTW, we can achieve all the benefits of the round-robin “18%
> efficiency boost” once users set an explicit priority in userspace for
> the four zRAMs you’re using?
Possible, it also depends on your setup. The zRAM experiment is a hack
to simulate the node_id behavior. The reason for that hack is that the
kernel does not have a node_id block device driver can use for the
node_id swapfile testing.
Back to the swapfile si->lock. Yes, that is a contented lock. It will
benefit the user if split the swapfile into a dozen of them instead of
just one. That is even with all the swap allocator optimizations to
reduce the si->lock contention.
Chris
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-13 6:09 ` Barry Song
2025-10-14 21:50 ` Chris Li
@ 2025-10-15 3:06 ` Baoquan He
2025-10-15 5:02 ` Barry Song
1 sibling, 1 reply; 22+ messages in thread
From: Baoquan He @ 2025-10-15 3:06 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, akpm, chrisl, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On 10/13/25 at 02:09pm, Barry Song wrote:
> > -static int swap_node(struct swap_info_struct *si)
> > -{
> > - struct block_device *bdev;
> > -
> > - if (si->bdev)
> > - bdev = si->bdev;
> > - else
> > - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > -
> > - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > -}
> > -
>
> Looking at the code, it seems to have some hardware affinity awareness,
> as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> each node has a closer block device?
I had talked about this with Chris before I posted v1. We don't need to
worry about this because:
1) Kernel code rarely set disk->node_id, all disks just assign
NUMA_NO_NODE to it except of these:
drivers/nvdimm/pmem.c <<pmem_attach_disk>>
drivers/md/dm.c <<alloc_dev>>
For intel ssd Aaron introduced the node based si choosing is for, it
should be Optane which has been discontinued. It could be wrong, then
hope intel can help test so that we can see what impact is brought in.
2) The gap between disk io and memory accessing
Usually memory accessing is nanosecond level, while disk io is
microsecond level, HDD even could be at millisecond. The node affinity
saving nanoseconds is negligible compared to the disk's own acessing
speed. This includes pmem, its io is more than ten times or even more
than memory accessing.
If there's a real system which owns disks belonging to NUMA nodes, we
can test to see if the new round robin way is better or worse then the
node based way.
Thanks
Baoquan
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-14 22:11 ` Chris Li
@ 2025-10-15 4:29 ` Barry Song
2025-10-15 6:24 ` Chris Li
0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2025-10-15 4:29 UTC (permalink / raw)
To: Chris Li
Cc: Baoquan He, linux-mm, akpm, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Wed, Oct 15, 2025 at 6:49 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Sun, Oct 12, 2025 at 11:17 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > However, after commit a2468cc9bfdf applied, above behaviour had been
> > > changed. I can give an extreme example, imagine on a system with one
> > > NUMA Node, node_id is 0. Then I swapon several swap devices w/o node_id
> > > value (namely node_id is -1), at last I swapon one device with node_id
> > > 0. You can see the last one will have the highest priority to be chosen,
> > > then other swap devices.
> >
> > I assume this adds logic to prefer swapping to the closer swapfile first,
> > while still maintaining the old behavior for non-NUMA cases.
>
> That commit a2468cc9bfdf changes the default behavior of the swapon
> which does not specify a priority.
> If you claim the revert breaks the user behavior, that commit also
> breaks the user behavior as well.
>
> > > So I would argue that if people realy care about the default priority,
> > > it has been broken since 2017 when commit a2468cc9bfdf was introduce,
> > > and complaint would be heard since long before. While we didn't hear
> > > complaint, means the default priority doesn't really matter?
> > > >
> > > > Or if someone sets up Linux assuming that a newer swap file will only be
> > > > used after the older one is full, then this change would break those cases?
> > >
> > > Hmm, it could happen, but I doubt people really count on that. I would use
> > > 'swapon -p xx' to specify explicit priority to make sure it. In the case you
> > > said, swapped out pages will be swapped in, it's either not guaranteed.
> >
> > Personally, I also dislike the behavior where a newer swap file
> > automatically gets a lower priority than an older one. However, since
> > we have a rule to never break userspace, is this considered such a
> > case? Or at least, do we need to update the man page as well?
>
> See above, we should update the man page as well.
>
> If the a2468cc9bfdf can break user default swapon behavior by
> introducing node_id in the first place. It is totally justifiable to
> break it again to revert it. I fail to see the logic why the breaking
> rule only applies to the revert but not the commit introducing it in
> the first place.
I’m actually referring to the part below, which changes the default priority
to -1 for all swapfiles while it used to be -n, -n -1, -n -2 , -n -3....
- if (prio >= 0)
- si->prio = prio;
- else
- si->prio = --least_priority;
I agree the NUMA-ID swap patch (a2468cc9bfdf) has partially broken the man
page, but only for very rare hardware whose block device has a NUMA node ID
(yes, the man page should have been updated back then). Now we’re breaking
the man page for all hardware, so I think it’s even more important to
highlight that in the man page at this point.
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-15 3:06 ` Baoquan He
@ 2025-10-15 5:02 ` Barry Song
2025-10-15 6:23 ` Chris Li
0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2025-10-15 5:02 UTC (permalink / raw)
To: Baoquan He
Cc: linux-mm, akpm, chrisl, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Wed, Oct 15, 2025 at 11:06 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 10/13/25 at 02:09pm, Barry Song wrote:
> > > -static int swap_node(struct swap_info_struct *si)
> > > -{
> > > - struct block_device *bdev;
> > > -
> > > - if (si->bdev)
> > > - bdev = si->bdev;
> > > - else
> > > - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > > -
> > > - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > > -}
> > > -
> >
> > Looking at the code, it seems to have some hardware affinity awareness,
> > as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> > each node has a closer block device?
>
> I had talked about this with Chris before I posted v1. We don't need to
> worry about this because:
>
> 1) Kernel code rarely set disk->node_id, all disks just assign
> NUMA_NO_NODE to it except of these:
>
> drivers/nvdimm/pmem.c <<pmem_attach_disk>>
> drivers/md/dm.c <<alloc_dev>>
>
> For intel ssd Aaron introduced the node based si choosing is for, it
> should be Optane which has been discontinued. It could be wrong, then
> hope intel can help test so that we can see what impact is brought in.
>
> 2) The gap between disk io and memory accessing
> Usually memory accessing is nanosecond level, while disk io is
> microsecond level, HDD even could be at millisecond. The node affinity
> saving nanoseconds is negligible compared to the disk's own acessing
> speed. This includes pmem, its io is more than ten times or even more
> than memory accessing.
I agree that it’s fine to remove the code if the related hardware is obsolete.
I found a paper [1] showing that accessing local Optane PMEM is much faster
than accessing remote Optane PMEM (see slides 4 and 5). That might explain why
they started the project to make swapfile NUMA-aware.
My point is that we should at least mention this in the changelog to
honor their past contributions. But since the hardware is no longer used,
we can remove the code to reduce complexity and stop maintaining it.
I see Aaron's email is no longer reachable, which is probably why we haven’t
received any feedback from them.
[1] https://www.usenix.org/system/files/osdi21_slides_wang-qing.pdf
>
> If there's a real system which owns disks belonging to NUMA nodes, we
> can test to see if the new round robin way is better or worse then the
> node based way.
Yep. If there might be a real user in the future, we can revisit this.
For now, I agree that we can drop the complexity.
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-15 5:02 ` Barry Song
@ 2025-10-15 6:23 ` Chris Li
2025-10-15 8:09 ` Barry Song
0 siblings, 1 reply; 22+ messages in thread
From: Chris Li @ 2025-10-15 6:23 UTC (permalink / raw)
To: Barry Song
Cc: Baoquan He, linux-mm, akpm, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Tue, Oct 14, 2025 at 10:02 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 11:06 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 10/13/25 at 02:09pm, Barry Song wrote:
> > > > -static int swap_node(struct swap_info_struct *si)
> > > > -{
> > > > - struct block_device *bdev;
> > > > -
> > > > - if (si->bdev)
> > > > - bdev = si->bdev;
> > > > - else
> > > > - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > > > -
> > > > - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > > > -}
> > > > -
> > >
> > > Looking at the code, it seems to have some hardware affinity awareness,
> > > as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> > > each node has a closer block device?
> >
> > I had talked about this with Chris before I posted v1. We don't need to
> > worry about this because:
> >
> > 1) Kernel code rarely set disk->node_id, all disks just assign
> > NUMA_NO_NODE to it except of these:
> >
> > drivers/nvdimm/pmem.c <<pmem_attach_disk>>
> > drivers/md/dm.c <<alloc_dev>>
> >
> > For intel ssd Aaron introduced the node based si choosing is for, it
> > should be Optane which has been discontinued. It could be wrong, then
> > hope intel can help test so that we can see what impact is brought in.
> >
> > 2) The gap between disk io and memory accessing
> > Usually memory accessing is nanosecond level, while disk io is
> > microsecond level, HDD even could be at millisecond. The node affinity
> > saving nanoseconds is negligible compared to the disk's own acessing
> > speed. This includes pmem, its io is more than ten times or even more
> > than memory accessing.
>
> I agree that it’s fine to remove the code if the related hardware is obsolete.
> I found a paper [1] showing that accessing local Optane PMEM is much faster
> than accessing remote Optane PMEM (see slides 4 and 5). That might explain why
> they started the project to make swapfile NUMA-aware.
Are you suggesting the swapfiel is used for PMEM devices? It sounds
very strange to back swapfile with PMEM. I am under the impression
that the original a2468cc9bfdf commit is introduced with the intel SSD
as a testing swapfile device. I just looked it up. Here is what I find
out in the commit log:
======= quote ========
To see the effect of the patch, a test that starts N process, each mmap
a region of anonymous memory and then continually write to it at random
position to trigger both swap in and out is used.
On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices with each attached to a different node, the
result is:
======= end quote =====
> My point is that we should at least mention this in the changelog to
> honor their past contributions. But since the hardware is no longer used,
> we can remove the code to reduce complexity and stop maintaining it.
Optane was not even supported in Skylake. Commit a2468cc9bfdf has
nothing to do with Optane. The Op]tane talk in a2468cc9bfdf is just a
red herring. I fail to see why reverting a2468cc9bfdf needs to mention
Optane is obsolete.
> I see Aaron's email is no longer reachable, which is probably why we haven’t
> received any feedback from them.
>
> [1] https://www.usenix.org/system/files/osdi21_slides_wang-qing.pdf
>
> >
> > If there's a real system which owns disks belonging to NUMA nodes, we
> > can test to see if the new round robin way is better or worse then the
> > node based way.
>
> Yep. If there might be a real user in the future, we can revisit this.
> For now, I agree that we can drop the complexity.
Thank you for the alignment.
Chris
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
2025-10-15 4:29 ` Barry Song
@ 2025-10-15 6:24 ` Chris Li
0 siblings, 0 replies; 22+ messages in thread
From: Chris Li @ 2025-10-15 6:24 UTC (permalink / raw)
To: Barry Song
Cc: Baoquan He, linux-mm, akpm, kasong, youngjun.park, aaron.lu,
shikemeng, nphamcs
On Tue, Oct 14, 2025 at 9:29 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 6:49 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Sun, Oct 12, 2025 at 11:17 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > However, after commit a2468cc9bfdf applied, above behaviour had been
> > > > changed. I can give an extreme example, imagine on a system with one
> > > > NUMA Node, node_id is 0. Then I swapon several swap devices w/o node_id
> > > > value (namely node_id is -1), at last I swapon one device with node_id
> > > > 0. You can see the last one will have the highest priority to be chosen,
> > > > then other swap devices.
> > >
> > > I assume this adds logic to prefer swapping to the closer swapfile first,
> > > while still maintaining the old behavior for non-NUMA cases.
> >
> > That commit a2468cc9bfdf changes the default behavior of the swapon
> > which does not specify a priority.
> > If you claim the revert breaks the user behavior, that commit also
> > breaks the user behavior as well.
> >
> > > > So I would argue that if people realy care about the default priority,
> > > > it has been broken since 2017 when commit a2468cc9bfdf was introduce,
> > > > and complaint would be heard since long before. While we didn't hear
> > > > complaint, means the default priority doesn't really matter?
> > > > >
> > > > > Or if someone sets up Linux assuming that a newer swap file will only be
> > > > > used after the older one is full, then this change would break those cases?
> > > >
> > > > Hmm, it could happen, but I doubt people really count on that. I would use
> > > > 'swapon -p xx' to specify explicit priority to make sure it. In the case you
> > > > said, swapped out pages will be swapped in, it's either not guaranteed.
> > >
> > > Personally, I also dislike the behavior where a newer swap file
> > > automatically gets a lower priority than an older one. However, since
> > > we have a rule to never break userspace, is this considered such a
> > > case? Or at least, do we need to update the man page as well?
> >
> > See above, we should update the man page as well.
> >
> > If the a2468cc9bfdf can break user default swapon behavior by
> > introducing node_id in the first place. It is totally justifiable to
> > break it again to revert it. I fail to see the logic why the breaking
> > rule only applies to the revert but not the commit introducing it in
> > the first place.
>
> I’m actually referring to the part below, which changes the default priority
> to -1 for all swapfiles while it used to be -n, -n -1, -n -2 , -n -3....
>
> - if (prio >= 0)
> - si->prio = prio;
> - else
> - si->prio = --least_priority;
>
> I agree the NUMA-ID swap patch (a2468cc9bfdf) has partially broken the man
> page, but only for very rare hardware whose block device has a NUMA node ID
> (yes, the man page should have been updated back then). Now we’re breaking
> the man page for all hardware, so I think it’s even more important to
> highlight that in the man page at this point.
Right, I think we are in agreement that the man page needs to be
updated as well. Thanks for catching that.
Chris
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-15 6:23 ` Chris Li
@ 2025-10-15 8:09 ` Barry Song
2025-10-15 13:27 ` Chris Li
0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2025-10-15 8:09 UTC (permalink / raw)
To: chrisl
Cc: 21cnbao, aaron.lu, akpm, bhe, kasong, linux-mm, nphamcs,
shikemeng, youngjun.park
>
> >
> > On Wed, Oct 15, 2025 at 11:06 AM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > On 10/13/25 at 02:09pm, Barry Song wrote:
> > > > > -static int swap_node(struct swap_info_struct *si)
> > > > > -{
> > > > > - struct block_device *bdev;
> > > > > -
> > > > > - if (si->bdev)
> > > > > - bdev = si->bdev;
> > > > > - else
> > > > > - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > > > > -
> > > > > - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > > > > -}
> > > > > -
> > > >
> > > > Looking at the code, it seems to have some hardware affinity awareness,
> > > > as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> > > > each node has a closer block device?
> > >
> > > I had talked about this with Chris before I posted v1. We don't need to
> > > worry about this because:
> > >
> > > 1) Kernel code rarely set disk->node_id, all disks just assign
> > > NUMA_NO_NODE to it except of these:
> > >
> > > drivers/nvdimm/pmem.c <<pmem_attach_disk>>
> > > drivers/md/dm.c <<alloc_dev>>
> > >
> > > For intel ssd Aaron introduced the node based si choosing is for, it
> > > should be Optane which has been discontinued. It could be wrong, then
> > > hope intel can help test so that we can see what impact is brought in.
> > >
> > > 2) The gap between disk io and memory accessing
> > > Usually memory accessing is nanosecond level, while disk io is
> > > microsecond level, HDD even could be at millisecond. The node affinity
> > > saving nanoseconds is negligible compared to the disk's own acessing
> > > speed. This includes pmem, its io is more than ten times or even more
> > > than memory accessing.
> >
> > I agree that it’s fine to remove the code if the related hardware is obsolete.
> > I found a paper [1] showing that accessing local Optane PMEM is much faster
> > than accessing remote Optane PMEM (see slides 4 and 5). That might explain why
> > they started the project to make swapfile NUMA-aware.
>
> Are you suggesting the swapfiel is used for PMEM devices? It sounds
> very strange to back swapfile with PMEM. I am under the impression
> that the original a2468cc9bfdf commit is introduced with the intel SSD
> as a testing swapfile device. I just looked it up. Here is what I find
> out in the commit log:
>
> ======= quote ========
> To see the effect of the patch, a test that starts N process, each mmap
> a region of anonymous memory and then continually write to it at random
> position to trigger both swap in and out is used.
>
> On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
> are used as swap devices with each attached to a different node, the
> result is:
> ======= end quote =====
>
> > My point is that we should at least mention this in the changelog to
> > honor their past contributions. But since the hardware is no longer used,
> > we can remove the code to reduce complexity and stop maintaining it.
>
> Optane was not even supported in Skylake. Commit a2468cc9bfdf has
> nothing to do with Optane. The Op]tane talk in a2468cc9bfdf is just a
> red herring. I fail to see why reverting a2468cc9bfdf needs to mention
> Optane is obsolete.
Thanks for the clarification. The Optane discussion turned out to be a goof :-)
Just for the record, this paper [1] also mentions that accessing remote SSDs
can significantly decrease performance. However, it is rare to find any NUMA
machine using SSDs directly as swap files without a RAM compression frontend,
so I don’t think the performance penalty of remote access would be a problem
when choosing a direct swapfile.
[1] https://shbakram.github.io/assets/papers/akram-caos12.pdf
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
2025-10-15 8:09 ` Barry Song
@ 2025-10-15 13:27 ` Chris Li
0 siblings, 0 replies; 22+ messages in thread
From: Chris Li @ 2025-10-15 13:27 UTC (permalink / raw)
To: Barry Song
Cc: aaron.lu, akpm, bhe, kasong, linux-mm, nphamcs, shikemeng, youngjun.park
On Wed, Oct 15, 2025 at 1:09 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > Optane was not even supported in Skylake. Commit a2468cc9bfdf has
> > nothing to do with Optane. The Op]tane talk in a2468cc9bfdf is just a
> > red herring. I fail to see why reverting a2468cc9bfdf needs to mention
> > Optane is obsolete.
>
> Thanks for the clarification. The Optane discussion turned out to be a goof :-)
>
> Just for the record, this paper [1] also mentions that accessing remote SSDs
> can significantly decrease performance. However, it is rare to find any NUMA
Ack. I am not saying NUMA is aware SSD does not matter. I am sure it
does. The question is does the benefit in real world usage cases
justify the extra complexity.
> machine using SSDs directly as swap files without a RAM compression frontend,
> so I don’t think the performance penalty of remote access would be a problem
> when choosing a direct swapfile.
The per NUMA aware SSD swap setup is so hard to come by, very few
users are actually benefiting from it in real life. You need to have
each NUMA node a SSD attached to that node. Also each SSD on the node
needs to set up the swap partition, preferably the same size. If you
have only one SSD setup the swap partition, it does not matter if it
is local or remote, that is the only SSD swap can write to anyway.
Both Android and data center usage I am aware of do not satisfy this
per NUMA node swapfile requirement. If anyone has a counterexample, I
would like to hear about it. My conguesture is that this setup is so
rare, it does not justify the complexity it adds to the swap core.
Especially considering it has conflict with the swap.tiers down the
road.
Agree that nowadays the zswap or zram is the more common swap usage
case by far, the SSD swap is just an overflow secondary swap. The
local vs remote SSD speed difference there matters much less.
Chris
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-10-15 13:28 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-11 8:16 [PATCH v4 mm-new 0/2] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
2025-10-11 8:16 ` [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node Baoquan He
2025-10-11 20:45 ` kernel test robot
2025-10-11 22:04 ` Andrew Morton
2025-10-12 2:08 ` Baoquan He
2025-10-14 11:56 ` Baoquan He
2025-10-13 6:09 ` Barry Song
2025-10-14 21:50 ` Chris Li
2025-10-15 3:06 ` Baoquan He
2025-10-15 5:02 ` Barry Song
2025-10-15 6:23 ` Chris Li
2025-10-15 8:09 ` Barry Song
2025-10-15 13:27 ` Chris Li
2025-10-11 8:16 ` [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin Baoquan He
2025-10-12 20:40 ` Barry Song
2025-10-13 3:58 ` Baoquan He
2025-10-13 6:17 ` Barry Song
2025-10-13 23:07 ` Baoquan He
2025-10-14 22:11 ` Chris Li
2025-10-15 4:29 ` Barry Song
2025-10-15 6:24 ` Chris Li
2025-10-14 22:01 ` Chris Li
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox