From: Baoquan He <bhe@redhat.com>
To: Barry Song <21cnbao@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, chrisl@kernel.org,
kasong@tencent.com, youngjun.park@lge.com, aaron.lu@intel.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com
Subject: Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
Date: Mon, 13 Oct 2025 11:58:18 +0800
Message-ID: <aOx42iLkuyYqXBMW@MiWiFi-R3L-srv>
In-Reply-To: <CAGsJ_4y4CLu7qeHijhJtL+NDrehfiWpu9mtsVGxmn5rBy03v0w@mail.gmail.com>
On 10/13/25 at 04:40am, Barry Song wrote:
> On Sun, Oct 12, 2025 at 5:14 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > Swap devices are assumed to have similar accessing speed if no priority
> > is specified when swapon. It's unfair and doesn't make sense just because
> > one swap device is swapped on firstly, its priority will be higher than
> > the one swapped on later.
> >
> > Here, set all swap devices to have priority '-1' by default. With this
> > change, swap device with default priority will be selected round robin
> > when swapping out. This can improve the swapping efficiency a lot among
> > multiple swap devices with default priority.
> >
> > Below is the swapon output while a high pressure vm-scalability test
> > is running:
> >
> > 1) This is pre-commit a2468cc9bfdf, swap device is selected one by one by
> > priority from high to low when one swap device is exhausted:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 16G -1
> > /dev/zram1 partition 16G 966.2M -2
> > /dev/zram2 partition 16G 0B -3
> > /dev/zram3 partition 16G 0B -4
> >
> > 2) This is behaviour with commit a2468cc9bfdf, on node, swap device
> > sharing the same node id is selected firstly until exhausted; while
> > on node no swap device sharing the node id it selects the one with
> > highest priority until exhausted:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 15.7G -2
> > /dev/zram1 partition 16G 3.4G -3
> > /dev/zram2 partition 16G 3.4G -4
> > /dev/zram3 partition 16G 2.6G -5
> >
> > 3) After this patch applied, swap devices with default priority are selected
> > round robin:
> > ------------------------------------
> > [root@hp-dl385g10-03 block]# swapon
> > NAME TYPE SIZE USED PRIO
> > /dev/zram0 partition 16G 6.6G -1
> > /dev/zram1 partition 16G 6.6G -1
> > /dev/zram2 partition 16G 6.6G -1
> > /dev/zram3 partition 16G 6.6G -1
> >
> > With the change, we can see about 18% efficiency promotion relative to
> > node based way as below. (Surely, the pre-commit a2468cc9bfdf way is
> > the worst.)
> >
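The round robin selection described in the quoted commit message can be
pictured with a small model. This is only a sketch under my own
assumptions: the function name and the page-at-a-time rotation are
hypothetical, and the kernel does not necessarily rotate on every single
page. It only shows why equal-priority devices end up with equal usage,
as in output 3) above.

```python
from itertools import cycle


def swap_out_round_robin(devices, pages):
    """Spread `pages` over equal-priority devices, one page per turn.

    `devices` maps a device name to its capacity in pages (toy model).
    Returns a dict with the number of pages used on each device.
    """
    used = {name: 0 for name in devices}
    free_total = sum(devices.values())
    remaining = pages
    order = cycle(devices)  # rotate through the devices forever
    while remaining > 0 and free_total > 0:
        name = next(order)
        if used[name] < devices[name]:  # skip devices that are full
            used[name] += 1
            remaining -= 1
            free_total -= 1
    return used


if __name__ == "__main__":
    zrams = {"zram0": 16, "zram1": 16, "zram2": 16, "zram3": 16}
    print(swap_out_round_robin(zrams, 24))
```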
Thanks a lot for reviewing, Barry.
>
> I’m not against the behavior change; but the swapon man page says:
> "
> Each swap area has a priority, either high or low. The default
> priority is low. Within the low-priority areas, newer areas are
> even lower priority than older areas.
I didn't see this in the swapon(8) man page, though I do see it in the
swapon(2) man page. That means people may notice the change when they
call the swapon() syscall directly, but may not care about it in scripts
or the like?
> "
> So my question is whether users still assume that newly added swap areas
> get a lower priority than the older ones?
>
> I assume the priority decrement isn’t a stable ABI, so this change won’t
> break userspace?
Hmm, I would say that this will change the assumption, BUT I didn't start
it. That assumption has been broken since NUMA based swap device
selection was introduced by the commit below:
commit a2468cc9bfdf ("swap: choose swap device according to numa node").
Before commit a2468cc9bfdf, swapon behaved strictly as the man page
states. The earlier a swap device is added, the higher its default
priority. The highest priority device is used up first, then the 2nd
highest priority swap device, and so on in sequence. The swapon output
below demonstrates this.
===============================
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 16G -1
/dev/zram1 partition 16G 966.2M -2
/dev/zram2 partition 16G 0B -3
/dev/zram3 partition 16G 0B -4
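The strict ordering above can be modelled in a few lines. This is a
simplified sketch, not kernel code: the function names are hypothetical,
capacities are in arbitrary units, and default priorities are assumed to
be handed out as -1, -2, ... in swapon order as the output shows.

```python
def assign_default_priorities(names):
    """Give each device a default priority in swapon order: -1, -2, ..."""
    return [(name, -(i + 1)) for i, name in enumerate(names)]


def swap_out_by_priority(devices, capacity, amount):
    """Fill devices strictly from highest to lowest priority.

    `devices` is a list of (name, priority); every device has `capacity`.
    Returns a dict with the amount used on each device.
    """
    used = {}
    remaining = amount
    for name, _prio in sorted(devices, key=lambda d: d[1], reverse=True):
        take = min(capacity, remaining)  # exhaust this device first
        used[name] = take
        remaining -= take
    return used


if __name__ == "__main__":
    devs = assign_default_priorities(["zram0", "zram1", "zram2", "zram3"])
    print(swap_out_by_priority(devs, capacity=16, amount=17))
```

With 17 units swapped out, the first device is full and the second has
just started to fill, which is the same shape as the output above.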
However, after commit a2468cc9bfdf was applied, the above behaviour
changed. I can give an extreme example: imagine a system with one NUMA
node, whose node_id is 0. I swapon several swap devices w/o a node_id
value (namely node_id is -1), and at last I swapon one device with
node_id 0. You can see the last one will have the highest priority to be
chosen, ahead of the other swap devices.
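To make the extreme example concrete, here is a toy model of the
node-based ordering. It is only an illustration of the behaviour
described above, not the kernel implementation; the device names,
priorities and the function name are made up.

```python
def selection_order(devices, nid):
    """Return device names in the order they would be tried on node `nid`.

    `devices` is a list of (name, default_priority, node_id), where a
    node_id of -1 means the device is not tied to any node. A device on
    the same node is tried first; the rest follow by default priority.
    """
    ranked = sorted(devices, key=lambda d: (d[2] != nid, -d[1]))
    return [name for name, _prio, _node in ranked]


if __name__ == "__main__":
    devs = [
        ("swapA", -1, -1),  # added first, no node
        ("swapB", -2, -1),
        ("swapC", -3, -1),
        ("swapD", -4, 0),   # added last, sits on node 0
    ]
    print(selection_order(devs, nid=0))
```

In this model the device added last is tried first on node 0 even though
its default priority is the lowest.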
So I would argue that if people really cared about the default priority,
it has been broken since 2017 when commit a2468cc9bfdf was introduced,
and complaints would have been heard long ago. Since we haven't heard
any complaints, does that mean the default priority doesn't really matter?
>
> Or if someone sets up Linux assuming that a newer swap file will only be
> used after the older one is full, then this change would break those cases?
Hmm, it could happen, but I doubt people really count on that. I would use
'swapon -p xx' to specify an explicit priority to make sure of it. In the
case you described, swapped-out pages will be swapped in again, so it's
not guaranteed either.
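For anyone calling the syscall directly instead of 'swapon -p xx', the
explicit priority is encoded in the swapflags argument as documented in
swapon(2). A minimal sketch of that encoding, with the constant values as
I understand them from <sys/swap.h>; the helper name is hypothetical:

```python
# Values as documented for swapon(2); treat them as assumptions and check
# them against <sys/swap.h> on your system.
SWAP_FLAG_PREFER = 0x8000
SWAP_FLAG_PRIO_MASK = 0x7FFF
SWAP_FLAG_PRIO_SHIFT = 0


def swapon_flags(priority):
    """Build the swapflags value that requests an explicit priority."""
    if not 0 <= priority <= SWAP_FLAG_PRIO_MASK:
        raise ValueError("priority must be in the range 0..32767")
    prio_bits = (priority << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK
    return SWAP_FLAG_PREFER | prio_bits


if __name__ == "__main__":
    print(hex(swapon_flags(10)))
```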
Thanks
Baoquan