From: Kairui Song <ryncsn@gmail.com>
To: Baoquan He <bhe@redhat.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, chrisl@kernel.org,
youngjun.park@lge.com, baohua@kernel.org,
shikemeng@huaweicloud.com, nphamcs@gmail.com
Subject: Re: [PATCH v5 mm-new 0/2] mm/swapfile.c: select swap devices of default priority round robin
Date: Wed, 29 Oct 2025 23:38:10 +0800
Message-ID: <CAMgjq7CkMXwQuyXZWJuiqxHXQ=CWPoFN+aQtioN941Z6To1qFg@mail.gmail.com>
In-Reply-To: <20251028034308.929550-1-bhe@redhat.com>

On Wed, Oct 29, 2025 at 4:30 AM Baoquan He <bhe@redhat.com> wrote:
>
> Currently, on a system with multiple swap devices, swap allocation
> selects a swap device according to priority. The swap device with the
> highest priority is chosen first.
>
> Users can specify a priority from 0 to 32767 when running swapon on a
> swap device; otherwise the system assigns default priorities starting at
> -2 and going downwards. Meanwhile, on a NUMA system, a swap device that
> belongs to a given node is considered first on that node.
>
> In the current code, an array of plist, swap_avail_heads[nid], is used
> to organize swap devices on each NUMA node. For each NUMA node, there
> is a plist organizing all swap devices. The 'prio' value in the plist
> is the negated value of the device's priority because a plist is sorted
> from low to high. A swap device that belongs to a node is promoted to
> the front position on that node; the other swap devices follow in
> order of their default priority.
>
> E.g., I have a system with 8 NUMA nodes, and I set up 4 zram partitions
> as swap devices.
>
> Current behaviour:
> their priorities will be (note that -1 is skipped):
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition  16G   0B   -2
> /dev/zram1 partition  16G   0B   -3
> /dev/zram2 partition  16G   0B   -4
> /dev/zram3 partition  16G   0B   -5
>
> And their positions in the 8 swap_avail_lists[nid] will be:
> swap_avail_lists[0]: /* node 0's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:3     prio:4     prio:5
> swap_avail_lists[1]: /* node 1's available swap device list */
> zram1   -> zram0   -> zram2   -> zram3
> prio:1     prio:2     prio:4     prio:5
> swap_avail_lists[2]: /* node 2's available swap device list */
> zram2   -> zram0   -> zram1   -> zram3
> prio:1     prio:2     prio:3     prio:5
> swap_avail_lists[3]: /* node 3's available swap device list */
> zram3   -> zram0   -> zram1   -> zram2
> prio:1     prio:2     prio:3     prio:4
> swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:2     prio:3     prio:4     prio:5
>
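
The per-node ordering in the listing above can be reproduced with a small
userspace model. This is only a sketch in Python, not kernel code:
node_avail_list() and the (name, priority, node) tuples are hypothetical
names made up for illustration. It assumes, as the listing suggests, that a
device with a default (negative) priority gets plist key 1 on its own node,
and that every other key is simply the negated priority.

```python
# Userspace model only; none of these names are kernel APIs.

# (name, priority, node id of the device), in swapon order.
# The node assignment zramN -> node N is taken from the example above.
DEVICES = [
    ("zram0", -2, 0),
    ("zram1", -3, 1),
    ("zram2", -4, 2),
    ("zram3", -5, 3),
]


def node_avail_list(devices, nid):
    """Return [(key, name), ...] for node `nid`, lowest key first."""
    entries = []
    for name, prio, dev_nid in devices:
        if prio < 0 and dev_nid == nid:
            key = 1          # assumed promotion of the node-local device
        else:
            key = -prio      # plist sorts low to high, so negate
        entries.append((key, name))
    # sorted() is stable, so devices with equal keys keep swapon order.
    return sorted(entries, key=lambda entry: entry[0])


if __name__ == "__main__":
    for nid in range(8):
        chain = " -> ".join(f"{name}(prio:{key})"
                            for key, name in node_avail_list(DEVICES, nid))
        print(f"node {nid}: {chain}")
```

In this model, nodes 4 to 7 own no device, so all of them see zram0 first,
which is the imbalance described below.
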
> The adjustment for swap devices with a node_id was intended to reduce
> lock contention on any one swap device by using different swap devices
> on different nodes. The adjustment was introduced in commit
> a2468cc9bfdf ("swap: choose swap device according to numa node").
> However, the adjustment is a little coarse-grained. On a node, the swap
> device sharing the node's id is always selected first by that node's CPUs
> until it is exhausted, then the next one. And on other nodes where no
> swap device shares the node id, the swap device with priority '-2' is
> selected first until exhausted, then the next one with priority '-3'.
>
> This is the swapon output taken while a high-pressure vm-scalability
> test is running. It clearly shows zram0 being heavily used until
> exhausted.
>
> ===================================
> [root@hp-dl385g10-03 ~]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 15.7G   -2
> /dev/zram1 partition  16G  3.4G   -3
> /dev/zram2 partition  16G  3.4G   -4
> /dev/zram3 partition  16G  2.6G   -5
>
> The node-based strategy for selecting a swap device is much better than
> the old way of selecting swap devices one by one. However, it is still
> unreasonable, because swap devices are assumed to have similar access
> speed if no priority is specified at swapon. It is unfair, and makes no
> sense, for one swap device to get a higher priority than another just
> because it was swapped on first.
>
> So this patchset changes swap allocation to select swap devices round
> robin when they have the default priority. In code, the plist array
> swap_avail_heads[nid] is replaced with a single plist, swap_avail_head,
> which reverts commit a2468cc9bfdf. On top of the revert, a further change
> gives every device without a specified priority the same default
> priority, '-1'. Swap devices with a specified priority are still always
> placed first; this is not affected. If you care about differences in
> access speed, use 'swapon -p xx' to set priorities for your swap devices.
>
> New behaviour:
>
> swap_avail_list: /* one global available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:1     prio:1     prio:1
>
> This is the swapon output taken while the high-pressure vm-scalability
> test is running; all devices are selected round robin:
> =======================================
> [root@hp-dl385g10-03 linux]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 12.6G   -1
> /dev/zram1 partition  16G 12.6G   -1
> /dev/zram2 partition  16G 12.6G   -1
> /dev/zram3 partition  16G 12.6G   -1
>
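
The difference between the old and the new selection can be modelled in a
few lines. This is a userspace sketch in Python, not the kernel
implementation: pick_and_requeue() is a hypothetical helper, and it assumes
the allocator always takes the head of the list and then requeues that
entry behind the other entries with the same key. Under that assumption,
distinct keys always yield the same device, while equal keys rotate.

```python
# Userspace model only; pick_and_requeue() is not a kernel function.

def pick_and_requeue(avail):
    """Take the head of `avail` and requeue it behind equal-key peers.

    `avail` is a list of (key, name) tuples sorted by key, lowest first.
    The list is modified in place; the chosen device name is returned.
    """
    key, name = avail.pop(0)
    idx = 0
    # Skip every remaining entry whose key is not larger than ours.
    while idx < len(avail) and avail[idx][0] <= key:
        idx += 1
    avail.insert(idx, (key, name))
    return name


if __name__ == "__main__":
    # Old default priorities -2..-5 become keys 2..5 (a node owning no device).
    old = [(2, "zram0"), (3, "zram1"), (4, "zram2"), (5, "zram3")]
    # New behaviour: every default-priority device has key 1.
    new = [(1, "zram0"), (1, "zram1"), (1, "zram2"), (1, "zram3")]

    print("old:", [pick_and_requeue(old) for _ in range(6)])
    print("new:", [pick_and_requeue(new) for _ in range(6)])
```

The model ignores devices filling up; in the old scheme the next device is
only reached once the head is exhausted and leaves the list.
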
> With the change, we see roughly an 18% performance improvement, as
> shown below:
>
> vm-scalability test:
> ====================
> Test with:
> usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
>                             Before:          After:
> System time:                637.92 s         526.74 s        (lower is better)
> Sum Throughput:             3546.56 MB/s     4207.56 MB/s    (higher is better)
> Single process Throughput:  114.40 MB/s      135.72 MB/s     (higher is better)
> free latency:               10138455.99 us   6810119.01 us   (lower is better)
>
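
The relative changes behind the quoted figure can be recomputed from the
table. A minimal sketch in Python; pct_change() and the RESULTS dictionary
are illustrative names, and the numbers are copied from the table above.

```python
# Recompute the relative change for each row of the quoted results.

def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the vm-scalability table.
RESULTS = {
    "System time (s)": (637.92, 526.74),
    "Sum Throughput (MB/s)": (3546.56, 4207.56),
    "Single process Throughput (MB/s)": (114.40, 135.72),
    "free latency (us)": (10138455.99, 6810119.01),
}

if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        print(f"{name}: {pct_change(before, after):+.1f}%")
```

By this arithmetic both throughput rows rise by about 18.6%, system time
drops by about 17.4%, and free latency drops by about 32.8%.
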
> Changelog:
> ==========
> v4->v5:
> ------
> - Rebase on the latest mm-new;
> - Clean up the relics of swap_numa in Documentation/admin-guide/mm/index.rst.
>
> v3->v4:
> ------
> - Rebase on the latest mm-new;
> - Add Chris's Suggested-by and Acked-by.
>
> v2->v3:
> -------
> - Split the v2 patch into two parts: the first reverts commit
>   a2468cc9bfdf, the second sets the default priority to -1 for all
>   swap devices, which makes swap-out select swap devices round robin.
>   This eases patch review, as suggested by Chris, thanks.
> - Fix an LKP-reported issue: I mistakenly left other debugging code in
>   the v2 patch; clean that up.
>
> v1->v2:
> -------
> - Remove Documentation/admin-guide/mm/swap_numa.rst;
> - Add back mistakenly removed lockdep_assert_held() line;
> - Remove the unneeded code comment in _enable_swap_info().
> Thanks a lot to Chris, YoungJun and Kairui for the careful review.
>
> Baoquan He (2):
>   mm/swap: do not choose swap device according to numa node
>   mm/swap: select swap device with default priority round robin
>
>  Documentation/admin-guide/mm/index.rst     |   1 -
>  Documentation/admin-guide/mm/swap_numa.rst |  78 ---------------
>  include/linux/swap.h                       |  11 +--
>  mm/swapfile.c                              | 106 ++++-----------------
>  4 files changed, 17 insertions(+), 179 deletions(-)
>  delete mode 100644 Documentation/admin-guide/mm/swap_numa.rst
>
> --
> 2.41.0
>
>

Glad to see the performance is better and the code is cleaner, thanks!

For the series:

Reviewed-by: Kairui Song <kasong@tencent.com>