Re: [PATCH v5 mm-new 0/2] mm/swapfile.c: select swap devices of default priority round robin


From: Kairui Song <ryncsn@gmail.com>
To: Baoquan He <bhe@redhat.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, chrisl@kernel.org,
	 youngjun.park@lge.com, baohua@kernel.org,
	shikemeng@huaweicloud.com,  nphamcs@gmail.com
Subject: Re: [PATCH v5 mm-new 0/2] mm/swapfile.c: select swap devices of default priority round robin
Date: Wed, 29 Oct 2025 23:38:10 +0800
Message-ID: <CAMgjq7CkMXwQuyXZWJuiqxHXQ=CWPoFN+aQtioN941Z6To1qFg@mail.gmail.com>
In-Reply-To: <20251028034308.929550-1-bhe@redhat.com>

On Wed, Oct 29, 2025 at 4:30 AM Baoquan He <bhe@redhat.com> wrote:
>
> Currently, on a system with multiple swap devices, swap allocation
> selects one swap device according to priority. The swap device with the
> highest priority is chosen first.
>
> A priority from 0 to 32767 can be specified when a swap device is
> enabled with swapon; otherwise the system assigns one by default,
> starting at -2 and counting downwards. Meanwhile, on a NUMA system, a
> swap device that belongs to a node is considered first on that node.
>
> In the current code, an array of plists, swap_avail_heads[nid], is used
> to organize swap devices on each NUMA node. For each NUMA node, there
> is a plist organizing all swap devices. The 'prio' value in the plist
> is the negated value of the device's priority, because a plist is sorted
> from low to high. The swap device belonging to a node is promoted to
> the front position on that NUMA node, and the other swap devices follow
> in order of their default priority.
>
> E.g. I have a system with 8 NUMA nodes, and I set up 4 zram partitions
> as swap devices.
>
> Current behaviour:
> their priorities will be (note that -1 is skipped):
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition  16G   0B   -2
> /dev/zram1 partition  16G   0B   -3
> /dev/zram2 partition  16G   0B   -4
> /dev/zram3 partition  16G   0B   -5
>
> And their positions in the 8 swap_avail_lists[nid] will be:
> swap_avail_lists[0]: /* node 0's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:3     prio:4     prio:5
> swap_avail_lists[1]: /* node 1's available swap device list */
> zram1   -> zram0   -> zram2   -> zram3
> prio:1     prio:2     prio:4     prio:5
> swap_avail_lists[2]: /* node 2's available swap device list */
> zram2   -> zram0   -> zram1   -> zram3
> prio:1     prio:2     prio:3     prio:5
> swap_avail_lists[3]: /* node 3's available swap device list */
> zram3   -> zram0   -> zram1   -> zram2
> prio:1     prio:2     prio:3     prio:4
> swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:2     prio:3     prio:4     prio:5
>
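
To double-check the per-node lists quoted above, here is a small
user-space model of the ordering rule as the cover letter describes it.
This is only a sketch, not kernel code: the function name, the tuple
layout, and the assumption that zramN sits on NUMA node N are all mine,
inferred from the example lists.

```python
# User-space model only; names are illustrative, not kernel identifiers.
from typing import List, Tuple

# (name, swap priority, home NUMA node). The home nodes are an assumption
# inferred from the example: zramN is taken to belong to node N.
Device = Tuple[str, int, int]

DEVICES: List[Device] = [
    ("zram0", -2, 0),
    ("zram1", -3, 1),
    ("zram2", -4, 2),
    ("zram3", -5, 3),
]


def node_order(devices: List[Device], nid: int) -> List[Tuple[str, int]]:
    """Return (name, plist prio) pairs in the order node `nid` would try them."""
    entries = []
    for name, priority, home in devices:
        if priority < 0 and home == nid:
            # A default-priority device on its own node is promoted to the front.
            key = 1
        else:
            # The plist sorts low to high, so the key is the negated priority.
            key = -priority
        entries.append((name, key))
    # sorted() is stable, so equal keys keep their insertion order.
    return sorted(entries, key=lambda entry: entry[1])


if __name__ == "__main__":
    for nid in range(8):
        print(nid, node_order(DEVICES, nid))
```

Under those assumptions the model gives the same ordering as the quoted
lists for nodes 0 to 3, and the unpromoted zram0-first order for nodes
4 to 7.
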
> The adjustment for swap devices with a node_id was intended to reduce
> lock contention on any one swap device by using a different swap
> device on each node. The adjustment was introduced in commit
> a2468cc9bfdf ("swap: choose swap device according to numa node").
> However, the adjustment is a little coarse-grained. On a node, the swap
> device sharing the node's id will always be selected first by that node's
> CPUs until it is exhausted, then the next one. And on other nodes where
> no swap device shares the node id, the swap device with priority '-2' will
> be selected first until exhausted, then the next one with priority '-3'.
>
> This is the swapon output while the high-pressure vm-scalability
> test is running. It clearly shows that zram0 is heavily used until
> exhausted.
>
> ===================================
> [root@hp-dl385g10-03 ~]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 15.7G   -2
> /dev/zram1 partition  16G  3.4G   -3
> /dev/zram2 partition  16G  3.4G   -4
> /dev/zram3 partition  16G  2.6G   -5
>
> The node based strategy for selecting a swap device is much better than
> the old way of selecting swap devices one by one. However, it is still
> unreasonable, because swap devices are assumed to have similar access
> speed if no priority is specified at swapon time. It's unfair and doesn't
> make sense that one swap device gets a higher priority than another just
> because it was swapped on earlier.
>
> So in this patchset, swap devices with the default priority are selected
> round robin. In code, the plist array swap_avail_heads[nid] is replaced
> with a single plist swap_avail_head, which reverts commit a2468cc9bfdf.
> Meanwhile, on top of the revert, a further change makes every device w/o
> a specified priority get the same default priority '-1'. Swap devices
> with a specified priority are still always put foremost; this is not
> impacted. If you care about their different access speeds, use
> 'swapon -p xx' to set priorities for your swap devices.
>
> New behaviour:
>
> swap_avail_list: /* one global available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:1     prio:1     prio:1
>
> This is the swapon output while the high-pressure vm-scalability test
> is running; all devices are selected round robin:
> =======================================
> [root@hp-dl385g10-03 linux]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 12.6G   -1
> /dev/zram1 partition  16G 12.6G   -1
> /dev/zram2 partition  16G 12.6G   -1
> /dev/zram3 partition  16G 12.6G   -1
>
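
For anyone wondering why equal priorities turn into even usage, the
toy model below may help. It assumes the allocator always takes the
head of the list and then moves that device behind its equal-priority
siblings, which is my reading of the plist requeue step and not
something this cover letter spells out. Device exhaustion and the real
allocation granularity are ignored, and all names are made up for the
illustration.

```python
# Toy user-space model; not kernel code. Each entry is (name, plist prio),
# already sorted from low to high prio as a plist would be.
from typing import Dict, List, Tuple

Entry = Tuple[str, int]

OLD_LIST: List[Entry] = [("zram0", 2), ("zram1", 3), ("zram2", 4), ("zram3", 5)]
NEW_LIST: List[Entry] = [("zram0", 1), ("zram1", 1), ("zram2", 1), ("zram3", 1)]


def requeue_head(plist: List[Entry]) -> None:
    """Move the head entry behind every other entry with the same prio."""
    head = plist.pop(0)
    pos = 0
    while pos < len(plist) and plist[pos][1] == head[1]:
        pos += 1
    plist.insert(pos, head)


def simulate(plist: List[Entry], allocations: int) -> Dict[str, int]:
    """Count how many allocations each device serves (capacity is ignored)."""
    plist = list(plist)  # work on a copy so the caller's list is untouched
    used = {name: 0 for name, _ in plist}
    for _ in range(allocations):
        name, _prio = plist[0]
        used[name] += 1
        requeue_head(plist)
    return used


if __name__ == "__main__":
    print("distinct prios:", simulate(OLD_LIST, 8))
    print("equal prios:   ", simulate(NEW_LIST, 8))
```

With distinct prios the head never changes, so zram0 serves every
request until it would run out; with equal prios the head rotates and
the load spreads evenly, which is the pattern in the swapon output
above.
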
> With the change, we can see about an 18% efficiency improvement, as below:
>
> vm-scalability test:
> ====================
> Test with:
> usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
>                            Before:          After:
> System time:               637.92 s         526.74 s      (lower is better)
> Sum Throughput:            3546.56 MB/s     4207.56 MB/s  (higher is better)
> Single process Throughput: 114.40 MB/s      135.72 MB/s   (higher is better)
> free latency:              10138455.99 us   6810119.01 us (lower is better)
>
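
As a quick sanity check of the "about 18%" figure, the relative change
of each row can be recomputed from the numbers in the table. The helper
name below is mine; the values are copied from the table as posted.

```python
# Recompute the relative change for each benchmark row quoted above.
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


RESULTS = {
    "System time (s)": (637.92, 526.74),
    "Sum throughput (MB/s)": (3546.56, 4207.56),
    "Single process throughput (MB/s)": (114.40, 135.72),
    "Free latency (us)": (10138455.99, 6810119.01),
}

if __name__ == "__main__":
    for label, (before, after) in RESULTS.items():
        print(f"{label}: {pct_change(before, after):+.1f}%")
```

That works out to roughly 17.4% less system time, 18.6% more throughput
(both sum and single process), and roughly 32.8% lower free latency, so
"about 18%" matches the throughput rows.
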
> Changelog:
> ==========
> v4->v5:
> ------
> - Rebase on the latest mm-new;
> - Clean up the relics of swap_numa in Documentation/admin-guide/mm/index.rst.
>
> v3->v4:
> ------
> - Rebase on the latest mm-new;
> - Add Chris's Suggested-by and Acked-by.
>
> v2->v3:
> -------
> - Split the v2 patch into two parts: the first reverts commit
>   a2468cc9bfdf, the second sets the default priority to -1 for all
>   swap devices, which makes swap-out select swap devices round
>   robin. This eases patch review, as suggested by Chris, thanks.
> - Fix an LKP-reported issue: I mistakenly added other debugging code
>   into the v2 patch. Clean that up.
>
> v1->v2:
> -------
> - Remove Documentation/admin-guide/mm/swap_numa.rst;
> - Add back mistakenly removed lockdep_assert_held() line;
> - Remove the unneeded code comment in _enable_swap_info().
>   Thanks a lot for careful reviewing from Chris, YoungJun and Kairui.
>
> Baoquan He (2):
>   mm/swap: do not choose swap device according to numa node
>   mm/swap: select swap device with default priority round robin
>
>  Documentation/admin-guide/mm/index.rst     |   1 -
>  Documentation/admin-guide/mm/swap_numa.rst |  78 ---------------
>  include/linux/swap.h                       |  11 +--
>  mm/swapfile.c                              | 106 ++++-----------------
>  4 files changed, 17 insertions(+), 179 deletions(-)
>  delete mode 100644 Documentation/admin-guide/mm/swap_numa.rst
>
> --
> 2.41.0
>
>

Glad to see the performance is better and the code is cleaner, thanks!

For the series:

Reviewed-by: Kairui Song <kasong@tencent.com>

