linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Chris Li <chrisl@kernel.org>
To: Baoquan He <bhe@redhat.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	kasong@tencent.com,  baohua@kernel.org,
	shikemeng@huaweicloud.com, nphamcs@gmail.com,
	 YoungJun Park <youngjun.park@lge.com>
Subject: Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
Date: Wed, 24 Sep 2025 08:54:53 -0700	[thread overview]
Message-ID: <CACePvbUhnNUgqbTUrBYoMK-VeCB1Lntkeijof8ehOz77M8H6Bg@mail.gmail.com> (raw)
In-Reply-To: <20250924091746.146461-1-bhe@redhat.com>

Hi Baoquan,

Very exciting numbers. I have always suspected the per node priority
is not doing much contribution in the new swap allocator world. I did
not expect it to have negative contributions.

On Wed, Sep 24, 2025 at 2:18 AM Baoquan He <bhe@redhat.com> wrote:
>
> Currently, on system with multiple swap devices, swap allocation will
> select one swap device according to priority. The swap device with the
> highest priority will be chosen to allocate firstly.
>
> People can specify a priority from 0 to 32767 when swapon a swap device,
> or the system will set it from -2 then downwards by default. Meanwhile,
> on NUMA system, the swap device with node_id will be considered first
> on that NUMA node of the node_id.

That behavior was introduced by: a2468cc9bfdf ("swap: choose swap
device according to numa node")
You are effectively reverting that patch and the following fix up
patches on top of that.
The commit message or maybe the title should reflect the reversion nature.

If you did more than the simple revert plus fix up, please document
what additional change you make in this patch.

>
> In the current code, an array of plist, swap_avail_heads[nid], is used
> to organize swap devices on each NUMA node. For each NUMA node, there
> is a plist organizing all swap devices. The 'prio' value in the plist
> is the negated value of the device's priority due to plist being sorted
> from low to high. The swap device owning one node_id will be promoted to
> the front position on that NUMA node, then other swap devices are put in
> order of their default priority.
>

The original patch that introduced it is using SSD as a benchmark.
Here you are using patched zram as a benchmark. You want to explain in
a little bit detail why you choose a different test method. e.g. You
don't have a machine with an SSD device as a raw partition to do the
original test. Compression ram based swap device, zswap or zram is
used a lot more in the data center server and android workload
environment, maybe even in some Linux workstation distro as well.

You can also invite others who do have the spare SSD driver to test
the SSD as a swap device. Maybe with some setup instructions how to
set up and repeat your test on their machine with multiple SSD drives.
How to compare the result with or without your reversion patch.

> E.g I got a system with 8 NUMA nodes, and I setup 4 zram partition as
> swap devices.

You want to make it clear up front that you are using a patched zram
to simulate the per node swap device behavior. Native zram does not
have that.

>
> Current behaviour:
> their priorities will be(note that -1 is skipped):
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition  16G   0B   -2
> /dev/zram1 partition  16G   0B   -3
> /dev/zram2 partition  16G   0B   -4
> /dev/zram3 partition  16G   0B   -5
>
> And their positions in the 8 swap_avail_lists[nid] will be:
> swap_avail_lists[0]: /* node 0's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:3     prio:4     prio:5
> swap_avali_lists[1]: /* node 1's available swap device list */
> zram1   -> zram0   -> zram2   -> zram3
> prio:1     prio:2     prio:4     prio:5
> swap_avail_lists[2]: /* node 2's available swap device list */
> zram2   -> zram0   -> zram1   -> zram3
> prio:1     prio:2     prio:3     prio:5
> swap_avail_lists[3]: /* node 3's available swap device list */
> zram3   -> zram0   -> zram1   -> zram2
> prio:1     prio:2     prio:3     prio:4
> swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:2     prio:3     prio:4     prio:5
>
> The adjustment for swap device with node_id intended to decrease the
> pressure of lock contention for one swap device by taking different
> swap device on different node. However, the adjustment is very
> coarse-grained. On the node, the swap device sharing the node's id will
> always be selected firstly by node's CPUs until exhausted, then next one.
> And on other nodes where no swap device shares its node id, swap device
> with priority '-2' will be selected firstly until exhausted, then next
> with priority '-3'.
>
> This is the swapon output during the process high pressure vm-scability
> test is being taken. It's clearly shown zram0 is heavily exploited until
> exhausted.

Any tips how others repeat your high pressure vm-scability test,
especially for someone who has multiple SSD drives as a test swap
device. Some test script setup would be nice. You can post the
instruction in the same email thread as separate email, it does not
have to be in the commit message.

> ===================================
> [root@hp-dl385g10-03 ~]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 15.7G   -2
> /dev/zram1 partition  16G  3.4G   -3
> /dev/zram2 partition  16G  3.4G   -4
> /dev/zram3 partition  16G  2.6G   -5
>
> This is unreasonable because swap devices are assumed to have similar
> accessing speed if no priority is specified when swapon. It's unfair and
> doesn't make sense just because one swap device is swapped on firstly,
> its priority will be higher than the one swapped on later.
>
> So here change is made to select the swap device round robin if default
> priority. In code, the plist array swap_avail_heads[nid] is replaced
> with a plist swap_avail_head. Any device w/o specified priority will get
> the same default priority '-1'. Surely, swap device with specified priority
> are always put foremost, this is not impacted. If you care about their
> different accessing speed, then use 'swapon -p xx' to deploy priority for
> your swap devices.
>
> New behaviour:
>
> swap_avail_list: /* one global available swap device list */
> zram0   -> zram1   -> zram2   -> zram3
> prio:1     prio:1     prio:1     prio:1
>
> This is the swapon output during the process high pressure vm-scability
> being taken, all is selected round robin:
> =======================================
> [root@hp-dl385g10-03 linux]# swapon
> NAME       TYPE      SIZE  USED PRIO
> /dev/zram0 partition  16G 12.6G   -1
> /dev/zram1 partition  16G 12.6G   -1
> /dev/zram2 partition  16G 12.6G   -1
> /dev/zram3 partition  16G 12.6G   -1
>
> With the change, we can see about 18% efficiency promotion as below:
>
> vm-scability test:
> ==================
> Test with:
> usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
>                            Before:          After:
> System time:               637.92 s         526.74 s
You can clarify here lower is better.
> Sum Throughput:            3546.56 MB/s     4207.56 MB/s

Higher is better. Also a percentage number can be useful here. e.g.
that is +18.6% percent improvement since reverting to round robin. A
huge difference!

> Single process Throughput: 114.40 MB/s      135.72 MB/s
Higher is better.
> free latency:              10138455.99 us   6810119.01 us
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  include/linux/swap.h | 11 +-----
>  mm/swapfile.c        | 94 +++++++-------------------------------------

Very nice patch stats! Fewer code and runs faster. What more can we ask for?


>  2 files changed, 16 insertions(+), 89 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3473e4247ca3..f72c8e5e0635 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -337,16 +337,7 @@ struct swap_info_struct {
>         struct work_struct discard_work; /* discard worker */
>         struct work_struct reclaim_work; /* reclaim worker */
>         struct list_head discard_clusters; /* discard clusters list */
> -       struct plist_node avail_lists[]; /*
> -                                          * entries in swap_avail_heads, one
> -                                          * entry per node.
> -                                          * Must be last as the number of the
> -                                          * array is nr_node_ids, which is not
> -                                          * a fixed value so have to allocate
> -                                          * dynamically.
> -                                          * And it has to be an array so that
> -                                          * plist_for_each_* can work.
> -                                          */
> +       struct plist_node avail_list;   /* entry in swap_avail_head */
>  };
>
>  static inline swp_entry_t page_swap_entry(struct page *page)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b4f3cc712580..d8a54e5af16d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
>  EXPORT_SYMBOL_GPL(nr_swap_pages);
>  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>  long total_swap_pages;
> -static int least_priority = -1;
> +#define DEF_SWAP_PRIO  -1
>  unsigned long swapfile_maximum_size;
>  #ifdef CONFIG_MIGRATION
>  bool swap_migration_ad_supported;
> @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
>   * is held and the locking order requires swap_lock to be taken
>   * before any swap_info_struct->lock.
>   */
> -static struct plist_head *swap_avail_heads;
> +static PLIST_HEAD(swap_avail_head);
>  static DEFINE_SPINLOCK(swap_avail_lock);
>
>  static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> @@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
>  static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>  {
> -       int nid;
>         unsigned long pages;
>
>         spin_lock(&swap_avail_lock);
> @@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>                  * swap_avail_lock, to ensure the result can be seen by
>                  * add_to_avail_list.
>                  */
> -               lockdep_assert_held(&si->lock);
> +               //lockdep_assert_held(&si->lock);

That seems like some debug stuff left over. If you need to remove it, remove it.

The rest of the patch looks fine to me. Thanks for working on it. That
is a very nice cleanup.

Agree with YoungJun Park on removing the numa swap document as well.

Looking forward to your refresh version. I should be able to Ack-by on
your next version.

Chris


>                 si->flags &= ~SWP_WRITEOK;
>                 atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
>         } else {
> @@ -1024,8 +1023,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>                         goto skip;
>         }
>
> -       for_each_node(nid)
> -               plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
> +       plist_del(&si->avail_list, &swap_avail_head);
>
>  skip:
>         spin_unlock(&swap_avail_lock);
> @@ -1034,7 +1032,6 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>  /* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
>  static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
>  {
> -       int nid;
>         long val;
>         unsigned long pages;
>
> @@ -1067,8 +1064,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
>                         goto skip;
>         }
>
> -       for_each_node(nid)
> -               plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> +       plist_add(&si->avail_list, &swap_avail_head);
>
>  skip:
>         spin_unlock(&swap_avail_lock);
> @@ -1211,16 +1207,14 @@ static bool swap_alloc_fast(swp_entry_t *entry,
>  static bool swap_alloc_slow(swp_entry_t *entry,
>                             int order)
>  {
> -       int node;
>         unsigned long offset;
>         struct swap_info_struct *si, *next;
>
> -       node = numa_node_id();
>         spin_lock(&swap_avail_lock);
>  start_over:
> -       plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
> +       plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
>                 /* Rotate the device and switch to a new cluster */
> -               plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> +               plist_requeue(&si->avail_list, &swap_avail_head);
>                 spin_unlock(&swap_avail_lock);
>                 if (get_swap_device_info(si)) {
>                         offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> @@ -1245,7 +1239,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
>                  * still in the swap_avail_head list then try it, otherwise
>                  * start over if we have not gotten any slots.
>                  */
> -               if (plist_node_empty(&next->avail_lists[node]))
> +               if (plist_node_empty(&si->avail_list))
>                         goto start_over;
>         }
>         spin_unlock(&swap_avail_lock);
> @@ -2535,44 +2529,18 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
>         return generic_swapfile_activate(sis, swap_file, span);
>  }
>
> -static int swap_node(struct swap_info_struct *si)
> -{
> -       struct block_device *bdev;
> -
> -       if (si->bdev)
> -               bdev = si->bdev;
> -       else
> -               bdev = si->swap_file->f_inode->i_sb->s_bdev;
> -
> -       return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> -}
> -
>  static void setup_swap_info(struct swap_info_struct *si, int prio,
>                             unsigned char *swap_map,
>                             struct swap_cluster_info *cluster_info,
>                             unsigned long *zeromap)
>  {
> -       int i;
> -
> -       if (prio >= 0)
> -               si->prio = prio;
> -       else
> -               si->prio = --least_priority;
> +       si->prio = prio;
>         /*
>          * the plist prio is negated because plist ordering is
>          * low-to-high, while swap ordering is high-to-low
>          */
>         si->list.prio = -si->prio;
> -       for_each_node(i) {
> -               if (si->prio >= 0)
> -                       si->avail_lists[i].prio = -si->prio;
> -               else {
> -                       if (swap_node(si) == i)
> -                               si->avail_lists[i].prio = 1;
> -                       else
> -                               si->avail_lists[i].prio = -si->prio;
> -               }
> -       }
> +       si->avail_list.prio = -si->prio;
>         si->swap_map = swap_map;
>         si->cluster_info = cluster_info;
>         si->zeromap = zeromap;
> @@ -2721,20 +2689,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         }
>         spin_lock(&p->lock);
>         del_from_avail_list(p, true);
> -       if (p->prio < 0) {
> -               struct swap_info_struct *si = p;
> -               int nid;
> -
> -               plist_for_each_entry_continue(si, &swap_active_head, list) {
> -                       si->prio++;
> -                       si->list.prio--;
> -                       for_each_node(nid) {
> -                               if (si->avail_lists[nid].prio != 1)
> -                                       si->avail_lists[nid].prio--;
> -                       }
> -               }
> -               least_priority++;
> -       }
>         plist_del(&p->list, &swap_active_head);
>         atomic_long_sub(p->pages, &nr_swap_pages);
>         total_swap_pages -= p->pages;
> @@ -2972,9 +2926,8 @@ static struct swap_info_struct *alloc_swap_info(void)
>         struct swap_info_struct *p;
>         struct swap_info_struct *defer = NULL;
>         unsigned int type;
> -       int i;
>
> -       p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
> +       p = kvzalloc(sizeof(struct swap_info_struct), GFP_KERNEL);
>         if (!p)
>                 return ERR_PTR(-ENOMEM);
>
> @@ -3013,8 +2966,7 @@ static struct swap_info_struct *alloc_swap_info(void)
>         }
>         p->swap_extent_root = RB_ROOT;
>         plist_node_init(&p->list, 0);
> -       for_each_node(i)
> -               plist_node_init(&p->avail_lists[i], 0);
> +       plist_node_init(&p->avail_list, 0);
>         p->flags = SWP_USED;
>         spin_unlock(&swap_lock);
>         if (defer) {
> @@ -3282,9 +3234,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (!capable(CAP_SYS_ADMIN))
>                 return -EPERM;
>
> -       if (!swap_avail_heads)
> -               return -ENOMEM;
> -
>         si = alloc_swap_info();
>         if (IS_ERR(si))
>                 return PTR_ERR(si);
> @@ -3465,7 +3414,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         }
>
>         mutex_lock(&swapon_mutex);
> -       prio = -1;
> +       prio = DEF_SWAP_PRIO;
>         if (swap_flags & SWAP_FLAG_PREFER)
>                 prio = swap_flags & SWAP_FLAG_PRIO_MASK;
>         enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
> @@ -3904,7 +3853,6 @@ static bool __has_usable_swap(void)
>  void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
>  {
>         struct swap_info_struct *si, *next;
> -       int nid = folio_nid(folio);
>
>         if (!(gfp & __GFP_IO))
>                 return;
> @@ -3923,8 +3871,8 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
>                 return;
>
>         spin_lock(&swap_avail_lock);
> -       plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
> -                                 avail_lists[nid]) {
> +       plist_for_each_entry_safe(si, next, &swap_avail_head,
> +                                 avail_list) {
>                 if (si->bdev) {
>                         blkcg_schedule_throttle(si->bdev->bd_disk, true);
>                         break;
> @@ -3936,18 +3884,6 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
>
>  static int __init swapfile_init(void)
>  {
> -       int nid;
> -
> -       swap_avail_heads = kmalloc_array(nr_node_ids, sizeof(struct plist_head),
> -                                        GFP_KERNEL);
> -       if (!swap_avail_heads) {
> -               pr_emerg("Not enough memory for swap heads, swap is disabled\n");
> -               return -ENOMEM;
> -       }
> -
> -       for_each_node(nid)
> -               plist_head_init(&swap_avail_heads[nid]);
> -
>         swapfile_maximum_size = arch_max_swapfile_size();
>
>  #ifdef CONFIG_MIGRATION
> --
> 2.41.0
>
>


  parent reply	other threads:[~2025-09-24 15:55 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-24  9:17 Baoquan He
2025-09-24 10:23 ` Baoquan He
2025-09-24 15:41 ` YoungJun Park
2025-09-25  2:24   ` Baoquan He
2025-09-24 15:52 ` YoungJun Park
2025-09-25  4:10   ` Baoquan He
2025-09-25  4:23     ` Baoquan He
2025-09-24 15:54 ` Chris Li [this message]
2025-09-24 16:06   ` Chris Li
2025-09-25  2:15     ` Baoquan He
2025-09-25 18:31       ` Chris Li
2025-09-25  1:55   ` Baoquan He
2025-09-25 18:25     ` Chris Li
2025-09-26 15:31       ` Baoquan He
2025-09-27  4:46         ` Chris Li
2025-09-28  2:14           ` Baoquan He
2025-09-24 16:34 ` YoungJun Park
2025-09-25  0:24   ` Baoquan He
2025-09-25  4:36 ` Kairui Song
2025-09-25  6:18   ` Baoquan He

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CACePvbUhnNUgqbTUrBYoMK-VeCB1Lntkeijof8ehOz77M8H6Bg@mail.gmail.com \
    --to=chrisl@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=bhe@redhat.com \
    --cc=kasong@tencent.com \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=shikemeng@huaweicloud.com \
    --cc=youngjun.park@lge.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox