Re: [PATCH v3 2/2] mm/swap: select swap device with default priority round robin


From: Baoquan He <bhe@redhat.com>
To: Chris Li <chrisl@kernel.org>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	kasong@tencent.com, youngjun.park@lge.com, aaron.lu@intel.com,
	baohua@kernel.org, shikemeng@huaweicloud.com, nphamcs@gmail.com
Subject: Re: [PATCH v3 2/2] mm/swap: select swap device with default priority round robin
Date: Thu, 2 Oct 2025 20:40:20 +0800
Message-ID: <aN5ytDQPCx9EQx1t@localhost>
In-Reply-To: <CACePvbUozJDOtj2hcepfhp865vGou7-s48NbrJ5sBT_udOeCgA@mail.gmail.com>

On 10/01/25 at 08:03pm, Chris Li wrote:
> Thanks for removing the node id complexity. Those negative priorities
> have been very hard to follow and reason about. The extra 18%
> performance boost is just the cherry on top.
> 
> I am very happy with this outcome. Could not ask for better.
> 
> Acked-by: Chris Li <chrisl@kernel.org>

Thanks a lot for the careful review.

> 
> Chris
> 
> On Mon, Sep 29, 2025 at 11:33 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > Swap devices are assumed to have similar access speed if no priority
> > is specified at swapon. It is unfair and makes no sense for one swap
> > device to get a higher priority than another merely because it was
> > swapped on earlier.
> >
> > Here, set all swap devices to have priority '-1' by default. With this
> > change, swap devices with the default priority are selected round robin
> > when swapping out. This can improve swapping efficiency a lot across
> > multiple swap devices with the default priority.
> >
> > Below is the swapon output while a high-pressure vm-scalability test
> > is running:
> >
> > 1) This is pre-commit a2468cc9bfdf: swap devices are selected one by one,
> >    by priority from high to low, each time one swap device is exhausted:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME       TYPE      SIZE   USED PRIO
> > /dev/zram0 partition  16G    16G   -1
> > /dev/zram1 partition  16G 966.2M   -2
> > /dev/zram2 partition  16G     0B   -3
> > /dev/zram3 partition  16G     0B   -4
> >
> > 2) This is the behaviour with commit a2468cc9bfdf: on a node that has a
> >    swap device sharing its node id, that device is selected first until
> >    exhausted; on a node with no swap device sharing its node id, the
> >    device with the highest priority is selected until exhausted:
> > ------------------------------------
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME       TYPE      SIZE  USED PRIO
> > /dev/zram0 partition  16G 15.7G   -2
> > /dev/zram1 partition  16G  3.4G   -3
> > /dev/zram2 partition  16G  3.4G   -4
> > /dev/zram3 partition  16G  2.6G   -5
> >
> > 3) After this patch is applied, swap devices with the default priority
> >    are selected round robin:
> > ------------------------------------
> > [root@hp-dl385g10-03 block]# swapon
> > NAME       TYPE      SIZE USED PRIO
> > /dev/zram0 partition  16G 6.6G   -1
> > /dev/zram1 partition  16G 6.6G   -1
> > /dev/zram2 partition  16G 6.6G   -1
> > /dev/zram3 partition  16G 6.6G   -1
> >
> > With the change, we can see about an 18% efficiency improvement relative
> > to the node based way, as shown below. (Of course, the pre-commit
> > a2468cc9bfdf way is the worst.)
> >
> > vm-scalability test:
> > ====================
> > Test with:
> > usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
> >                             one by one:      node based:      round robin:
> > System time:                1087.38 s        637.92 s         526.74 s         (lower is better)
> > Sum Throughput:             2036.55 MB/s     3546.56 MB/s     4207.56 MB/s     (higher is better)
> > Single process Throughput:  65.69 MB/s       114.40 MB/s      135.72 MB/s      (higher is better)
> > free latency:               15769409.48 us   10138455.99 us   6810119.01 us    (lower is better)
> >
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  mm/swapfile.c | 31 ++++---------------------------
> >  1 file changed, 4 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f9b3667fb08a..2bd8bd76ea28 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
> >  EXPORT_SYMBOL_GPL(nr_swap_pages);
> >  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> >  long total_swap_pages;
> > -static int least_priority;
> > +#define DEF_SWAP_PRIO  -1
> >  unsigned long swapfile_maximum_size;
> >  #ifdef CONFIG_MIGRATION
> >  bool swap_migration_ad_supported;
> > @@ -2534,10 +2534,7 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
> >                             struct swap_cluster_info *cluster_info,
> >                             unsigned long *zeromap)
> >  {
> > -       if (prio >= 0)
> > -               si->prio = prio;
> > -       else
> > -               si->prio = --least_priority;
> > +       si->prio = prio;
> >         /*
> >          * the plist prio is negated because plist ordering is
> >          * low-to-high, while swap ordering is high-to-low
> > @@ -2555,16 +2552,7 @@ static void _enable_swap_info(struct swap_info_struct *si)
> >         total_swap_pages += si->pages;
> >
> >         assert_spin_locked(&swap_lock);
> > -       /*
> > -        * both lists are plists, and thus priority ordered.
> > -        * swap_active_head needs to be priority ordered for swapoff(),
> > -        * which on removal of any swap_info_struct with an auto-assigned
> > -        * (i.e. negative) priority increments the auto-assigned priority
> > -        * of any lower-priority swap_info_structs.
> > -        * swap_avail_head needs to be priority ordered for folio_alloc_swap(),
> > -        * which allocates swap pages from the highest available priority
> > -        * swap_info_struct.
> > -        */
> > +
> >         plist_add(&si->list, &swap_active_head);
> >
> >         /* Add back to available list */
> > @@ -2692,17 +2680,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >         }
> >         spin_lock(&p->lock);
> >         del_from_avail_list(p, true);
> > -       if (p->prio < 0) {
> > -               struct swap_info_struct *si = p;
> > -               int nid;
> > -
> > -               plist_for_each_entry_continue(si, &swap_active_head, list) {
> > -                       si->prio++;
> > -                       si->list.prio--;
> > -                       si->avail_list.prio--;
> > -               }
> > -               least_priority++;
> > -       }
> >         plist_del(&p->list, &swap_active_head);
> >         atomic_long_sub(p->pages, &nr_swap_pages);
> >         total_swap_pages -= p->pages;
> > @@ -3428,7 +3405,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> >         }
> >
> >         mutex_lock(&swapon_mutex);
> > -       prio = -1;
> > +       prio = DEF_SWAP_PRIO;
> >         if (swap_flags & SWAP_FLAG_PREFER)
> >                 prio = swap_flags & SWAP_FLAG_PRIO_MASK;
> >         enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
> > --
> > 2.41.0
> >
> >
>
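
The effect of giving every device the same default priority can be
sketched in user space. This is only a minimal illustration under stated
assumptions, not the kernel code: struct swap_dev, requeue() and
alloc_slot() are hypothetical names, and a plain array sorted by
descending priority stands in for the priority-ordered list used in
mm/swapfile.c. After each allocation the chosen device is moved behind
its same-priority peers, so devices with equal priority take turns,
while devices with distinct priorities are still used strictly one by
one.

```c
#include <assert.h>
#include <stddef.h>

#define DEF_SWAP_PRIO (-1)	/* same value the patch uses as the default */

/* Hypothetical stand-in for a swap device; not struct swap_info_struct. */
struct swap_dev {
	const char *name;
	int prio;		/* higher value is preferred */
	long free_slots;
	long used;
};

/* Move devs[idx] behind the last following device that shares its priority. */
static void requeue(struct swap_dev *devs, size_t n, size_t idx)
{
	struct swap_dev picked = devs[idx];
	size_t i = idx;

	while (i + 1 < n && devs[i + 1].prio == picked.prio) {
		devs[i] = devs[i + 1];
		i++;
	}
	devs[i] = picked;
}

/*
 * Take one slot from the first device that still has room.
 * devs[] must be sorted by descending priority on entry.
 * Returns the name of the device used, or NULL if all are full.
 */
static const char *alloc_slot(struct swap_dev *devs, size_t n)
{
	assert(devs != NULL);

	for (size_t i = 0; i < n; i++) {
		const char *name = devs[i].name;

		if (devs[i].free_slots <= 0)
			continue;
		devs[i].free_slots--;
		devs[i].used++;
		requeue(devs, n, i);	/* rotate among equal priorities */
		return name;
	}
	return NULL;
}
```

With four devices at DEF_SWAP_PRIO and 40 calls to alloc_slot(), each
device ends up with 10 used slots. With priorities -1, -2, -3 and -4
instead, all 40 slots land on the first device, which matches the shape
of the pre-patch swapon output quoted above.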

