From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Thu, 9 Jan 2025 10:15:11 +0800
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
To: Baoquan He
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org
References: <20241230174621.61185-1-ryncsn@gmail.com> <20241230174621.61185-10-ryncsn@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Wed, Jan 8, 2025 at 7:10 PM Baoquan He wrote:

Thanks for the very detailed review!

> On 12/31/24 at 01:46am, Kairui Song wrote:
> ......snip.....
> > ---
> >  include/linux/swap.h |   3 +-
> >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> >  2 files changed, 246 insertions(+), 192 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 339d7f0192ff..c4ff31cb6bde 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> >   * throughput.
> >   */
> >  struct percpu_cluster {
> > +        local_lock_t lock; /* Protect the percpu_cluster above */
> >          unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> >  };
> >
> > @@ -313,7 +314,7 @@ struct swap_info_struct {
> >                                          /* list of cluster that contains at least one free slot */
> >          struct list_head frag_clusters[SWAP_NR_ORDERS];
> >                                          /* list of cluster that are fragmented or contented */
> > -        unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > +        atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> >          unsigned int pages;             /* total of usable pages of swap */
> >          atomic_long_t inuse_pages;      /* number of those currently in use */
> >          struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 7795a3d27273..dadd4fead689 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> >          folio_ref_sub(folio, nr_pages);
> >          folio_set_dirty(folio);
> >
> > -        spin_lock(&si->lock);
> >          /* Only sinple page folio can be backed by zswap */
> >          if (nr_pages == 1)
> >                  zswap_invalidate(entry);
> >          swap_entry_range_free(si, entry, nr_pages);
> > -        spin_unlock(&si->lock);
> >          ret = nr_pages;
> > out_unlock:
> >          folio_unlock(folio);
> > @@ -403,7 +401,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> >
> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
> >  {
> > -        return info->flags == CLUSTER_FLAG_FREE;
> > +        return info->count == 0;
>
> This is a little confusing. Maybe we should add one and call it
> cluster_is_empty(), because discarded clusters are also able to pass
> the check here.

Good idea, I agree on this, the new name is better.

> > +}
> > +
> > +static inline bool cluster_is_discard(struct swap_cluster_info *info)
> > +{
> > +        return info->flags == CLUSTER_FLAG_DISCARD;
> > +}
> > +
> > +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
> > +{
> > +        if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
> > +                return false;
> > +        if (!order)
> > +                return true;
> > +        return cluster_is_free(ci) || order == ci->order;
> >  }
> >
> >  static inline unsigned int cluster_index(struct swap_info_struct *si,
> > @@ -440,19 +452,20 @@ static void cluster_move(struct swap_info_struct *si,
> >  {
> >          VM_WARN_ON(ci->flags == new_flags);
> >          BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> > +        lockdep_assert_held(&ci->lock);
> >
> > -        if (ci->flags == CLUSTER_FLAG_NONE) {
> > +        spin_lock(&si->lock);
> > +        if (ci->flags == CLUSTER_FLAG_NONE)
> >                  list_add_tail(&ci->list, list);
> > -        } else {
> > -                if (ci->flags == CLUSTER_FLAG_FRAG) {
> > -                        VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> > -                        si->frag_cluster_nr[ci->order]--;
> > -                }
> > +        else
> >                  list_move_tail(&ci->list, list);
> > -        }
> > +        spin_unlock(&si->lock);
> > +
> > +        if (ci->flags == CLUSTER_FLAG_FRAG)
> > +                atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> > +        else if (new_flags == CLUSTER_FLAG_FRAG)
> > +                atomic_long_inc(&si->frag_cluster_nr[ci->order]);
> >          ci->flags = new_flags;
> > -        if (new_flags == CLUSTER_FLAG_FRAG)
> > -                si->frag_cluster_nr[ci->order]++;
> >  }
> >
> >  /* Add a cluster to discard list and schedule it to do discard */
> > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >
> >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > -        lockdep_assert_held(&si->lock);
> >          lockdep_assert_held(&ci->lock);
> >          cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> >          ci->order = 0;
> >  }
> >
> > +/*
> > + * Isolate and lock the first cluster that is not contented on a list,
> > + * clean its flag before taken off-list. Cluster flag must be in sync
> > + * with list status, so cluster updaters can always know the cluster
> > + * list status without touching si lock.
> > + *
> > + * Note it's possible that all clusters on a list are contented so
> > + * this returns NULL for an non-empty list.
> > + */
> > +static struct swap_cluster_info *cluster_isolate_lock(
> > +                struct swap_info_struct *si, struct list_head *list)
> > +{
> > +        struct swap_cluster_info *ci, *ret = NULL;
> > +
> > +        spin_lock(&si->lock);
> > +
> > +        if (unlikely(!(si->flags & SWP_WRITEOK)))
> > +                goto out;
> > +
> > +        list_for_each_entry(ci, list, list) {
> > +                if (!spin_trylock(&ci->lock))
> > +                        continue;
> > +
> > +                /* We may only isolate and clear flags of following lists */
> > +                VM_BUG_ON(!ci->flags);
> > +                VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > +                          ci->flags != CLUSTER_FLAG_FULL);
> > +
> > +                list_del(&ci->list);
> > +                ci->flags = CLUSTER_FLAG_NONE;
> > +                ret = ci;
> > +                break;
> > +        }
> > +out:
> > +        spin_unlock(&si->lock);
> > +
> > +        return ret;
> > +}
> > +
> >  /*
> >   * Doing discard actually. After a cluster discard is finished, the cluster
> > - * will be added to free cluster list. caller should hold si->lock.
> > -*/
> > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > + * will be added to free cluster list. Discard cluster is a bit special as
> > + * they don't participate in allocation or reclaim, so clusters marked as
> > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > + */
> > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> >  {
> >          struct swap_cluster_info *ci;
> > +        bool ret = false;
> >          unsigned int idx;
> >
> > +        spin_lock(&si->lock);
> >          while (!list_empty(&si->discard_clusters)) {
> >                  ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > +                /*
> > +                 * Delete the cluster from list but don't clear its flags until
> > +                 * discard is done, so isolation and relocation will skip it.
> > +                 */
> >                  list_del(&ci->list);
>
> I don't understand the above comment. ci has been taken off the list,
> while allocation needs to isolate from a usable list. Even though we
> clear ci->flags now, how come isolation and relocation will touch it?
> I may be missing something here.

There are many cases. One possible and common situation is that the
percpu cluster (si->percpu_cluster of another CPU) is still pointing
to it.

Also, this commit removed the si lock protection on allocation, and the
allocation path may also drop the ci lock to call reclaim, which means a
cluster could be used or freed by anyone before the allocator reacquires
the ci lock. In that case, the allocator could see a discard cluster.

So we don't clear the discard flag, in case anyone misuses it. I can add
more inline comments on this; there are already some related comments
above the function relocate_cluster, and I could add more referencing
those.

> > -                /* Must clear flag when taking a cluster off-list */
> > -                ci->flags = CLUSTER_FLAG_NONE;
> >                  idx = cluster_index(si, ci);
> >                  spin_unlock(&si->lock);
> > -
> >                  discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> >                                  SWAPFILE_CLUSTER);
> >
> > -                spin_lock(&si->lock);
> >                  spin_lock(&ci->lock);
> > -                __free_cluster(si, ci);
> > +                /*
> > +                 * Discard is done, clear its flags as it's now off-list,
> > +                 * then return the cluster to allocation list.
> > +                 */
> > +                ci->flags = CLUSTER_FLAG_NONE;
> >                  memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >                                  0, SWAPFILE_CLUSTER);
> > +                __free_cluster(si, ci);
> >                  spin_unlock(&ci->lock);
> > +                ret = true;
> > +                spin_lock(&si->lock);
> >          }
> > +        spin_unlock(&si->lock);
> > +        return ret;
> >  }
> >
> >  static void swap_discard_work(struct work_struct *work)
> ......snip....
> > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> >                                                unsigned char usage)
> >  {
> > -        struct percpu_cluster *cluster;
> >          struct swap_cluster_info *ci;
> >          unsigned int offset, found = 0;
> >
> > -new_cluster:
> > -        lockdep_assert_held(&si->lock);
> > -        cluster = this_cpu_ptr(si->percpu_cluster);
> > -        offset = cluster->next[order];
> > +        /* Fast path using per CPU cluster */
> > +        local_lock(&si->percpu_cluster->lock);
> > +        offset = __this_cpu_read(si->percpu_cluster->next[order]);
> >          if (offset) {
> > -                offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > +                ci = lock_cluster(si, offset);
> > +                /* Cluster could have been used by another order */
> > +                if (cluster_is_usable(ci, order)) {
> > +                        if (cluster_is_free(ci))
> > +                                offset = cluster_offset(si, ci);
> > +                        offset = alloc_swap_scan_cluster(si, offset, &found,
> > +                                                         order, usage);
> > +                } else {
> > +                        unlock_cluster(ci);
> > +                }
> >                  if (found)
> >                          goto done;
> >          }
> >
> > -        if (!list_empty(&si->free_clusters)) {
> > -                ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > -                offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > -                /*
> > -                 * Either we didn't touch the cluster due to swapoff,
> > -                 * or the allocation must success.
> > -                 */
> > -                VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > -                goto done;
> > +new_cluster:
> > +        ci = cluster_isolate_lock(si, &si->free_clusters);
> > +        if (ci) {
> > +                offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > +                                                 &found, order, usage);
> > +                if (found)
> > +                        goto done;
> >          }
> >
> >          /* Try reclaim from full clusters if free clusters list is drained */
> > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                  swap_reclaim_full_clusters(si, false);
> >
> >          if (order < PMD_ORDER) {
> > -                unsigned int frags = 0;
> > +                unsigned int frags = 0, frags_existing;
> >
> > -                while (!list_empty(&si->nonfull_clusters[order])) {
> > -                        ci = list_first_entry(&si->nonfull_clusters[order],
> > -                                              struct swap_cluster_info, list);
> > -                        cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > +                while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> >                          offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> >                                                           &found, order, usage);
> > -                        frags++;
> > +                        /*
> > +                         * With `fragmenting` set to true, it will surely take
>                               ~~~~~~~~~~~
>       wondering what 'fragmenting' means here.

This comment is a bit out of context indeed; it's actually trying to say
that the alloc_swap_scan_cluster call above should move the cluster to
the tail. I'll update the comment.

> > +                         * the cluster off nonfull list
> > +                         */
> >                          if (found)
> >                                  goto done;
> > +                        frags++;
> >                  }
> >
> > -                /*
> > -                 * Nonfull clusters are moved to frag tail if we reached
> > -                 * here, count them too, don't over scan the frag list.
> > -                 */
> > -                while (frags < si->frag_cluster_nr[order]) {
> > -                        ci = list_first_entry(&si->frag_clusters[order],
> > -                                              struct swap_cluster_info, list);
> > +                frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > +                while (frags < frags_existing &&
> > +                       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > +                        atomic_long_dec(&si->frag_cluster_nr[order]);
> >                          /*
> > -                         * Rotate the frag list to iterate, they were all failing
> > -                         * high order allocation or moved here due to per-CPU usage,
> > -                         * this help keeping usable cluster ahead.
> > +                         * Rotate the frag list to iterate, they were all
> > +                         * failing high order allocation or moved here due to
> > +                         * per-CPU usage, but they could contain newly released
> > +                         * reclaimable (eg. lazy-freed swap cache) slots.
> >                           */
> > -                        list_move_tail(&ci->list, &si->frag_clusters[order]);
> >                          offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> >                                                           &found, order, usage);
> > -                        frags++;
> >                          if (found)
> >                                  goto done;
> > +                        frags++;
> >                  }
> >          }
> >
> > -        if (!list_empty(&si->discard_clusters)) {
> > -                /*
> > -                 * we don't have free cluster but have some clusters in
> > -                 * discarding, do discard now and reclaim them, then
> > -                 * reread cluster_next_cpu since we dropped si->lock
> > -                 */
> > -                swap_do_scheduled_discard(si);
> > +        /*
> > +         * We don't have free cluster but have some clusters in
> > +         * discarding, do discard now and reclaim them, then
> > +         * reread cluster_next_cpu since we dropped si->lock
> > +         */
> > +        if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> >                  goto new_cluster;
> > -        }
> >
> >          if (order)
> >                  goto done;
> .....
> >