Date: Fri, 10 Jan 2025 19:23:47 +0800
From: Baoquan He <bhe@redhat.com>
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
References: <20241230174621.61185-1-ryncsn@gmail.com>
	<20241230174621.61185-10-ryncsn@gmail.com>

On 01/09/25 at 10:15am, Kairui Song wrote:
> On Wed, Jan 8, 2025 at 7:10 PM Baoquan He wrote:
> >
>
> Thanks for the very detailed review!
>
> > On 12/31/24 at 01:46am, Kairui Song wrote:
> > ......snip.....
> > > ---
> > >  include/linux/swap.h |   3 +-
> > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > >   * throughput.
> > >   */
> > >  struct percpu_cluster {
> > > +	local_lock_t lock; /* Protect the percpu_cluster above */
> > >  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > >  };
> > >
> > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > >  	/* list of cluster that contains at least one free slot */
> > >  	struct list_head frag_clusters[SWAP_NR_ORDERS];
> > >  	/* list of cluster that are fragmented or contented */
> > > -	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > +	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > >  	unsigned int pages;		/* total of usable pages of swap */
> > >  	atomic_long_t inuse_pages;	/* number of those currently in use */
> > >  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 7795a3d27273..dadd4fead689 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
...snip...
> > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > >
> > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > >  {
> > > -	lockdep_assert_held(&si->lock);
> > >  	lockdep_assert_held(&ci->lock);
> > >  	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > >  	ci->order = 0;
> > >  }
> > >
> > > +/*
> > > + * Isolate and lock the first cluster that is not contented on a list,
> > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > + * with list status, so cluster updaters can always know the cluster
> > > + * list status without touching si lock.
> > > + *
> > > + * Note it's possible that all clusters on a list are contented so
> > > + * this returns NULL for an non-empty list.
> > > + */
> > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > +		struct swap_info_struct *si, struct list_head *list)
> > > +{
> > > +	struct swap_cluster_info *ci, *ret = NULL;
> > > +
> > > +	spin_lock(&si->lock);
> > > +
> > > +	if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > +		goto out;
> > > +
> > > +	list_for_each_entry(ci, list, list) {
> > > +		if (!spin_trylock(&ci->lock))
> > > +			continue;
> > > +
> > > +		/* We may only isolate and clear flags of following lists */
> > > +		VM_BUG_ON(!ci->flags);
> > > +		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > +			  ci->flags != CLUSTER_FLAG_FULL);
> > > +
> > > +		list_del(&ci->list);
> > > +		ci->flags = CLUSTER_FLAG_NONE;
> > > +		ret = ci;
> > > +		break;
> > > +	}
> > > +out:
> > > +	spin_unlock(&si->lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  /*
> > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > - * will be added to free cluster list. caller should hold si->lock.
> > > -*/
> > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > + */
> > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > >  {
> > >  	struct swap_cluster_info *ci;
> > > +	bool ret = false;
> > >  	unsigned int idx;
> > >
> > > +	spin_lock(&si->lock);
> > >  	while (!list_empty(&si->discard_clusters)) {
> > >  		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > +		/*
> > > +		 * Delete the cluster from list but don't clear its flags until
> > > +		 * discard is done, so isolation and relocation will skip it.
> > > +		 */
> > >  		list_del(&ci->list);
> >
> > I don't understand the above comment. ci has been taken off the list,
> > while allocation needs to isolate from a usable list. Even if we clear
> > ci->flags now, how could isolation and relocation touch it? I may be
> > missing something here.
>
> There are many cases; one possible and common situation is that the
> percpu cluster (si->percpu_cluster of another CPU) is still pointing
> to it.
>
> Also, this commit removed the protection of the si lock on allocation,
> and the allocation path may also drop the ci lock to call reclaim,
> which means one cluster could be used or freed by anyone before the
> allocator reacquires the ci lock. In that case, the allocator could
> see a discard cluster.
>
> So we don't clear the discard flag, in case anyone misuses it.
>
> I can add more inline comments on this; there are already some related
> comments above the function relocate_cluster, and I could add some
> more referencing them.

Thanks for your great explanation. I understand that si->percpu_cluster
could point to a discarded ci, and that a ci could be taken from the
nonfull or frag lists but later become discarded if it is freed on
another CPU during a cluster_reclaim_range() invocation. What I haven't
got is how isolation could see a discarded ci in cluster_isolate_lock().
Could you give an example of how that happens?

Surely, I understand that keeping the discard flag is very necessary so
that checks like cluster_is_usable() return the expected value.

And by the way, I haven't got when the 'if (!ci->count)' case could
happen in relocate_cluster(), since we have filtered away discarded ci
with the 'if (cluster_is_discard(ci))' check. I asked in another
thread; could you help explain it?

static void relocate_cluster(struct swap_info_struct *si,
			     struct swap_cluster_info *ci)
{
	lockdep_assert_held(&ci->lock);

	/* Discard cluster must remain off-list or on discard list */
	if (cluster_is_discard(ci))
		return;

	if (!ci->count) {
		free_cluster(si, ci);
	...
}

> > >
> > > -		/* Must clear flag when taking a cluster off-list */
> > > -		ci->flags = CLUSTER_FLAG_NONE;
> > >  		idx = cluster_index(si, ci);
> > >  		spin_unlock(&si->lock);
> > > -
> > >  		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > >  				SWAPFILE_CLUSTER);
> > >
> > > -		spin_lock(&si->lock);
> > >  		spin_lock(&ci->lock);
> > > -		__free_cluster(si, ci);
> > > +		/*
> > > +		 * Discard is done, clear its flags as it's now off-list,
> > > +		 * then return the cluster to allocation list.
> > > +		 */
> > > +		ci->flags = CLUSTER_FLAG_NONE;
> > >  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > >  				0, SWAPFILE_CLUSTER);
> > > +		__free_cluster(si, ci);
> > >  		spin_unlock(&ci->lock);
> > > +		ret = true;
> > > +		spin_lock(&si->lock);
> > >  	}
> > > +	spin_unlock(&si->lock);
> > > +	return ret;
> > >  }
> > >
> > >  static void swap_discard_work(struct work_struct *work)
> > ......snip....
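As a side note for anyone following the thread: the reason a stale
reference to a discarded cluster is harmless is that every user
re-checks the cluster flags under ci->lock before touching it. A rough
sketch of the helpers this relies on, paraphrased from this series
rather than quoted verbatim (the flag ordering follows the VM_BUG_ON in
cluster_isolate_lock() above), would be:

/* A discarded cluster keeps CLUSTER_FLAG_DISCARD until discard is done. */
static inline bool cluster_is_discard(struct swap_cluster_info *ci)
{
	return ci->flags == CLUSTER_FLAG_DISCARD;
}

/*
 * Flags above CLUSTER_FLAG_USABLE (e.g. FULL, DISCARD) mean the cluster
 * must not be allocated from, so a stale per-CPU hint that now points
 * at a discarded cluster fails this check and the allocator falls back
 * to the slow path instead of misusing the cluster.
 */
static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
{
	if (ci->flags > CLUSTER_FLAG_USABLE)
		return false;
	if (!order)
		return true;
	return cluster_is_free(ci) || order == ci->order;
}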
> > > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> > >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > >  					      unsigned char usage)
> > >  {
> > > -	struct percpu_cluster *cluster;
> > >  	struct swap_cluster_info *ci;
> > >  	unsigned int offset, found = 0;
> > >
> > > -new_cluster:
> > > -	lockdep_assert_held(&si->lock);
> > > -	cluster = this_cpu_ptr(si->percpu_cluster);
> > > -	offset = cluster->next[order];
> > > +	/* Fast path using per CPU cluster */
> > > +	local_lock(&si->percpu_cluster->lock);
> > > +	offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > >  	if (offset) {
> > > -		offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > > +		ci = lock_cluster(si, offset);
> > > +		/* Cluster could have been used by another order */
> > > +		if (cluster_is_usable(ci, order)) {
> > > +			if (cluster_is_free(ci))
> > > +				offset = cluster_offset(si, ci);
> > > +			offset = alloc_swap_scan_cluster(si, offset, &found,
> > > +							 order, usage);
> > > +		} else {
> > > +			unlock_cluster(ci);
> > > +		}
> > >  		if (found)
> > >  			goto done;
> > >  	}
> > >
> > > -	if (!list_empty(&si->free_clusters)) {
> > > -		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > > -		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > > -		/*
> > > -		 * Either we didn't touch the cluster due to swapoff,
> > > -		 * or the allocation must success.
> > > -		 */
> > > -		VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > > -		goto done;
> > > +new_cluster:
> > > +	ci = cluster_isolate_lock(si, &si->free_clusters);
> > > +	if (ci) {
> > > +		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > +						 &found, order, usage);
> > > +		if (found)
> > > +			goto done;
> > >  	}
> > >
> > >  	/* Try reclaim from full clusters if free clusters list is drained */
> > > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >  		swap_reclaim_full_clusters(si, false);
> > >
> > >  	if (order < PMD_ORDER) {
> > > -		unsigned int frags = 0;
> > > +		unsigned int frags = 0, frags_existing;
> > >
> > > -		while (!list_empty(&si->nonfull_clusters[order])) {
> > > -			ci = list_first_entry(&si->nonfull_clusters[order],
> > > -					      struct swap_cluster_info, list);
> > > -			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > > +		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> > >  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >  							 &found, order, usage);
> > > -			frags++;
> > > +			/*
> > > +			 * With `fragmenting` set to true, it will surely take
> >                                ~~~~~~~~~~~
> >
> > wondering what 'fragmenting' means here.
>
> This comment is a bit out of context indeed; it's actually trying to
> say that the alloc_swap_scan_cluster call above should move the
> cluster to the tail. I'll update the comment.
>
> > > +			 * the cluster off nonfull list
> > > +			 */
> > >  			if (found)
> > >  				goto done;
> > > +			frags++;
> > >  		}
> > >
> > > -		/*
> > > -		 * Nonfull clusters are moved to frag tail if we reached
> > > -		 * here, count them too, don't over scan the frag list.
> > > -		 */
> > > -		while (frags < si->frag_cluster_nr[order]) {
> > > -			ci = list_first_entry(&si->frag_clusters[order],
> > > -					      struct swap_cluster_info, list);
> > > +		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > > +		while (frags < frags_existing &&
> > > +		       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > > +			atomic_long_dec(&si->frag_cluster_nr[order]);
> > >  			/*
> > > -			 * Rotate the frag list to iterate, they were all failing
> > > -			 * high order allocation or moved here due to per-CPU usage,
> > > -			 * this help keeping usable cluster ahead.
> > > +			 * Rotate the frag list to iterate, they were all
> > > +			 * failing high order allocation or moved here due to
> > > +			 * per-CPU usage, but they could contain newly released
> > > +			 * reclaimable (eg. lazy-freed swap cache) slots.
> > >  			 */
> > > -			list_move_tail(&ci->list, &si->frag_clusters[order]);
> > >  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >  							 &found, order, usage);
> > > -			frags++;
> > >  			if (found)
> > >  				goto done;
> > > +			frags++;
> > >  		}
> > >  	}
> > >
> > > -	if (!list_empty(&si->discard_clusters)) {
> > > -		/*
> > > -		 * we don't have free cluster but have some clusters in
> > > -		 * discarding, do discard now and reclaim them, then
> > > -		 * reread cluster_next_cpu since we dropped si->lock
> > > -		 */
> > > -		swap_do_scheduled_discard(si);
> > > +	/*
> > > +	 * We don't have free cluster but have some clusters in
> > > +	 * discarding, do discard now and reclaim them, then
> > > +	 * reread cluster_next_cpu since we dropped si->lock
> > > +	 */
> > > +	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > >  		goto new_cluster;
> > > -	}
> > >
> > >  	if (order)
> > >  		goto done;
> > .....
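For completeness, the fast path in the hunk above is safe with only a
local_lock because si->percpu_cluster->next[] is only ever read or
written by the CPU that owns it: on non-RT kernels local_lock() just
disables preemption, and on PREEMPT_RT it becomes a per-CPU spinlock,
while any shared cluster state is still revalidated under ci->lock. A
minimal sketch of the pattern follows; the helper percpu_next_hint() is
made up for illustration, only struct percpu_cluster and the local_lock
calls mirror the patch:

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct percpu_cluster {
	local_lock_t lock;			/* protects next[] below */
	unsigned int next[SWAP_NR_ORDERS];	/* likely next allocation offset */
};

/* Hypothetical helper: read this CPU's allocation hint safely. */
static unsigned int percpu_next_hint(struct swap_info_struct *si, int order)
{
	unsigned int offset;

	/* Serializes against this CPU only; no cross-CPU contention. */
	local_lock(&si->percpu_cluster->lock);
	offset = __this_cpu_read(si->percpu_cluster->next[order]);
	/* ... caller then validates the hinted cluster under ci->lock ... */
	local_unlock(&si->percpu_cluster->lock);

	return offset;
}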