From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying"
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Chris Li's message of "Wed, 17 Jul 2024 08:41:07 -0700")
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
Date: Thu, 18 Jul 2024 15:53:49 +0800
Message-ID: <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com>

Chris Li writes:

> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts wrote:
>>
>> On 16/07/2024 23:46, Chris Li wrote:
>> > On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts wrote:
>> >>
>> >> On 11/07/2024 08:29, Chris Li wrote:
>> >>> Track the nonfull cluster as well as the empty cluster
>> >>> on lists. Each order has one nonfull cluster list.
>> >>>
>> >>> The cluster will remember which order it was used for during
>> >>> new cluster allocation.
>> >>>
>> >>> When the cluster has a free entry, add it to the nonfull[order]
>> >>> list. When the free cluster list is empty, also allocate
>> >>> from the nonfull list of that order.
>> >>>
>> >>> This improves the mTHP swap allocation success rate.
>> >>>
>> >>> There are limitations if the distribution of the numbers of
>> >>> different mTHP orders changes a lot, e.g. a lot of nonfull
>> >>> clusters are assigned to order A, while later there are many
>> >>> order B allocations but very few order A allocations. Currently
>> >>> a cluster used by order A will not be reused by order B unless
>> >>> the cluster becomes 100% empty.
>> >>>
>> >>> Signed-off-by: Chris Li
>> >>> ---
>> >>>  include/linux/swap.h |  4 ++++
>> >>>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
>> >>>  2 files changed, 35 insertions(+), 3 deletions(-)
>> >>>
>> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >>> index e9be95468fc7..db8d6000c116 100644
>> >>> --- a/include/linux/swap.h
>> >>> +++ b/include/linux/swap.h
>> >>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>> >>>         */
>> >>>        u16 count;
>> >>>        u8 flags;
>> >>> +      u8 order;
>> >>>        struct list_head list;
>> >>>  };
>> >>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>> >>>
>> >>>
>> >>>  /*
>> >>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>> >>>        unsigned long *zeromap;    /* vmalloc'ed bitmap to track zero pages */
>> >>>        struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> >>>        struct list_head free_clusters; /* free clusters list */
>> >>> +      struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>> >>> +                                 /* list of cluster that contains at least one free slot */
>> >>>        unsigned int lowest_bit;   /* index of first free in swap_map */
>> >>>        unsigned int highest_bit;  /* index of last free in swap_map */
>> >>>        unsigned int pages;        /* total of usable pages of swap */
>> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >>> index f70d25005d2c..e13a33664cfa 100644
>> >>> --- a/mm/swapfile.c
>> >>> +++ b/mm/swapfile.c
>> >>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> >>>        memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> >>>                        SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> >>>
>> >>> -      list_add_tail(&ci->list, &si->discard_clusters);
>> >>> +      if (ci->flags)
>> >>
>> >> I'm not sure this is future proof; what happens if a flag is added in future
>> >> that does not indicate that the cluster is on a list. Perhaps explicitly check
>> >> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
>> >
>> > Currently flags are only used to track which list it is on.
>>
>> Yes, I get that it works correctly at the moment. I just don't think it's wise
>> for the code to assume that any flag being set means it's on a list; that feels
>> fragile for the future.
>
> ACK.
>
>>
>> > BTW, this
>> > line has changed to check explicitly which list in patch 3, the big
>> > rewrite. I can move that line change to patch 2 if you want.
>>
>> That would get my vote; let's make every patch as good as it can be.
>
> Done.
>
>>
>> >
>> >>
>> >>> +              list_move_tail(&ci->list, &si->discard_clusters);
>> >>> +      else
>> >>> +              list_add_tail(&ci->list, &si->discard_clusters);
>> >>> +      ci->flags = 0;
>> >>
>> >> Bug: (I think?) the cluster ends up on the discard_clusters list and
>> >> swap_do_scheduled_discard() calls __free_cluster() which will then call
>> >
>> > swap_do_scheduled_discard() deletes the entry from the discard list.
>>
>> Ahh yes, my bad!
>>
>> > The flag does not track the discard list state.
>> >
>> >> list_add_tail() to put it on the free_clusters list.
>> >> But since it is on the discard_list at that point, shouldn't it call
>> >> list_move_tail()?
>> >
>> > See above. Calling list_move_tail() would be a mistake.
>> >
>> >>
>> >>>        schedule_work(&si->discard_work);
>> >>>  }
>> >>>
>> >>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>> >>>  {
>> >>> +      if (ci->flags & CLUSTER_FLAG_NONFULL)
>> >>> +              list_move_tail(&ci->list, &si->free_clusters);
>> >>> +      else
>> >>> +              list_add_tail(&ci->list, &si->free_clusters);
>> >>>        ci->flags = CLUSTER_FLAG_FREE;
>> >>> -      list_add_tail(&ci->list, &si->free_clusters);
>> >>>  }
>> >>>
>> >>>  /*
>> >>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>> >>>        ci->count--;
>> >>>
>> >>>        if (!ci->count)
>> >>> -              free_cluster(p, ci);
>> >>> +              return free_cluster(p, ci);
>> >>
>> >> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>> >> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>> >> into an `else if`.
>> >
>> > I try to avoid 'else if' if possible.
>> > Changed to
>> > if (!ci->count) {
>> >         free_cluster(p, ci);
>> >         return;
>> > }
>>
>> ok
>>
>> >
>> >>
>> >>> +
>> >>> +      if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>> +              list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>
>> >> I find the transitions when you add and remove a cluster from the
>> >> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >> to the list whenever there is at least one free swap entry, if not already on the
>> >> list. But you take it off the list when assigning it as the current cluster for
>> >> a cpu in scan_swap_map_try_ssd_cluster().
>> >>
>> >> So you could have this situation:
>> >>
>> >> - cpuA allocs cluster from free list (exclusive to that cpu)
>> >> - cpuA allocs 1 swap entry from current cluster
>> >> - swap entry is freed; cluster added to nonfull_clusters
>> >> - cpuB "allocs" cluster from nonfull_clusters
>> >>
>> >> At this point both cpuA and cpuB share the same cluster as their current
>> >> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >> allocation time (when removed from free_list) and only remove it from the
>> >
>> > The big rewrite on patch 3 does that, taking it off the free list and
>> > moving it into nonfull.
>>
>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> refactoring separated from behavioural changes.
>
> It is not a refactoring. It is a big rewrite of the swap allocator
> using the cluster. Behavior change is expected. The goal is completely
> removing the brute force scanning of the swap_map[] array for cluster swap
> allocation.
>
>>
>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> deferring review. But it sounds like it is actually required to realize the test
>> results quoted in the cover letter?
>
> Yes, it is required because it handles the case where try_ssd() previously
> failed. This big rewrite has gone through a lot of testing and bug
> fixing. It is pretty stable now. The only reason I keep it as RFC is
> because it is not feature complete. Currently it does not do swap
> cache reclaim. The next version will have swap cache reclaim and
> remove the RFC.
>
>> > I am only making the minimal change in this step so the big rewrite can land.
>> >
>> >> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >> of just sometimes. You would likely want to move the cluster to the end of the
>> >> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >> chances of multiple CPUs using the same cluster.
>> >
>> > For nonfull clusters it is less important to avoid multiple CPUs
>> > sharing the cluster, because the cluster already has swap entries
>> > allocated by the previous CPU.
>>
>> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>
> That can already happen with the per-CPU next pointer. When one CPU
> advances its next cluster pointer, it can cross the other
> CPU's next cluster pointer.

No.  si->percpu_cluster[cpu].next will stay within the current per-CPU
cluster only.  If it doesn't do that, we should fix it.

I agree with Ryan that we should make the per-CPU cluster handling
correct.  A cluster used as a per-CPU cluster shouldn't be put on the
nonfull list.  When we scan to the end of a per-CPU cluster, we can put
the cluster on the nonfull list if necessary.  And we should make it
correct in this patch instead of later in the series.  I understand
that you want to keep the patch itself simple, but it's important to
make the code simple to understand too.  A consistent design choice
will do that.

>> cpuB just ends up scanning and finding nothing to allocate. I think you do want to
>> share the cluster when you really need to, but try to avoid it if there are
>> other options, and I think moving the cluster to the end of the list might be a
>> way to help that?
>
> Simply moving it to the end of the list can create a possible dead loop
> when all clusters have been scanned and no available swap range is
> found.

This is another reason that we should put the cluster in
nonfull_clusters[order--] if there are no free swap entries of "order"
in the cluster.  It makes the design complex to keep it in
nonfull_clusters[order].

> We have tried many different approaches, including moving to the end of
> the list. It can cause more fragmentation because each CPU allocates
> its swap slot cache (64 entries) from a different cluster.
>
>> > Those behaviors will be fine
>> > tuned after the patch 3 big rewrite. Try to make this patch simple.
>
> Again, I want to keep it simple here so patch 3 can land.

>> >> Another potential optimization (which was in my hacked version IIRC) is to only
>> >> add/remove from the nonfull list when `total - count` crosses the (1 << order)
>> >> boundary rather than when becoming completely full. You definitely won't be able
>> >> to allocate order-2 if there are only 3 pages available, for example.
>> >
>> > That is in patch 3 as well. This patch is just doing the bare minimum
>> > to introduce the nonfull list.
>> >
>> >>
>> >>> +              ci->flags |= CLUSTER_FLAG_NONFULL;
>> >>> +      }
>> >>>  }
>> >>>
>> >>>  /*
>> >>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> >>>        if (tmp == SWAP_NEXT_INVALID) {
>> >>>                if (!list_empty(&si->free_clusters)) {
>> >>>                        ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> >>> +                      list_del(&ci->list);
>> >>> +                      spin_lock(&ci->lock);
>> >>> +                      ci->order = order;
>> >>> +                      ci->flags = 0;
>> >>> +                      spin_unlock(&ci->lock);
>> >>> +                      tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> >>> +              } else if (!list_empty(&si->nonfull_clusters[order])) {
>> >>> +                      ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>> >>> +                      list_del(&ci->list);
>> >>> +                      spin_lock(&ci->lock);
>> >>> +                      ci->flags = 0;
>> >>> +                      spin_unlock(&ci->lock);
>> >>>                        tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> >>>                } else if (!list_empty(&si->discard_clusters)) {
>> >>>                        /*
>> >>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >>>        ci = lock_cluster(si, offset);
>> >>>        memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>> >>>        ci->count = 0;
>> >>> +      ci->order = 0;
>> >>>        ci->flags = 0;
>> >>
>> >> Wonder if it would be better to put this in __free_cluster()?
>> >
>> > Both flags and order were moved to __free_cluster() in patch 3 of this
>> > series. The order is best assigned together with flags when the
>> > cluster changes the list.
>> >
>> > Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>>
>> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
>> the series? I'll try to take a look at it today.
>
> Yes, it is the cluster swap allocator big rewrite.
>
> Thank you for taking a look.

--
Best Regards,
Huang, Ying
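
The allocation fallback discussed in this thread -- take a cluster from
the free list first and dedicate it to the requested order, otherwise
reuse a cluster from the per-order nonfull list -- can be illustrated
with a small standalone sketch.  This is illustrative only: toy_cluster,
toy_alloc_cluster(), toy_mark_nonfull() and the singly linked lists are
simplified stand-ins for the kernel's struct swap_cluster_info,
list_head lists and scan_swap_map_try_ssd_cluster(), and the sketch
ignores locking, the discard list and per-CPU state.

#include <stdio.h>

#define NR_ORDERS      4      /* stand-in for SWAP_NR_ORDERS */
#define CLUSTER_PAGES  512    /* stand-in for SWAPFILE_CLUSTER */

/* Simplified stand-in for struct swap_cluster_info. */
struct toy_cluster {
        int count;                 /* allocated entries in this cluster */
        int order;                 /* order this cluster is dedicated to */
        struct toy_cluster *next;  /* singly linked list instead of list_head */
};

/* Stand-ins for si->free_clusters and si->nonfull_clusters[order]. */
static struct toy_cluster *free_list;
static struct toy_cluster *nonfull[NR_ORDERS];

static struct toy_cluster *pop(struct toy_cluster **head)
{
        struct toy_cluster *ci = *head;

        if (ci)
                *head = ci->next;
        return ci;
}

/*
 * Fallback order discussed in the thread: a free cluster is taken first
 * and dedicated to @order; otherwise reuse a nonfull cluster that was
 * previously used for the same order.
 */
static struct toy_cluster *toy_alloc_cluster(int order)
{
        struct toy_cluster *ci = pop(&free_list);

        if (ci) {
                ci->order = order;
                return ci;
        }
        return pop(&nonfull[order]);
}

/* Called when entries are freed and the cluster has room again. */
static void toy_mark_nonfull(struct toy_cluster *ci)
{
        ci->next = nonfull[ci->order];
        nonfull[ci->order] = ci;
}

int main(void)
{
        struct toy_cluster a = { 0, 0, NULL };
        struct toy_cluster *ci;

        free_list = &a;
        ci = toy_alloc_cluster(2);        /* comes from the free list */
        ci->count = CLUSTER_PAGES - 4;    /* mostly used, a few entries freed */
        toy_mark_nonfull(ci);             /* goes on nonfull[2] */
        printf("reused: %d\n", toy_alloc_cluster(2) == &a);  /* prints 1 */
        return 0;
}

The same ordering is what the quoted scan_swap_map_try_ssd_cluster()
hunk implements: si->free_clusters first, then
si->nonfull_clusters[order], then si->discard_clusters.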