Date: Wed, 8 Jan 2025 19:09:36 +0800
From: Baoquan He <bhe@redhat.com>
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
 Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner,
 Kalesh Singh, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
References: <20241230174621.61185-1-ryncsn@gmail.com>
 <20241230174621.61185-10-ryncsn@gmail.com>
In-Reply-To: <20241230174621.61185-10-ryncsn@gmail.com>

On 12/31/24 at 01:46am, Kairui Song wrote:
......snip.....
> ---
>  include/linux/swap.h |   3 +-
>  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
>  2 files changed, 246 insertions(+), 192 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 339d7f0192ff..c4ff31cb6bde 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -291,6 +291,7 @@ enum swap_cluster_flags {
>   * throughput.
>   */
>  struct percpu_cluster {
> +        local_lock_t lock; /* Protect the percpu_cluster above */
>          unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
> 
> @@ -313,7 +314,7 @@ struct swap_info_struct {
>          /* list of cluster that contains at least one free slot */
>          struct list_head frag_clusters[SWAP_NR_ORDERS];
>          /* list of cluster that are fragmented or contented */
> -        unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> +        atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
>          unsigned int pages;             /* total of usable pages of swap */
>          atomic_long_t inuse_pages;      /* number of those currently in use */
>          struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 7795a3d27273..dadd4fead689 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>          folio_ref_sub(folio, nr_pages);
>          folio_set_dirty(folio);
> 
> -        spin_lock(&si->lock);
>          /* Only sinple page folio can be backed by zswap */
>          if (nr_pages == 1)
>                  zswap_invalidate(entry);
>          swap_entry_range_free(si, entry, nr_pages);
> -        spin_unlock(&si->lock);
>          ret = nr_pages;
>  out_unlock:
>          folio_unlock(folio);
> @@ -403,7 +401,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> 
>  static inline bool cluster_is_free(struct swap_cluster_info *info)
>  {
> -        return info->flags == CLUSTER_FLAG_FREE;
> +        return info->count == 0;

This is a little confusing. Maybe we should add a new helper and call it
cluster_is_empty(), because discarded clusters are also able to pass the
check here. (A trivial sketch of what I mean is below, after the discard
hunk.)

> +}
> +
> +static inline bool cluster_is_discard(struct swap_cluster_info *info)
> +{
> +        return info->flags == CLUSTER_FLAG_DISCARD;
> +}
> +
> +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
> +{
> +        if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
> +                return false;
> +        if (!order)
> +                return true;
> +        return cluster_is_free(ci) || order == ci->order;
>  }
> 
>  static inline unsigned int cluster_index(struct swap_info_struct *si,
> @@ -440,19 +452,20 @@ static void cluster_move(struct swap_info_struct *si,
>  {
>          VM_WARN_ON(ci->flags == new_flags);
>          BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> +        lockdep_assert_held(&ci->lock);
> 
> -        if (ci->flags == CLUSTER_FLAG_NONE) {
> +        spin_lock(&si->lock);
> +        if (ci->flags == CLUSTER_FLAG_NONE)
>                  list_add_tail(&ci->list, list);
> -        } else {
> -                if (ci->flags == CLUSTER_FLAG_FRAG) {
> -                        VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> -                        si->frag_cluster_nr[ci->order]--;
> -                }
> +        else
>                  list_move_tail(&ci->list, list);
> -        }
> +        spin_unlock(&si->lock);
> +
> +        if (ci->flags == CLUSTER_FLAG_FRAG)
> +                atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> +        else if (new_flags == CLUSTER_FLAG_FRAG)
> +                atomic_long_inc(&si->frag_cluster_nr[ci->order]);
>          ci->flags = new_flags;
> -        if (new_flags == CLUSTER_FLAG_FRAG)
> -                si->frag_cluster_nr[ci->order]++;
>  }
> 
>  /* Add a cluster to discard list and schedule it to do discard */
> @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> 
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> -        lockdep_assert_held(&si->lock);
>          lockdep_assert_held(&ci->lock);
>          cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
>          ci->order = 0;
>  }
> 
> +/*
> + * Isolate and lock the first cluster that is not contented on a list,
> + * clean its flag before taken off-list. Cluster flag must be in sync
> + * with list status, so cluster updaters can always know the cluster
> + * list status without touching si lock.
> + *
> + * Note it's possible that all clusters on a list are contented so
> + * this returns NULL for an non-empty list.
> + */
> +static struct swap_cluster_info *cluster_isolate_lock(
> +                struct swap_info_struct *si, struct list_head *list)
> +{
> +        struct swap_cluster_info *ci, *ret = NULL;
> +
> +        spin_lock(&si->lock);
> +
> +        if (unlikely(!(si->flags & SWP_WRITEOK)))
> +                goto out;
> +
> +        list_for_each_entry(ci, list, list) {
> +                if (!spin_trylock(&ci->lock))
> +                        continue;
> +
> +                /* We may only isolate and clear flags of following lists */
> +                VM_BUG_ON(!ci->flags);
> +                VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> +                          ci->flags != CLUSTER_FLAG_FULL);
> +
> +                list_del(&ci->list);
> +                ci->flags = CLUSTER_FLAG_NONE;
> +                ret = ci;
> +                break;
> +        }
> +out:
> +        spin_unlock(&si->lock);
> +
> +        return ret;
> +}
> +
>  /*
>   * Doing discard actually. After a cluster discard is finished, the cluster
> - * will be added to free cluster list. caller should hold si->lock.
> -*/
> -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> + * will be added to free cluster list. Discard cluster is a bit special as
> + * they don't participate in allocation or reclaim, so clusters marked as
> + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> + */
> +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
>  {
>          struct swap_cluster_info *ci;
> +        bool ret = false;
>          unsigned int idx;
> 
> +        spin_lock(&si->lock);
>          while (!list_empty(&si->discard_clusters)) {
>                  ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> +                /*
> +                 * Delete the cluster from list but don't clear its flags until
> +                 * discard is done, so isolation and relocation will skip it.
> +                 */
>                  list_del(&ci->list);

I don't understand the above comment. ci has already been taken off the
list here, while allocation needs to isolate from a usable list. Even if
we cleared ci->flags now, how could isolation and relocation touch it?
I may be missing something here.

> -                /* Must clear flag when taking a cluster off-list */
> -                ci->flags = CLUSTER_FLAG_NONE;
>                  idx = cluster_index(si, ci);
>                  spin_unlock(&si->lock);
> -
>                  discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
>                                  SWAPFILE_CLUSTER);
> 
> -                spin_lock(&si->lock);
>                  spin_lock(&ci->lock);
> -                __free_cluster(si, ci);
> +                /*
> +                 * Discard is done, clear its flags as it's now off-list,
> +                 * then return the cluster to allocation list.
> +                 */
> +                ci->flags = CLUSTER_FLAG_NONE;
>                  memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>                                  0, SWAPFILE_CLUSTER);
> +                __free_cluster(si, ci);
>                  spin_unlock(&ci->lock);
> +                ret = true;
> +                spin_lock(&si->lock);
>          }
> +        spin_unlock(&si->lock);
> +        return ret;
>  }
> 
>  static void swap_discard_work(struct work_struct *work)
......snip....
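Back to the naming comment above: what I had in mind is only a trivial
helper like the sketch below. This is just my suggestion and not taken
from the patch, so the name and placement are up to you:

/* No used slot at all; note a discarded cluster also satisfies this. */
static inline bool cluster_is_empty(struct swap_cluster_info *info)
{
        return info->count == 0;
}

Then whichever check really means "free for allocation" can stay separate
from the plain "count == 0" test, which would avoid the confusion with
discarded clusters.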
> @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
>  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
>                                                unsigned char usage)
>  {
> -        struct percpu_cluster *cluster;
>          struct swap_cluster_info *ci;
>          unsigned int offset, found = 0;
> 
> -new_cluster:
> -        lockdep_assert_held(&si->lock);
> -        cluster = this_cpu_ptr(si->percpu_cluster);
> -        offset = cluster->next[order];
> +        /* Fast path using per CPU cluster */
> +        local_lock(&si->percpu_cluster->lock);
> +        offset = __this_cpu_read(si->percpu_cluster->next[order]);
>          if (offset) {
> -                offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> +                ci = lock_cluster(si, offset);
> +                /* Cluster could have been used by another order */
> +                if (cluster_is_usable(ci, order)) {
> +                        if (cluster_is_free(ci))
> +                                offset = cluster_offset(si, ci);
> +                        offset = alloc_swap_scan_cluster(si, offset, &found,
> +                                        order, usage);
> +                } else {
> +                        unlock_cluster(ci);
> +                }
>                  if (found)
>                          goto done;
>          }
> 
> -        if (!list_empty(&si->free_clusters)) {
> -                ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> -                offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> -                /*
> -                 * Either we didn't touch the cluster due to swapoff,
> -                 * or the allocation must success.
> -                 */
> -                VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> -                goto done;
> +new_cluster:
> +        ci = cluster_isolate_lock(si, &si->free_clusters);
> +        if (ci) {
> +                offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> +                                &found, order, usage);
> +                if (found)
> +                        goto done;
>          }
> 
>          /* Try reclaim from full clusters if free clusters list is drained */
> @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                  swap_reclaim_full_clusters(si, false);
> 
>          if (order < PMD_ORDER) {
> -                unsigned int frags = 0;
> +                unsigned int frags = 0, frags_existing;
> 
> -                while (!list_empty(&si->nonfull_clusters[order])) {
> -                        ci = list_first_entry(&si->nonfull_clusters[order],
> -                                              struct swap_cluster_info, list);
> -                        cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> +                while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
>                          offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
>                                          &found, order, usage);
> -                        frags++;
> +                        /*
> +                         * With `fragmenting` set to true, it will surely take
                                   ~~~~~~~~~~~
I am wondering what 'fragmenting' means here.

> +                         * the cluster off nonfull list
> +                         */
>                          if (found)
>                                  goto done;
> +                        frags++;
>                  }
> 
> -                /*
> -                 * Nonfull clusters are moved to frag tail if we reached
> -                 * here, count them too, don't over scan the frag list.
> -                 */
> -                while (frags < si->frag_cluster_nr[order]) {
> -                        ci = list_first_entry(&si->frag_clusters[order],
> -                                              struct swap_cluster_info, list);
> +                frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> +                while (frags < frags_existing &&
> +                       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> +                        atomic_long_dec(&si->frag_cluster_nr[order]);
>                          /*
> -                         * Rotate the frag list to iterate, they were all failing
> -                         * high order allocation or moved here due to per-CPU usage,
> -                         * this help keeping usable cluster ahead.
> +                         * Rotate the frag list to iterate, they were all
> +                         * failing high order allocation or moved here due to
> +                         * per-CPU usage, but they could contain newly released
> +                         * reclaimable (eg. lazy-freed swap cache) slots.
>                           */
> -                        list_move_tail(&ci->list, &si->frag_clusters[order]);
>                          offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
>                                          &found, order, usage);
> -                        frags++;
>                          if (found)
>                                  goto done;
> +                        frags++;
>                  }
>          }
> 
> -        if (!list_empty(&si->discard_clusters)) {
> -                /*
> -                 * we don't have free cluster but have some clusters in
> -                 * discarding, do discard now and reclaim them, then
> -                 * reread cluster_next_cpu since we dropped si->lock
> -                 */
> -                swap_do_scheduled_discard(si);
> +        /*
> +         * We don't have free cluster but have some clusters in
> +         * discarding, do discard now and reclaim them, then
> +         * reread cluster_next_cpu since we dropped si->lock
> +         */
> +        if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
>                  goto new_cluster;
> -        }
> 
>          if (order)
>                  goto done;
.....
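One more thing, just to confirm my reading of the new locking rule in the
fast path hunk above. Below is my own condensed sketch, not text from the
patch, with error handling and the scan itself dropped:

        /* allocation fast path: only the local per-CPU lock and ci->lock */
        local_lock(&si->percpu_cluster->lock);
        offset = __this_cpu_read(si->percpu_cluster->next[order]);
        ci = lock_cluster(si, offset);          /* ci->lock, no si->lock here */

        /* si->lock is only taken for list manipulation, e.g. in cluster_move() */
        spin_lock(&si->lock);
        list_move_tail(&ci->list, list);
        spin_unlock(&si->lock);

So si->lock degrades to protecting the cluster lists and the discard
draining; in cluster_move() it nests inside ci->lock, while
cluster_isolate_lock() takes si->lock first but only trylocks ci->lock.
Is that the intended ordering?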