From: Chris Li
Date: Tue, 28 May 2024 15:27:13 -0700
Subject: Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list
To: Kairui Song
Cc: Andrew Morton, Ryan Roberts, "Huang, Ying", linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org> <20240524-swap-allocator-v1-1-47861b423b26@kernel.org>

Hi Kairui,

On Tue, May 28, 2024 at 9:24 AM Kairui Song wrote:
>
> On Sat, May 25, 2024 at 1:17 AM Chris Li wrote:
> >
> > Previously, the swap cluster used a cluster index as a pointer
> > to construct a custom single link list type "swap_cluster_list".
> > The next cluster pointer is shared with the cluster->count.
> > The assumption is that only the free cluster needs to be put
> > on the list.
> >
> > That assumption is not true for mTHP allocators any more. Need
> > to track the non full cluster on the list as well. Move the
> > current cluster single link list into standard double link list.
> >
> > Remove the cluster getter/setter for accessing the cluster
> > struct member. Move the cluster locking in the caller function
> > rather than the getter/setter function. That way the locking can
> > protect more than one member, e.g. cluster->flag.
> >
> > Change cluster code to use "struct swap_cluster_info *" to
> > reference the cluster rather than by using index. That is more
> > consistent with the list manipulation. It avoids the repeat
> > adding index to the cluser_info. The code is easier to understand.
> >
> > Remove the cluster next pointer is NULL flag, the double link
> > list can handle the empty list pretty well.
> >
> > The "swap_cluster_info" struct is two pointer bigger, because
> > 512 swap entries share one swap struct, it has very little impact
> > on the average memory usage per swap entry. Other than the list
> > conversion, there is no real function change in this patch.
> > ---
> >  include/linux/swap.h |  14 ++--
> >  mm/swapfile.c        | 231 ++++++++++++++-------------------------------------
> >  2 files changed, 68 insertions(+), 177 deletions(-)
> >
>
> Hi Chris,
>
> Thanks for this very nice clean up, the code is much easier to read.

Thanks for the review. See my comments below. I am working on a V2 to
address the two issues identified so far.

BTW, I am pretty happy that the patch stats show many more deletions
than insertions.
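For reference, the memory overhead claim in the commit message can be
checked with a small userspace sketch. This is not kernel code: the
struct list_head below is a stand-in with the same two-pointer layout,
and the helper names are made up for illustration. The 512 entries per
cluster figure is the one quoted in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (two pointers). */
struct list_head {
	struct list_head *next, *prev;
};

/* Extra bytes added to each cluster by embedding a list_head. */
static size_t cluster_list_overhead_bytes(void)
{
	return sizeof(struct list_head);
}

/* The same overhead amortized over the swap entries sharing one cluster. */
static double overhead_per_swap_entry(size_t entries_per_cluster)
{
	assert(entries_per_cluster > 0);	/* guard the division */
	return (double)cluster_list_overhead_bytes() / (double)entries_per_cluster;
}
```

With 8-byte pointers that is 16 bytes per cluster, so
overhead_per_swap_entry(512) comes out to about 0.03 bytes per swap
entry.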
>
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..0d3906eff3c9 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -254,11 +254,12 @@ struct swap_cluster_info {
> >  				 * elements correspond to the swap
> >  				 * cluster
> >  				 */
> > -	unsigned int data:24;
> > +	unsigned int count:16;
> >  	unsigned int flags:8;
> > +	struct list_head next;
> >  };
> >  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> > +
> >
> >  /*
> >   * The first page in the swap file is the swap header, which is always marked
> > @@ -283,11 +284,6 @@ struct percpu_cluster {
> >  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> >  };
> >
> > -struct swap_cluster_list {
> > -	struct swap_cluster_info head;
> > -	struct swap_cluster_info tail;
> > -};
> > -
> >  /*
> >   * The in-memory structure used to track swap areas.
> >   */
> > @@ -300,7 +296,7 @@ struct swap_info_struct {
> >  	unsigned int max;		/* extent of the swap_map */
> >  	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
> >  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > -	struct swap_cluster_list free_clusters; /* free clusters list */
> > +	struct list_head free_clusters; /* free clusters list */
> >  	unsigned int lowest_bit;	/* index of first free in swap_map */
> >  	unsigned int highest_bit;	/* index of last free in swap_map */
> >  	unsigned int pages;		/* total of usable pages of swap */
> > @@ -333,7 +329,7 @@ struct swap_info_struct {
> >  					 * list.
> >  					 */
> >  	struct work_struct discard_work; /* discard worker */
> > -	struct swap_cluster_list discard_clusters; /* discard clusters list */
> > +	struct list_head discard_clusters; /* discard clusters list */
> >  	struct plist_node avail_lists[]; /*
> >  					   * entries in swap_avail_heads, one
> >  					   * entry per node.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4f0e8b2ac8aa..205a60c5f9cb 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> >  #endif
> >  #define LATENCY_LIMIT 256
> >
> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
> > -		unsigned int flag)
> > -{
> > -	info->flags = flag;
> > -}
> > -
> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> > -{
> > -	return info->data;
> > -}
> > -
> > -static inline void cluster_set_count(struct swap_cluster_info *info,
> > -		unsigned int c)
> > -{
> > -	info->data = c;
> > -}
> > -
> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> > -		unsigned int c, unsigned int f)
> > -{
> > -	info->flags = f;
> > -	info->data = c;
> > -}
> > -
> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> > -{
> > -	return info->data;
> > -}
> > -
> > -static inline void cluster_set_next(struct swap_cluster_info *info,
> > -		unsigned int n)
> > -{
> > -	info->data = n;
> > -}
> > -
> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> > -		unsigned int n, unsigned int f)
> > -{
> > -	info->flags = f;
> > -	info->data = n;
> > -}
> > -
> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
> >  {
> >  	return info->flags & CLUSTER_FLAG_FREE;
> >  }
> >
> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
> > -{
> > -	return info->flags & CLUSTER_FLAG_NEXT_NULL;
> > -}
> > -
> > -static inline void cluster_set_null(struct swap_cluster_info *info)
> > -{
> > -	info->flags = CLUSTER_FLAG_NEXT_NULL;
> > -	info->data = 0;
> > -}
> > -
> >  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> >  		unsigned long offset)
> >  {
> > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> >  		spin_unlock(&si->lock);
> >  }
> >
> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> > -{
> > -	return cluster_is_null(&list->head);
> > -}
> > -
> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> > -{
> > -	return cluster_next(&list->head);
> > -}
> > -
> > -static void cluster_list_init(struct swap_cluster_list *list)
> > -{
> > -	cluster_set_null(&list->head);
> > -	cluster_set_null(&list->tail);
> > -}
> > -
> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
> > -		struct swap_cluster_info *ci,
> > -		unsigned int idx)
> > -{
> > -	if (cluster_list_empty(list)) {
> > -		cluster_set_next_flag(&list->head, idx, 0);
> > -		cluster_set_next_flag(&list->tail, idx, 0);
> > -	} else {
> > -		struct swap_cluster_info *ci_tail;
> > -		unsigned int tail = cluster_next(&list->tail);
> > -
> > -		/*
> > -		 * Nested cluster lock, but both cluster locks are
> > -		 * only acquired when we held swap_info_struct->lock
> > -		 */
> > -		ci_tail = ci + tail;
> > -		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> > -		cluster_set_next(ci_tail, idx);
> > -		spin_unlock(&ci_tail->lock);
> > -		cluster_set_next_flag(&list->tail, idx, 0);
> > -	}
> > -}
> > -
> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> > -		struct swap_cluster_info *ci)
> > -{
> > -	unsigned int idx;
> > -
> > -	idx = cluster_next(&list->head);
> > -	if (cluster_next(&list->tail) == idx) {
> > -		cluster_set_null(&list->head);
> > -		cluster_set_null(&list->tail);
> > -	} else
> > -		cluster_set_next_flag(&list->head,
> > -				cluster_next(&ci[idx]), 0);
> > -
> > -	return idx;
> > -}
> > -
> >  /* Add a cluster to discard list and schedule it to do discard */
> >  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > -		unsigned int idx)
> > +		struct swap_cluster_info *ci)
> >  {
> > +	unsigned int idx = ci - si->cluster_info;
> >  	/*
> >  	 * If scan_swap_map_slots() can't find a free cluster, it will check
> >  	 * si->swap_map directly. To make sure the discarding cluster isn't
> > @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > -	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> > -
> > +	spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING);
> > +	list_add_tail(&ci->next, &si->discard_clusters);
> > +	spin_unlock(&ci->lock);
> >  	schedule_work(&si->discard_work);
> >  }
> >
> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > -	struct swap_cluster_info *ci = si->cluster_info;
> > -
> > -	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> > -	cluster_list_add_tail(&si->free_clusters, ci, idx);
> > +	ci->flags = CLUSTER_FLAG_FREE;
> > +	list_add_tail(&ci->next, &si->free_clusters);
> >  }
> >
> >  /*
> > @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> >   */
> >  static void swap_do_scheduled_discard(struct swap_info_struct *si)
> >  {
> > -	struct swap_cluster_info *info, *ci;
> > +	struct swap_cluster_info *ci;
> >  	unsigned int idx;
> >
> > -	info = si->cluster_info;
> > -
> > -	while (!cluster_list_empty(&si->discard_clusters)) {
> > -		idx = cluster_list_del_first(&si->discard_clusters, info);
> > +	while (!list_empty(&si->discard_clusters)) {
> > +		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next);
> > +		idx = ci - si->cluster_info;
> >  		spin_unlock(&si->lock);
> >
> >  		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> >  				SWAPFILE_CLUSTER);
> >
> >  		spin_lock(&si->lock);
> > -		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > -		__free_cluster(si, idx);
> > +
> > +		spin_lock(&ci->lock);
> > +		__free_cluster(si, ci);
> >  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >  				0, SWAPFILE_CLUSTER);
> >  		unlock_cluster(ci);
> > @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> >  	complete(&si->comp);
> >  }
> >
> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> >  {
> > -	struct swap_cluster_info *ci = si->cluster_info;
> > +	struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> >
> > -	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> > -	cluster_list_del_first(&si->free_clusters, ci);
> > -	cluster_set_count_flag(ci + idx, 0, 0);
> > +	VM_BUG_ON(ci - si->cluster_info != idx);
> > +	list_del(&ci->next);
> > +	ci->count = 0;
> > +	ci->flags = 0;
> > +	return ci;
> >  }
> >
> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > -	struct swap_cluster_info *ci = si->cluster_info + idx;
> > -
> > -	VM_BUG_ON(cluster_count(ci) != 0);
> > +	VM_BUG_ON(ci->count != 0);
> >  	/*
> >  	 * If the swap is discardable, prepare discard the cluster
> >  	 * instead of free it immediately. The cluster will be freed
> > @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> >  	 */
> >  	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> >  	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> > -		swap_cluster_schedule_discard(si, idx);
> > +		swap_cluster_schedule_discard(si, ci);
> >  		return;
> >  	}
> >
> > -	__free_cluster(si, idx);
> > +	__free_cluster(si, ci);
> >  }
> >
> >  /*
> > @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> >  	unsigned long count)
> >  {
> >  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > +	struct swap_cluster_info *ci = cluster_info + idx;
> >
> >  	if (!cluster_info)
> >  		return;
> > -	if (cluster_is_free(&cluster_info[idx]))
> > +	if (cluster_is_free(ci))
> >  		alloc_cluster(p, idx);
> >
> > -	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> > -	cluster_set_count(&cluster_info[idx],
> > -		cluster_count(&cluster_info[idx]) + count);
> > +	VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> > +	ci->count += count;
> >  }
> >
> >  /*
> > @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> >  }
> >
> >  /*
> > - * The cluster corresponding to page_nr decreases one usage. If the usage
> > - * counter becomes 0, which means no page in the cluster is in using, we can
> > - * optionally discard the cluster and add it to free cluster list.
> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * which means no page in the cluster is in using, we can optionally discard
> > + * the cluster and add it to free cluster list.
> >   */
> > -static void dec_cluster_info_page(struct swap_info_struct *p,
> > -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> >  {
> > -	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > -
> > -	if (!cluster_info)
> > +	if (!p->cluster_info)
> >  		return;
> >
> > -	VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> > -	cluster_set_count(&cluster_info[idx],
> > -		cluster_count(&cluster_info[idx]) - 1);
> > +	VM_BUG_ON(ci->count == 0);
> > +	ci->count--;
> >
> > -	if (cluster_count(&cluster_info[idx]) == 0)
> > -		free_cluster(p, idx);
> > +	if (!ci->count)
> > +		free_cluster(p, ci);
> >  }
> >
> >  /*
> > @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> >  {
>
> This whole scan_swap_map_ssd_cluster_conflict function seems not
> needed now. free_clusters is a double linked list, so using a cluster
> in the middle won't corrupt the list. The comments are still for the
> old list design.

I was debating removing cluster_conflict() as well, and found that it
can't be removed until the order 0 allocations also use clusters. A
conflict can still occur because the order 0 allocations just do a
brute force scan of swap_map[] when try_ssd fails. That causes other
problems as well. As far as I can tell, the conflict can still happen.
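To make the condition concrete, here is a minimal userspace sketch of
the conflict test from the hunk quoted below. It is a simplified model,
not the kernel code: struct cluster, the list helpers, scan_conflicts()
and demo_conflict() are all made-up names, and the sketch checks for an
empty list before looking at the first entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SWAPFILE_CLUSTER 512	/* entries per cluster, as in the commit message */
#define CLUSTER_FLAG_FREE 1

struct list_head { struct list_head *next, *prev; };

struct cluster {		/* simplified stand-in for swap_cluster_info */
	unsigned int flags;
	struct list_head next;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_push_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Recover the cluster that embeds a given list node. */
#define cluster_of(ptr) \
	((struct cluster *)((char *)(ptr) - offsetof(struct cluster, next)))

/*
 * A scan at 'offset' conflicts when it lands in a cluster that is still
 * free but is not the first entry of the free list.
 */
static bool scan_conflicts(struct cluster *info, struct list_head *free_list,
			   unsigned long offset)
{
	unsigned long idx = offset / SWAPFILE_CLUSTER;
	struct cluster *first;

	assert(info && free_list);
	if (list_is_empty(free_list))
		return false;
	first = cluster_of(free_list->next);
	return idx != (unsigned long)(first - info) &&
	       (info[idx].flags & CLUSTER_FLAG_FREE);
}

/* Four clusters; 1 and 2 are free, with cluster 1 first on the free list. */
static bool demo_conflict(unsigned long offset)
{
	struct cluster info[4] = {0};
	struct list_head free_list;

	list_init(&free_list);
	info[1].flags = CLUSTER_FLAG_FREE;
	list_push_tail(&info[1].next, &free_list);
	info[2].flags = CLUSTER_FLAG_FREE;
	list_push_tail(&info[2].next, &free_list);
	return scan_conflicts(info, &free_list, offset);
}
```

In this model an offset inside cluster 2 conflicts (free, but not the
list head), while offsets inside cluster 1 (the list head) or cluster 0
(not free) do not.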
>
> >  	struct percpu_cluster *percpu_cluster;
> >  	bool conflict;
> > -
> > +	struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> >  	offset /= SWAPFILE_CLUSTER;
> > -	conflict = !cluster_list_empty(&si->free_clusters) &&
> > -		offset != cluster_list_first(&si->free_clusters) &&
> > +	conflict = !list_empty(&si->free_clusters) &&
> > +		offset != first - si->cluster_info &&
> >  		cluster_is_free(&si->cluster_info[offset]);
> >
> >  	if (!conflict)
> > @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >  	cluster = this_cpu_ptr(si->percpu_cluster);
> >  	tmp = cluster->next[order];
> >  	if (tmp == SWAP_NEXT_INVALID) {
> > -		if (!cluster_list_empty(&si->free_clusters)) {
> > -			tmp = cluster_next(&si->free_clusters.head) *
> > -					SWAPFILE_CLUSTER;
> > -		} else if (!cluster_list_empty(&si->discard_clusters)) {
> > +		if (!list_empty(&si->free_clusters)) {
> > +			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> > +			list_del(&ci->next);
> > +			spin_lock(&ci->lock);
>
> Shouldn't this list_del also be protected by ci->lock? It was
> protected in alloc_cluster before, keeping the flag synced with
> cluster status so cluster_is_free won't return false positive.

The list add and list del are protected by si->lock, not by the
cluster lock. Previously I wanted to use cluster->lock to protect them,
then realized that adding/deleting a cluster to/from the list changes
three clusters (current, prev, next), so we would need to take three
cluster locks. We might change to a per-list spinlock, e.g. one lock
per list, to reduce the contention on si->lock. However, taking only
one cluster lock is not enough.
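A small userspace sketch of why one cluster lock does not cover the
list update: unlinking a node from a doubly linked list writes to the
previous and the next node as well. The helpers below are illustrative
stand-ins, not the kernel's list.h (the kernel's list_del() poisons the
removed node's pointers instead of pointing them at itself).

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_push_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink 'e': note the writes to both neighbours, not only to 'e'. */
static void list_unlink(struct list_head *e)
{
	assert(e->next && e->prev);
	e->prev->next = e->next;	/* write to the previous node */
	e->next->prev = e->prev;	/* write to the next node */
	e->next = e->prev = e;		/* reset the removed node */
}

/* Build head -> a -> b -> c, remove b, count how many of a, b, c changed. */
static int nodes_modified_by_unlinking_middle(void)
{
	struct list_head head, a, b, c;
	struct list_head a0, b0, c0;
	int changed = 0;

	list_init(&head);
	list_push_tail(&a, &head);
	list_push_tail(&b, &head);
	list_push_tail(&c, &head);
	a0 = a;
	b0 = b;
	c0 = c;

	list_unlink(&b);

	changed += (a.next != a0.next || a.prev != a0.prev);
	changed += (b.next != b0.next || b.prev != b0.prev);
	changed += (c.next != c0.next || c.prev != c0.prev);
	return changed;
}
```

All three nodes end up modified, so a lock that lives in the removed
cluster alone cannot serialize the update against its neighbours; a
lock covering the whole list (si->lock today) can.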
>
> > +			ci->flags = 0;
> > +			spin_unlock(&ci->lock);
> > +			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> > +		} else if (!list_empty(&si->discard_clusters)) {
> >  			/*
> >  			 * we don't have free cluster but have some clusters in
> >  			 * discarding, do discard now and reclaim them, then
> > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >  			goto new_cluster;
> >  		} else
> >  			return false;
> > -	}
> > +	} else
> > +		ci = si->cluster_info + tmp;
>
> This "else ci = ..." seems wrong, tmp is not an array index, and not
> needed either.

Yes, there is a bug there, pointed out by OPPO as well. It should be

    ci = si->cluster_info + (tmp / SWAPFILE_CLUSTER);

"tmp" is needed because "tmp", i.e. "cluster->next[order]", keeps track
of the current cluster allocation offset in the per-CPU cluster struct.

BTW, in my V2 I have renamed "tmp" to "offset" and the previous
"offset" to "retoffset" to make this more obvious. "tmp" does not say
much about what it really does.
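As a quick illustration of the fix, here is a userspace sketch of the
two conversions involved: swap offset to cluster pointer, and cluster
pointer back to the first offset of that cluster. struct cluster and
the function names are made up for this example; only the arithmetic
mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SWAPFILE_CLUSTER 512	/* entries per cluster, as in the commit message */

struct cluster {		/* simplified stand-in for swap_cluster_info */
	unsigned int count;
	unsigned int flags;
};

/* Swap offset -> cluster: divide by the cluster size before indexing. */
static struct cluster *offset_to_cluster(struct cluster *info, unsigned long offset)
{
	assert(info != NULL);
	return info + offset / SWAPFILE_CLUSTER;
}

/* Cluster -> first swap offset covered by that cluster. */
static unsigned long cluster_to_offset(const struct cluster *info,
				       const struct cluster *ci)
{
	return (unsigned long)(ci - info) * SWAPFILE_CLUSTER;
}

/* Index of the cluster that a given offset falls into. */
static size_t demo_cluster_index(unsigned long offset)
{
	static struct cluster info[8];

	return (size_t)(offset_to_cluster(info, offset) - info);
}

/* Round trip: offset -> cluster -> start offset of that cluster. */
static unsigned long demo_cluster_start(unsigned long offset)
{
	static struct cluster info[8];

	return cluster_to_offset(info, offset_to_cluster(info, offset));
}
```

For example, offset 1030 maps to cluster index 2, whose first offset is
1024. Using the offset directly as an index, as the V1 line did, would
point far past that cluster.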

Chris

>
> >
> >  	/*
> >  	 * Other CPUs can use our cluster if they can't find a free cluster,
> > @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >
> >  	ci = lock_cluster(si, offset);
> >  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > -	cluster_set_count_flag(ci, 0, 0);
> > -	free_cluster(si, idx);
> > +	ci->count = 0;
> > +	ci->flags = 0;
> > +	free_cluster(si, ci);
> >  	unlock_cluster(ci);
> >  	swap_range_free(si, offset, SWAPFILE_CLUSTER);
> >  }
> > @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> >  	count = p->swap_map[offset];
> >  	VM_BUG_ON(count != SWAP_HAS_CACHE);
> >  	p->swap_map[offset] = 0;
> > -	dec_cluster_info_page(p, p->cluster_info, offset);
> > +	dec_cluster_info_page(p, ci);
> >  	unlock_cluster(ci);
> >
> >  	mem_cgroup_uncharge_swap(entry, 1);
> > @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >
> >  	nr_good_pages = maxpages - 1;	/* omit header page */
> >
> > -	cluster_list_init(&p->free_clusters);
> > -	cluster_list_init(&p->discard_clusters);
> > +	INIT_LIST_HEAD(&p->free_clusters);
> > +	INIT_LIST_HEAD(&p->discard_clusters);
> >
> >  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >  		unsigned int page_nr = swap_header->info.badpages[i];
> > @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >  	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> >  		j = (k + col) % SWAP_CLUSTER_COLS;
> >  		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> > +			struct swap_cluster_info *ci;
> >  			idx = i * SWAP_CLUSTER_COLS + j;
> > +			ci = cluster_info + idx;
> >  			if (idx >= nr_clusters)
> >  				continue;
> > -			if (cluster_count(&cluster_info[idx]))
> > +			if (ci->count)
> >  				continue;
> > -			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> > -			cluster_list_add_tail(&p->free_clusters, cluster_info,
> > -					idx);
> > +			ci->flags = CLUSTER_FLAG_FREE;
> > +			list_add_tail(&ci->next, &p->free_clusters);
> >  		}
> >  	}
> >  	return nr_extents;
> >
> > --
> > 2.45.1.288.g0e0cd299f1-goog
> >
>