Date: Fri, 10 Jan 2025 19:23:47 +0800
From: Baoquan He <bhe@redhat.com>
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
References: <20241230174621.61185-1-ryncsn@gmail.com>
	<20241230174621.61185-10-ryncsn@gmail.com>

On 01/09/25 at 10:15am, Kairui Song wrote:
> On Wed, Jan 8, 2025 at 7:10 PM Baoquan He wrote:
> >
>
> Thanks for the very detailed review!
>
> > On 12/31/24 at 01:46am, Kairui Song wrote:
> > ......snip.....
> > > ---
> > >  include/linux/swap.h |   3 +-
> > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > >   * throughput.
> > >   */
> > >  struct percpu_cluster {
> > > +	local_lock_t lock; /* Protect the percpu_cluster above */
> > >  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > >  };
> > >
> > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > >  	/* list of cluster that contains at least one free slot */
> > >  	struct list_head frag_clusters[SWAP_NR_ORDERS];
> > >  	/* list of cluster that are fragmented or contented */
> > > -	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > +	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > >  	unsigned int pages;		/* total of usable pages of swap */
> > >  	atomic_long_t inuse_pages;	/* number of those currently in use */
> > >  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 7795a3d27273..dadd4fead689 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
...snip...
> > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > >
> > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > >  {
> > > -	lockdep_assert_held(&si->lock);
> > >  	lockdep_assert_held(&ci->lock);
> > >  	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > >  	ci->order = 0;
> > >  }
> > >
> > > +/*
> > > + * Isolate and lock the first cluster that is not contented on a list,
> > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > + * with list status, so cluster updaters can always know the cluster
> > > + * list status without touching si lock.
> > > + *
> > > + * Note it's possible that all clusters on a list are contented so
> > > + * this returns NULL for an non-empty list.
> > > + */
> > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > +		struct swap_info_struct *si, struct list_head *list)
> > > +{
> > > +	struct swap_cluster_info *ci, *ret = NULL;
> > > +
> > > +	spin_lock(&si->lock);
> > > +
> > > +	if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > +		goto out;
> > > +
> > > +	list_for_each_entry(ci, list, list) {
> > > +		if (!spin_trylock(&ci->lock))
> > > +			continue;
> > > +
> > > +		/* We may only isolate and clear flags of following lists */
> > > +		VM_BUG_ON(!ci->flags);
> > > +		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > +			  ci->flags != CLUSTER_FLAG_FULL);
> > > +
> > > +		list_del(&ci->list);
> > > +		ci->flags = CLUSTER_FLAG_NONE;
> > > +		ret = ci;
> > > +		break;
> > > +	}
> > > +out:
> > > +	spin_unlock(&si->lock);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  /*
> > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > - * will be added to free cluster list. caller should hold si->lock.
> > > -*/
> > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > + */
> > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > >  {
> > >  	struct swap_cluster_info *ci;
> > > +	bool ret = false;
> > >  	unsigned int idx;
> > >
> > > +	spin_lock(&si->lock);
> > >  	while (!list_empty(&si->discard_clusters)) {
> > >  		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > +		/*
> > > +		 * Delete the cluster from list but don't clear its flags until
> > > +		 * discard is done, so isolation and relocation will skip it.
> > > +		 */
> > >  		list_del(&ci->list);
> >
> > I don't understand the above comment. ci has been taken off the list,
> > while allocation needs to isolate from a usable list. Even if we clear
> > ci->flags now, how could isolation and relocation touch it? I may be
> > missing something here.
>
> There are many cases; one possible and common situation is that the
> percpu cluster (si->percpu_cluster of another CPU) is still pointing
> to it.
>
> Also, this commit removed the protection of the si lock on allocation,
> and the allocation path may also drop the ci lock to call reclaim,
> which means one cluster could be used or freed by anyone before the
> allocator reacquires the ci lock. In that case, the allocator could
> see a discard cluster.
>
> So we don't clear the discard flag, in case anyone misuses it.
>
> I can add more inline comments on this; there are already some related
> comments above the function relocate_cluster, and I could add some
> more referencing them.

Thanks for your great explanation. I understand that si->percpu_cluster
could point to a discarded ci, and that a ci could be taken from the
nonfull or frag lists but later become discarded if it is freed on
another CPU during a cluster_reclaim_range() invocation. What I haven't
got is how isolation could see a discarded ci in cluster_isolate_lock().
Could you give an example of how that happens?

Surely, I understand that keeping the discard flag is very necessary so
that checks like cluster_is_usable() return the expected value.

And by the way, I haven't got when the 'if (!ci->count)' case could
happen in relocate_cluster(), since we have filtered away discarded ci
with the 'if (cluster_is_discard(ci))' check. I asked in another
thread; could you help explain it?

static void relocate_cluster(struct swap_info_struct *si,
			     struct swap_cluster_info *ci)
{
	lockdep_assert_held(&ci->lock);

	/* Discard cluster must remain off-list or on discard list */
	if (cluster_is_discard(ci))
		return;

	if (!ci->count) {
		free_cluster(si, ci);
	...
}

> > >
> > > -		/* Must clear flag when taking a cluster off-list */
> > > -		ci->flags = CLUSTER_FLAG_NONE;
> > >  		idx = cluster_index(si, ci);
> > >  		spin_unlock(&si->lock);
> > > -
> > >  		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > >  				SWAPFILE_CLUSTER);
> > >
> > > -		spin_lock(&si->lock);
> > >  		spin_lock(&ci->lock);
> > > -		__free_cluster(si, ci);
> > > +		/*
> > > +		 * Discard is done, clear its flags as it's now off-list,
> > > +		 * then return the cluster to allocation list.
> > > +		 */
> > > +		ci->flags = CLUSTER_FLAG_NONE;
> > >  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > >  				0, SWAPFILE_CLUSTER);
> > > +		__free_cluster(si, ci);
> > >  		spin_unlock(&ci->lock);
> > > +		ret = true;
> > > +		spin_lock(&si->lock);
> > >  	}
> > > +	spin_unlock(&si->lock);
> > > +	return ret;
> > >  }
> > >
> > >  static void swap_discard_work(struct work_struct *work)
> > ......snip....
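As a side note for anyone following the thread: the reason a stale
reference to a discarded cluster is harmless is that every user
re-checks the cluster flags under ci->lock before touching it. A rough
sketch of the helpers this relies on, paraphrased from this series
rather than quoted verbatim (the flag ordering follows the VM_BUG_ON in
cluster_isolate_lock() above), would be:

/* A discarded cluster keeps CLUSTER_FLAG_DISCARD until discard is done. */
static inline bool cluster_is_discard(struct swap_cluster_info *ci)
{
	return ci->flags == CLUSTER_FLAG_DISCARD;
}

/*
 * Flags above CLUSTER_FLAG_USABLE (e.g. FULL, DISCARD) mean the cluster
 * must not be allocated from, so a stale per-CPU hint that now points
 * at a discarded cluster fails this check and the allocator falls back
 * to the slow path instead of misusing the cluster.
 */
static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
{
	if (ci->flags > CLUSTER_FLAG_USABLE)
		return false;
	if (!order)
		return true;
	return cluster_is_free(ci) || order == ci->order;
}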
> > > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> > >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > >  					      unsigned char usage)
> > >  {
> > > -	struct percpu_cluster *cluster;
> > >  	struct swap_cluster_info *ci;
> > >  	unsigned int offset, found = 0;
> > >
> > > -new_cluster:
> > > -	lockdep_assert_held(&si->lock);
> > > -	cluster = this_cpu_ptr(si->percpu_cluster);
> > > -	offset = cluster->next[order];
> > > +	/* Fast path using per CPU cluster */
> > > +	local_lock(&si->percpu_cluster->lock);
> > > +	offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > >  	if (offset) {
> > > -		offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > > +		ci = lock_cluster(si, offset);
> > > +		/* Cluster could have been used by another order */
> > > +		if (cluster_is_usable(ci, order)) {
> > > +			if (cluster_is_free(ci))
> > > +				offset = cluster_offset(si, ci);
> > > +			offset = alloc_swap_scan_cluster(si, offset, &found,
> > > +							 order, usage);
> > > +		} else {
> > > +			unlock_cluster(ci);
> > > +		}
> > >  		if (found)
> > >  			goto done;
> > >  	}
> > >
> > > -	if (!list_empty(&si->free_clusters)) {
> > > -		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > > -		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > > -		/*
> > > -		 * Either we didn't touch the cluster due to swapoff,
> > > -		 * or the allocation must success.
> > > -		 */
> > > -		VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > > -		goto done;
> > > +new_cluster:
> > > +	ci = cluster_isolate_lock(si, &si->free_clusters);
> > > +	if (ci) {
> > > +		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > +						 &found, order, usage);
> > > +		if (found)
> > > +			goto done;
> > >  	}
> > >
> > >  	/* Try reclaim from full clusters if free clusters list is drained */
> > > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >  		swap_reclaim_full_clusters(si, false);
> > >
> > >  	if (order < PMD_ORDER) {
> > > -		unsigned int frags = 0;
> > > +		unsigned int frags = 0, frags_existing;
> > >
> > > -		while (!list_empty(&si->nonfull_clusters[order])) {
> > > -			ci = list_first_entry(&si->nonfull_clusters[order],
> > > -					      struct swap_cluster_info, list);
> > > -			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > > +		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> > >  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >  							 &found, order, usage);
> > > -			frags++;
> > > +			/*
> > > +			 * With `fragmenting` set to true, it will surely take
> >                                ~~~~~~~~~~~
> >
> > wondering what 'fragmenting' means here.
>
> This comment is a bit out of context indeed; it's actually trying to
> say that the alloc_swap_scan_cluster call above should move the
> cluster to the tail. I'll update the comment.
>
> > > +			 * the cluster off nonfull list
> > > +			 */
> > >  			if (found)
> > >  				goto done;
> > > +			frags++;
> > >  		}
> > >
> > > -		/*
> > > -		 * Nonfull clusters are moved to frag tail if we reached
> > > -		 * here, count them too, don't over scan the frag list.
> > > -		 */
> > > -		while (frags < si->frag_cluster_nr[order]) {
> > > -			ci = list_first_entry(&si->frag_clusters[order],
> > > -					      struct swap_cluster_info, list);
> > > +		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > > +		while (frags < frags_existing &&
> > > +		       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > > +			atomic_long_dec(&si->frag_cluster_nr[order]);
> > >  			/*
> > > -			 * Rotate the frag list to iterate, they were all failing
> > > -			 * high order allocation or moved here due to per-CPU usage,
> > > -			 * this help keeping usable cluster ahead.
> > > +			 * Rotate the frag list to iterate, they were all
> > > +			 * failing high order allocation or moved here due to
> > > +			 * per-CPU usage, but they could contain newly released
> > > +			 * reclaimable (eg. lazy-freed swap cache) slots.
> > >  			 */
> > > -			list_move_tail(&ci->list, &si->frag_clusters[order]);
> > >  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >  							 &found, order, usage);
> > > -			frags++;
> > >  			if (found)
> > >  				goto done;
> > > +			frags++;
> > >  		}
> > >  	}
> > >
> > > -	if (!list_empty(&si->discard_clusters)) {
> > > -		/*
> > > -		 * we don't have free cluster but have some clusters in
> > > -		 * discarding, do discard now and reclaim them, then
> > > -		 * reread cluster_next_cpu since we dropped si->lock
> > > -		 */
> > > -		swap_do_scheduled_discard(si);
> > > +	/*
> > > +	 * We don't have free cluster but have some clusters in
> > > +	 * discarding, do discard now and reclaim them, then
> > > +	 * reread cluster_next_cpu since we dropped si->lock
> > > +	 */
> > > +	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > >  		goto new_cluster;
> > > -	}
> > >
> > >  	if (order)
> > >  		goto done;
> > .....
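For completeness, the fast path in the hunk above is safe with only a
local_lock because si->percpu_cluster->next[] is only ever read or
written by the CPU that owns it: on non-RT kernels local_lock() just
disables preemption, and on PREEMPT_RT it becomes a per-CPU spinlock,
while any shared cluster state is still revalidated under ci->lock. A
minimal sketch of the pattern follows; the helper percpu_next_hint() is
made up for illustration, only struct percpu_cluster and the local_lock
calls mirror the patch:

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct percpu_cluster {
	local_lock_t lock;			/* protects next[] below */
	unsigned int next[SWAP_NR_ORDERS];	/* likely next allocation offset */
};

/* Hypothetical helper: read this CPU's allocation hint safely. */
static unsigned int percpu_next_hint(struct swap_info_struct *si, int order)
{
	unsigned int offset;

	/* Serializes against this CPU only; no cross-CPU contention. */
	local_lock(&si->percpu_cluster->lock);
	offset = __this_cpu_read(si->percpu_cluster->next[order]);
	/* ... caller then validates the hinted cluster under ci->lock ... */
	local_unlock(&si->percpu_cluster->lock);

	return offset;
}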