From: Kairui Song <ryncsn@gmail.com>
Date: Mon, 13 Jan 2025 14:33:12 +0800
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
To: Baoquan He
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel@vger.kernel.org
References: <20241230174621.61185-1-ryncsn@gmail.com> <20241230174621.61185-10-ryncsn@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Fri, Jan 10, 2025 at 7:24 PM Baoquan He wrote:
>
> On 01/09/25 at 10:15am, Kairui Song wrote:
> > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He wrote:
> > >
> >
> > Thanks for the very detailed review!
> >
> > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > ......snip.....
> > > > ---
> > > >  include/linux/swap.h |   3 +-
> > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > >
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > >   * throughput.
> > > >   */
> > > >  struct percpu_cluster {
> > > > +	local_lock_t lock; /* Protect the percpu_cluster above */
> > > >  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > >  };
> > > >
> > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > >  					/* list of cluster that contains at least one free slot */
> > > >  	struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > >  					/* list of cluster that are fragmented or contented */
> > > > -	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > +	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > >  	unsigned int pages;		/* total of usable pages of swap */
> > > >  	atomic_long_t inuse_pages;	/* number of those currently in use */
> > > >  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index 7795a3d27273..dadd4fead689 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> ...snip...
> > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > >
> > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > >  {
> > > > -	lockdep_assert_held(&si->lock);
> > > >  	lockdep_assert_held(&ci->lock);
> > > >  	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > >  	ci->order = 0;
> > > >  }
> > > >
> > > > +/*
> > > > + * Isolate and lock the first cluster that is not contented on a list,
> > > > + * clean its flag before taken off-list.
> > > > + * Cluster flag must be in sync
> > > > + * with list status, so cluster updaters can always know the cluster
> > > > + * list status without touching si lock.
> > > > + *
> > > > + * Note it's possible that all clusters on a list are contented so
> > > > + * this returns NULL for a non-empty list.
> > > > + */
> > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > +		struct swap_info_struct *si, struct list_head *list)
> > > > +{
> > > > +	struct swap_cluster_info *ci, *ret = NULL;
> > > > +
> > > > +	spin_lock(&si->lock);
> > > > +
> > > > +	if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > +		goto out;
> > > > +
> > > > +	list_for_each_entry(ci, list, list) {
> > > > +		if (!spin_trylock(&ci->lock))
> > > > +			continue;
> > > > +
> > > > +		/* We may only isolate and clear flags of following lists */
> > > > +		VM_BUG_ON(!ci->flags);
> > > > +		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > +			  ci->flags != CLUSTER_FLAG_FULL);
> > > > +
> > > > +		list_del(&ci->list);
> > > > +		ci->flags = CLUSTER_FLAG_NONE;
> > > > +		ret = ci;
> > > > +		break;
> > > > +	}
> > > > +out:
> > > > +	spin_unlock(&si->lock);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > -*/
> > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > > + */
> > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > >  {
> > > >  	struct swap_cluster_info *ci;
> > > > +	bool ret = false;
> > > >  	unsigned int idx;
> > > >
> > > > +	spin_lock(&si->lock);
> > > >  	while (!list_empty(&si->discard_clusters)) {
> > > >  		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > +		/*
> > > > +		 * Delete the cluster from list but don't clear its flags until
> > > > +		 * discard is done, so isolation and relocation will skip it.
> > > > +		 */
> > > >  		list_del(&ci->list);
> > >
> > > I don't understand the above comment. ci has been taken off the list,
> > > while allocation needs to isolate from a usable list. Even though we
> > > clear ci->flags now, how come isolation and relocation will touch it?
> > > I may be missing something here.
> >
> > There are many cases; one possible and common situation is that the
> > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > to it.
> >
> > Also, this commit removed the protection of si lock on allocation, and
> > the allocation path may also drop the ci lock to call reclaim, which
> > means one cluster could be used or freed by anyone before the
> > allocator reacquires the ci lock. In that case, the allocator could
> > see a discard cluster.
> >
> > So we don't clear the discard flag, in case anyone misuses it.
> >
> > I can add more inline comments on this; there are already some related
> > comments above the function relocate_cluster, could add some more
> > referencing that.

Hi Baoquan,

> Thanks for your great explanation. I understand that si->percpu_cluster
> could point to a discarded ci, and a ci could be got from the non-full
> or frag lists but later become discarded if that ci is freed on another
> cpu during the cluster_reclaim_range() invocation. I haven't got how
> isolation could see a discarded ci in cluster_isolate_lock(). Could you
> help give an example of how that happens?
cluster_isolate_lock shouldn't see a discard cluster, and there is a
VM_BUG_ON for that.

> Surely, I understand keeping the discarded flag is very necessary so
> that checks like cluster_is_usable() will return the expected value.
>
> And by the way, I haven't got when the 'if (!ci->count)' case could
> happen in relocate_cluster() since we have filtered away discarded ci
> with the 'if (cluster_is_discard(ci))' check. I asked in another
> thread, could you help explain it?

Many swap devices don't need discard, so the cluster could be freed
directly. And actually the ci->count check in relocate_cluster is not
necessarily related to that: the caller of relocate_cluster may fail an
allocation (e.g. a race with swapoff), and that could end up calling
relocate_cluster with an empty cluster; such a cluster should go to the
free list (swapoff might fail too). The swapoff case is extremely rare,
but let's just be more robust here; covering the free cluster case has
almost no overhead but saves a lot of effort. I can add some comments
on this.

>
> static void relocate_cluster(struct swap_info_struct *si,
> 		struct swap_cluster_info *ci)
> {
> 	lockdep_assert_held(&ci->lock);
>
> 	/* Discard cluster must remain off-list or on discard list */
> 	if (cluster_is_discard(ci))
> 		return;
>
> 	if (!ci->count) {
> 		free_cluster(si, ci);
> 	...
> 	}