From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song
Date: Mon, 13 Jan 2025 16:07:38 +0800
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
To: Baoquan He
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
 Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner,
 Kalesh Singh, linux-kernel@vger.kernel.org
References: <20241230174621.61185-1-ryncsn@gmail.com> <20241230174621.61185-10-ryncsn@gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Mon, Jan 13, 2025 at 2:33 PM Kairui Song wrote:
>
> On Fri, Jan 10, 2025 at 7:24 PM Baoquan He wrote:
> >
> > On 01/09/25 at 10:15am, Kairui Song wrote:
> > > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He wrote:
> > > >
> > >
> > > Thanks for the very detailed review!
> > >
> > > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > > ......snip.....
> > > > > ---
> > > > >  include/linux/swap.h |   3 +-
> > > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > > >   * throughput.
> > > > >   */
> > > > >  struct percpu_cluster {
> > > > > +        local_lock_t lock; /* Protect the percpu_cluster above */
> > > > >          unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > > >  };
> > > > >
> > > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > > >                                  /* list of clusters that contain at least one free slot */
> > > > >          struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > > >                                  /* list of clusters that are fragmented or contended */
> > > > > -        unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > > +        atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > > >          unsigned int pages;              /* total of usable pages of swap */
> > > > >          atomic_long_t inuse_pages;       /* number of those currently in use */
> > > > >          struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index 7795a3d27273..dadd4fead689 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > ...snip...
> > > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > > >
> > > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > > >  {
> > > > > -        lockdep_assert_held(&si->lock);
> > > > >          lockdep_assert_held(&ci->lock);
> > > > >          cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > > >          ci->order = 0;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * Isolate and lock the first cluster on a list that is not contended,
> > > > > + * clearing its flag before it is taken off-list. Cluster flags must be
> > > > > + * kept in sync with list status, so cluster updaters can always know the
> > > > > + * cluster's list status without touching the si lock.
> > > > > + *
> > > > > + * Note it's possible that all clusters on a list are contended, so
> > > > > + * this may return NULL for a non-empty list.
> > > > > + */
> > > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > > +                struct swap_info_struct *si, struct list_head *list)
> > > > > +{
> > > > > +        struct swap_cluster_info *ci, *ret = NULL;
> > > > > +
> > > > > +        spin_lock(&si->lock);
> > > > > +
> > > > > +        if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > > +                goto out;
> > > > > +
> > > > > +        list_for_each_entry(ci, list, list) {
> > > > > +                if (!spin_trylock(&ci->lock))
> > > > > +                        continue;
> > > > > +
> > > > > +                /* We may only isolate and clear flags of the following lists */
> > > > > +                VM_BUG_ON(!ci->flags);
> > > > > +                VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > > +                          ci->flags != CLUSTER_FLAG_FULL);
> > > > > +
> > > > > +                list_del(&ci->list);
> > > > > +                ci->flags = CLUSTER_FLAG_NONE;
> > > > > +                ret = ci;
> > > > > +                break;
> > > > > +        }
> > > > > +out:
> > > > > +        spin_unlock(&si->lock);
> > > > > +
> > > > > +        return ret;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > > -*/
> > > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > > + * will be added to the free cluster list. Discard clusters are a bit
> > > > > + * special as they don't participate in allocation or reclaim, so clusters
> > > > > + * marked as CLUSTER_FLAG_DISCARD must remain off-list or on the discard list.
> > > > > + */
> > > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > >  {
> > > > >          struct swap_cluster_info *ci;
> > > > > +        bool ret = false;
> > > > >          unsigned int idx;
> > > > >
> > > > > +        spin_lock(&si->lock);
> > > > >          while (!list_empty(&si->discard_clusters)) {
> > > > >                  ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > > +                /*
> > > > > +                 * Delete the cluster from the list but don't clear its flags until
> > > > > +                 * discard is done, so isolation and relocation will skip it.
> > > > > +                 */
> > > > >                  list_del(&ci->list);
> > > >
> > > > I don't understand the above comment. ci has been taken off the list,
> > > > while allocation needs to isolate from a usable list. Even if we clear
> > > > ci->flags now, how would isolation and relocation touch it? I may be
> > > > missing something here.
> > >
> > > There are many cases. One possible and common situation is that a
> > > percpu cluster (the si->percpu_cluster of another CPU) is still
> > > pointing to it.
> > >
> > > Also, this commit removed the si lock protection from allocation, and
> > > the allocation path may also drop the ci lock to call reclaim, which
> > > means a cluster could be used or freed by anyone before the allocator
> > > reacquires the ci lock. In that case, the allocator could see a
> > > discard cluster.
> > >
> > > So we don't clear the discard flag, in case anyone misuses it.
> > >
> > > I can add more inline comments on this; there are already some related
> > > comments above the function relocate_cluster, and I could add more
> > > referencing them.

Hi Baoquan,

> >
> > Thanks for your great explanation. I understand that si->percpu_cluster
> > could point to a discarded ci, and that a ci could be taken from the
> > non-full or frag lists but later become discarded if it is freed on
> > another CPU during a cluster_reclaim_range() invocation. I still
> > haven't got how isolation could see a discarded ci in
> > cluster_isolate_lock(). Could you give an example of how that can
> > happen?
>
> cluster_isolate_lock shouldn't see a discard cluster, and there is a
> VM_BUG_ON for that.

Oh, now I realize what you mean: the comment in swap_do_scheduled_discard
said that cluster_isolate_lock may see a discard cluster. That is not
true; it was added in an early version of this series and I forgot to
update the comment. I'll just drop it.

> >
> > Surely, I understand that keeping the discarded flag is necessary so
> > that a check like cluster_is_usable() will return the expected value.
> >
> > And by the way, I haven't got when the 'if (!ci->count)' case could
> > happen in relocate_cluster(), since we have filtered away discarded ci
> > with the 'if (cluster_is_discard(ci))' check. I asked in another
> > thread; could you help explain it?
>
> Many swap devices don't need discard, so the cluster can be freed
> directly. And actually the ci->count check in relocate_cluster is not
> necessarily related to that.
>
> The caller of relocate_cluster may fail an allocation (e.g. a race with
> swapoff), and that could end up calling relocate_cluster with an empty
> cluster; such a cluster should go to the free list (swapoff might fail
> too).
>
> The swapoff case is extremely rare, but let's just be more robust here:
> covering the empty cluster has almost no overhead but saves a lot of
> effort. I can add some comments on this.
>
> >
> > static void relocate_cluster(struct swap_info_struct *si,
> >                              struct swap_cluster_info *ci)
> > {
> >         lockdep_assert_held(&ci->lock);
> >
> >         /* Discard cluster must remain off-list or on discard list */
> >         if (cluster_is_discard(ci))
> >                 return;
> >
> >         if (!ci->count) {
> >                 free_cluster(si, ci);
> >         ...
> > }
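
For readers following the thread, the flow being described is: a cluster
queued for discard stays off-list (or on the discard list), an empty
cluster goes back to the free list, and any other cluster is re-listed
according to how full it is. Below is a minimal sketch of that flow,
assuming a full_clusters list, a CLUSTER_FLAG_FRAG flag, and a
cluster_is_full() helper; none of these appear in the quoted hunks, so
treat them as illustrative names rather than the series' exact code
(the real function body is elided above):

static void relocate_cluster_sketch(struct swap_info_struct *si,
                                    struct swap_cluster_info *ci)
{
        lockdep_assert_held(&ci->lock);

        /* A discard cluster must remain off-list or on the discard list */
        if (cluster_is_discard(ci))
                return;

        /*
         * A failed allocation (e.g. a race with swapoff) can hand back a
         * cluster with no entries in use; sending it to the free list
         * keeps this rare case correct at almost no cost.
         */
        if (!ci->count)
                free_cluster(si, ci);
        else if (cluster_is_full(ci))
                cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
        else
                cluster_move(si, ci, &si->frag_clusters[ci->order],
                             CLUSTER_FLAG_FRAG);
}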
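
Similarly, the reason CLUSTER_FLAG_DISCARD must survive until the discard
completes can be seen from the revalidation that a holder of a
possibly-stale cluster pointer (for example, another CPU's
percpu_cluster) must do under only the ci lock. A sketch, assuming
cluster_is_usable() rejects any cluster whose flags mark it unusable (the
helper is named in this thread, but its exact signature and the
try_use_cached_cluster() wrapper here are assumptions for illustration):

static bool try_use_cached_cluster(struct swap_cluster_info *ci, int order)
{
        bool usable;

        spin_lock(&ci->lock);
        /*
         * While the lock was not held, the cluster may have been freed,
         * refilled, or queued for discard. Because the discard flag is
         * only cleared after the discard finishes, a discarding cluster
         * is reliably rejected here instead of being reused.
         */
        usable = cluster_is_usable(ci, order);
        if (!usable)
                spin_unlock(&ci->lock);
        /* on success the caller allocates from ci and unlocks afterwards */
        return usable;
}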