From: Barry Song <21cnbao@gmail.com>
Date: Sat, 3 Aug 2024 22:57:05 +1200
Subject: Re: [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
To: chrisl@kernel.org
Cc: akpm@linux-foundation.org, hughd@google.com, kaleshsingh@google.com,
	kasong@tencent.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	ryan.roberts@arm.com, ying.huang@intel.com, Barry Song
In-Reply-To: <20240803091118.84274-1-21cnbao@gmail.com>
References: <20240730-swap-allocator-v5-5-cb9c148b9297@kernel.org>
	<20240803091118.84274-1-21cnbao@gmail.com>

On Sat, Aug 3, 2024 at 9:11 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 6:49 PM wrote:
> >
> > From: Kairui Song
> >
> > Currently when we are freeing mTHP folios from the swap cache, we
> > free them one by one and put each entry into the swap slot cache.
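
(Aside, for readers new to this path: the slot cache in mm/swap_slots.c
batches freed entries in a per-CPU array and flushes them later, purely
to amortize taking si->lock. Paraphrasing the pre-patch behaviour for an
mTHP backed by nr_pages contiguous entries -- a sketch, not the literal
kernel loop:

	/* each of the folio's entries takes a trip through the per-CPU cache */
	for (i = 0; i < nr_pages; i++)
		free_swap_slot(swp_entry(type, offset + i));

Since those entries are contiguous within one cluster, they can be
freed under a single lock hold instead, which is the point the next
paragraph makes.)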
> > The slot cache is designed to reduce overhead by batching the
> > freeing, but mTHP swap entries are already contiguous, so they can
> > be batch-freed without it; going through the slot cache saves
> > little overhead, or even adds overhead for larger mTHPs.
> >
> > What's more, mTHP entries could stay in the swap cache for a while.
> > Contiguous swap entries are a rather rare resource, so releasing
> > them directly can help improve the mTHP allocation success rate
> > under pressure.
> >
> > Signed-off-by: Kairui Song
> > Acked-by: Barry Song
>
> I believe this is the right direction to take. Currently, entries are
> released one by one, even when they are contiguous in the swap file
> (those nr_pages entries are definitely in the same cluster and the
> same si), leading to numerous lock and unlock operations. This
> approach provides batched support.
>
> free_swap_and_cache_nr() has the same issue, so I drafted a patch
> based on your code. I wonder if you can also help test and review it
> before I send it officially:
>
> From 4bed5c08bc0f7769ee2849812acdad70c4e32ead Mon Sep 17 00:00:00 2001
> From: Barry Song
> Date: Sat, 3 Aug 2024 20:21:14 +1200
> Subject: [PATCH RFC] mm: attempt to batch free swap entries for zap_pte_range()
>
> Zhiguo reported that swap release could be a serious bottleneck
> during process exits[1]. With mTHP, we have the opportunity to
> batch-free swaps.
> Thanks to the work of Chris and Kairui[2], I was able to achieve
> this optimization with minimal code changes by building on their
> efforts.
>
> [1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/
> [2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> Signed-off-by: Barry Song
> ---
>  mm/swapfile.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ea023fc25d08..9def6dba8d26 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
>  	return true;
>  }
>
> +static bool swap_is_last_map(struct swap_info_struct *si,
> +		unsigned long offset, int nr_pages,
> +		bool *any_only_cache)
> +{
> +	unsigned char *map = si->swap_map + offset;
> +	unsigned char *map_end = map + nr_pages;
> +	bool cached = false;
> +
> +	do {
> +		if ((*map & ~SWAP_HAS_CACHE) != 1)
> +			return false;
> +		if (*map & SWAP_HAS_CACHE)
> +			cached = true;
> +	} while (++map < map_end);
> +
> +	*any_only_cache = cached;
> +	return true;
> +}
> +
>  /*
>   * returns number of pages in the folio that backs the swap entry. If positive,
>   * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
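
(A note on swap_is_last_map() above, since the mask trick is easy to
misread: each swap_map byte carries the entry's swap count in its low
bits plus the SWAP_HAS_CACHE flag, 0x40 in include/linux/swap.h. A few
illustrative byte values, assuming that flag definition:

	0x01: one mapper, no cache   -> last map, *any_only_cache stays false
	0x41: one mapper plus cache  -> last map, *any_only_cache becomes true
	0x02: two mappers            -> not the last map, helper returns false

Just a sketch of the intent, not additional kernel code.)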
> @@ -1808,6 +1827,29 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>  	if (WARN_ON(end_offset > si->max))
>  		goto out;
>
> +	if (nr > 1) {
> +		struct swap_cluster_info *ci;
> +		bool batched_free;
> +		int i;
> +
> +		ci = lock_cluster_or_swap_info(si, start_offset);
> +		if ((batched_free = swap_is_last_map(si, start_offset, nr, &any_only_cache))) {
> +			for (i = 0; i < nr; i++)
> +				WRITE_ONCE(si->swap_map[start_offset + i], SWAP_HAS_CACHE);
> +		}
> +		unlock_cluster_or_swap_info(si, ci);
> +
> +		if (batched_free) {
> +			spin_lock(&si->lock);
> +			pr_err("%s offset:%lx nr:%lx\n", __func__, start_offset, nr);
> +			swap_entry_range_free(si, entry, nr);
> +			spin_unlock(&si->lock);
> +			if (any_only_cache)
> +				goto reclaim;
> +			goto out;
> +		}

Sorry, what I actually meant was that the two gotos are reversed:

        if (batched_free) {
                if (any_only_cache)
                        goto reclaim;

                spin_lock(&si->lock);
                swap_entry_range_free(si, entry, nr);
                spin_unlock(&si->lock);
                goto out;
        }

> +	}
> +
>  	/*
>  	 * First free all entries in the range.
>  	 */
> @@ -1828,6 +1870,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>  	if (!any_only_cache)
>  		goto out;
>
> +reclaim:
>  	/*
>  	 * Now go back over the range trying to reclaim the swap cache. This is
>  	 * more efficient for large folios because we will only try to reclaim
> --
> 2.34.1
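
(For context on where nr > 1 comes from: zap_pte_range() already counts
how many contiguous swap PTEs point into the same large folio and
passes that count down. Roughly, paraphrased from memory of the current
source rather than quoted exactly:

	/* in zap_pte_range(), for a run of swap PTEs */
	max_nr = (end - addr) / PAGE_SIZE;
	nr = swap_pte_batch(pte, max_nr, ptent);
	rss[MM_SWAPENTS] -= nr;
	free_swap_and_cache_nr(entry, nr);

so the RFC only has to make the nr > 1 case cheap; the caller side is
already batched.)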
>
> >
> > ---
> >  mm/swapfile.c | 59 ++++++++++++++++++++++++++--------------------------
> >  1 file changed, 26 insertions(+), 33 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 34e6ea13e8e4..9b63b2262cc2 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> >  }
> >
> >  /*
> > - * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
> >   * which means no page in the cluster is in use, we can optionally discard
> >   * the cluster and add it to free cluster list.
> >   */
> > -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> > +static void dec_cluster_info_page(struct swap_info_struct *p,
> > +				  struct swap_cluster_info *ci, int nr_pages)
> >  {
> >  	if (!p->cluster_info)
> >  		return;
> >
> > -	VM_BUG_ON(ci->count == 0);
> > +	VM_BUG_ON(ci->count < nr_pages);
> >  	VM_BUG_ON(cluster_is_free(ci));
> >  	lockdep_assert_held(&p->lock);
> >  	lockdep_assert_held(&ci->lock);
> > -	ci->count--;
> > +	ci->count -= nr_pages;
> >
> >  	if (!ci->count) {
> >  		free_cluster(p, ci);
> > @@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> >  	return n_ret;
> >  }
> >
> > -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > -{
> > -	unsigned long offset = idx * SWAPFILE_CLUSTER;
> > -	struct swap_cluster_info *ci;
> > -
> > -	ci = lock_cluster(si, offset);
> > -	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > -	ci->count = 0;
> > -	free_cluster(si, ci);
> > -	unlock_cluster(ci);
> > -	swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > -}
> > -
> >  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> >  {
> >  	int order = swap_entry_order(entry_order);
> > @@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
> >  	return usage;
> >  }
> >
> > -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > +/*
> > + * Drop the last HAS_CACHE flag of swap entries; callers have to
> > + * ensure all entries belong to the same cgroup.
> > + */
> > +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
> > +				  unsigned int nr_pages)
> >  {
> > -	struct swap_cluster_info *ci;
> >  	unsigned long offset = swp_offset(entry);
> > -	unsigned char count;
> > +	unsigned char *map = p->swap_map + offset;
> > +	unsigned char *map_end = map + nr_pages;
> > +	struct swap_cluster_info *ci;
> >
> >  	ci = lock_cluster(p, offset);
> > -	count = p->swap_map[offset];
> > -	VM_BUG_ON(count != SWAP_HAS_CACHE);
> > -	p->swap_map[offset] = 0;
> > -	dec_cluster_info_page(p, ci);
> > +	do {
> > +		VM_BUG_ON(*map != SWAP_HAS_CACHE);
> > +		*map = 0;
> > +	} while (++map < map_end);
> > +	dec_cluster_info_page(p, ci, nr_pages);
> >  	unlock_cluster(ci);
> >
> > -	mem_cgroup_uncharge_swap(entry, 1);
> > -	swap_range_free(p, offset, 1);
> > +	mem_cgroup_uncharge_swap(entry, nr_pages);
> > +	swap_range_free(p, offset, nr_pages);
> >  }
> >
> >  static void cluster_swap_free_nr(struct swap_info_struct *sis,
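
(A quick worked example of what swap_entry_range_free() above buys:
freeing a swapped-out 64KiB mTHP, i.e. 16 entries at 4KiB page size,
previously meant 16 swap_entry_free() calls, each taking and dropping
the cluster lock and uncharging a single page. Now it is one
lock_cluster(), one pass clearing 16 swap_map bytes, one
mem_cgroup_uncharge_swap(entry, 16) and one swap_range_free(p, offset,
16) -- which is also why the comment insists all entries belong to the
same cgroup. Numbers are illustrative only.)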
> > @@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
> >  void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >  {
> >  	unsigned long offset = swp_offset(entry);
> > -	unsigned long idx = offset / SWAPFILE_CLUSTER;
> >  	struct swap_cluster_info *ci;
> >  	struct swap_info_struct *si;
> >  	unsigned char *map;
> > @@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >  		return;
> >
> >  	ci = lock_cluster_or_swap_info(si, offset);
> > -	if (size == SWAPFILE_CLUSTER) {
> > +	if (size > 1) {
> >  		map = si->swap_map + offset;
> > -		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> > +		for (i = 0; i < size; i++) {
> >  			val = map[i];
> >  			VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >  			if (val == SWAP_HAS_CACHE)
> >  				free_entries++;
> >  		}
> > -		if (free_entries == SWAPFILE_CLUSTER) {
> > +		if (free_entries == size) {
> >  			unlock_cluster_or_swap_info(si, ci);
> >  			spin_lock(&si->lock);
> > -			mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> > -			swap_free_cluster(si, idx);
> > +			swap_entry_range_free(si, entry, size);
> >  			spin_unlock(&si->lock);
> >  			return;
> >  		}
> > @@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> >  	for (i = 0; i < n; ++i) {
> >  		p = swap_info_get_cont(entries[i], prev);
> >  		if (p)
> > -			swap_entry_free(p, entries[i]);
> > +			swap_entry_range_free(p, entries[i], 1);
> >  		prev = p;
> >  	}
> >  	if (p)
> >
> > --
> > 2.46.0.rc1.232.g9752f9e123-goog
> >
>
> Thanks
> Barry
>