From: Barry Song <21cnbao@gmail.com>
To: chrisl@kernel.org
Cc: akpm@linux-foundation.org, baohua@kernel.org, hughd@google.com,
 kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, ryan.roberts@arm.com, ying.huang@intel.com, Barry Song
Subject: Re: [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
Date: Sat, 3 Aug 2024 21:11:18 +1200
Message-Id: <20240803091118.84274-1-21cnbao@gmail.com>
In-Reply-To: <20240730-swap-allocator-v5-5-cb9c148b9297@kernel.org>
References: <20240730-swap-allocator-v5-5-cb9c148b9297@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On Wed, Jul 31, 2024 at 6:49 PM wrote:
>
> From: Kairui Song
>
> Currently when we are freeing mTHP folios from swap cache, we free
> them one by one and put each entry into the swap slot cache. The slot
> cache is designed to reduce overhead by batching the freeing, but
> mTHP swap entries are already contiguous, so they can be batch freed
> without it; going through the slot cache saves little overhead, or
> even adds overhead for larger mTHP.
>
> What's more, mTHP entries could stay in swap cache for a while.
> Contiguous swap entries are a rather rare resource, so releasing them
> directly can help improve the mTHP allocation success rate when under
> pressure.
>
> Signed-off-by: Kairui Song

Acked-by: Barry Song

I believe this is the right direction to take. Currently, entries are
released one by one, even when they are contiguous in the swap file
(those nr_pages entries are definitely in the same cluster and the same
si), leading to numerous lock and unlock operations. This approach
provides batched support.
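
To make the cost difference concrete, here is a minimal userspace sketch
of the two locking patterns (a toy model only: a pthread mutex stands in
for the cluster lock and a plain array for si->swap_map; none of the
names below are kernel APIs):

#include <pthread.h>

struct toy_cluster {
	pthread_mutex_t lock;
	unsigned char map[512];		/* toy stand-in for si->swap_map */
};

/* Before: one lock/unlock round trip for every single entry. */
static void free_one_by_one(struct toy_cluster *c, int first, int nr)
{
	for (int i = 0; i < nr; i++) {
		pthread_mutex_lock(&c->lock);
		c->map[first + i] = 0;
		pthread_mutex_unlock(&c->lock);
	}
}

/* After: take the lock once for the whole contiguous range. */
static void free_batched(struct toy_cluster *c, int first, int nr)
{
	pthread_mutex_lock(&c->lock);
	for (int i = 0; i < nr; i++)
		c->map[first + i] = 0;
	pthread_mutex_unlock(&c->lock);
}

For a 64KB mTHP on a 4KB-page system, that collapses 16 lock/unlock
pairs into one.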
free_swap_and_cache_nr() has the same issue, so I drafted a patch based
on your code. I wonder if you could also help test and review it before
I send it officially:

From 4bed5c08bc0f7769ee2849812acdad70c4e32ead Mon Sep 17 00:00:00 2001
From: Barry Song
Date: Sat, 3 Aug 2024 20:21:14 +1200
Subject: [PATCH RFC] mm: attempt to batch free swap entries for zap_pte_range()

Zhiguo reported that swap release could be a serious bottleneck
during process exits[1]. With mTHP, we have the opportunity to batch
free swaps. Thanks to the work of Chris and Kairui[2], I was able to
achieve this optimization with minimal code changes by building on
their efforts.

[1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/
[2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/

Signed-off-by: Barry Song
---
 mm/swapfile.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index ea023fc25d08..9def6dba8d26 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
 	return true;
 }
 
+static bool swap_is_last_map(struct swap_info_struct *si,
+			     unsigned long offset, int nr_pages,
+			     bool *any_only_cache)
+{
+	unsigned char *map = si->swap_map + offset;
+	unsigned char *map_end = map + nr_pages;
+	bool cached = false;
+
+	do {
+		if ((*map & ~SWAP_HAS_CACHE) != 1)
+			return false;
+		if (*map & SWAP_HAS_CACHE)
+			cached = true;
+	} while (++map < map_end);
+
+	*any_only_cache = cached;
+	return true;
+}
+
 /*
  * returns number of pages in the folio that backs the swap entry. If positive,
  * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
@@ -1808,6 +1827,29 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 	if (WARN_ON(end_offset > si->max))
 		goto out;
 
+	if (nr > 1) {
+		struct swap_cluster_info *ci;
+		bool batched_free;
+		int i;
+
+		ci = lock_cluster_or_swap_info(si, start_offset);
+		if ((batched_free = swap_is_last_map(si, start_offset, nr, &any_only_cache))) {
+			for (i = 0; i < nr; i++)
+				WRITE_ONCE(si->swap_map[start_offset + i], SWAP_HAS_CACHE);
+		}
+		unlock_cluster_or_swap_info(si, ci);
+
+		if (batched_free) {
+			spin_lock(&si->lock);
+			pr_err("%s offset:%lx nr:%d\n", __func__, start_offset, nr);
+			swap_entry_range_free(si, entry, nr);
+			spin_unlock(&si->lock);
+			if (any_only_cache)
+				goto reclaim;
+			goto out;
+		}
+	}
+
 	/*
 	 * First free all entries in the range.
 	 */
@@ -1828,6 +1870,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 	if (!any_only_cache)
 		goto out;
 
+reclaim:
 	/*
 	 * Now go back over the range trying to reclaim the swap cache. This is
 	 * more efficient for large folios because we will only try to reclaim
-- 
2.34.1
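
As a side note for reviewers of swap_is_last_map() above: each swap_map
byte carries the entry's map count in its low bits, with SWAP_HAS_CACHE
as a flag bit, so "(*map & ~SWAP_HAS_CACHE) != 1" rejects any entry that
is not down to its final mapping (including entries with a count
continuation). A standalone toy version of that check, runnable in
userspace (illustrative only, not kernel API):

#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE	0x40	/* flag bit, as in the kernel's swap_map */

/* Toy re-implementation of the swap_is_last_map() check on a plain array. */
static bool is_last_map(const unsigned char *map, int nr, bool *any_cached)
{
	bool cached = false;

	for (int i = 0; i < nr; i++) {
		/* A map count other than 1 means another mapping remains. */
		if ((map[i] & ~SWAP_HAS_CACHE) != 1)
			return false;
		if (map[i] & SWAP_HAS_CACHE)
			cached = true;
	}
	*any_cached = cached;
	return true;
}

int main(void)
{
	/* Four entries, each mapped once; two also sit in the swap cache. */
	unsigned char map[] = { 1, 1 | SWAP_HAS_CACHE, 1, 1 | SWAP_HAS_CACHE };
	bool cached = false;
	bool last = is_last_map(map, 4, &cached);

	printf("last map: %d, any cached: %d\n", last, cached);	/* 1, 1 */
	return 0;
}
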
> ---
>  mm/swapfile.c | 59 ++++++++++++++++++++++++++---------------------------
>  1 file changed, 26 insertions(+), 33 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 34e6ea13e8e4..9b63b2262cc2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>  }
>
>  /*
> - * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
>   * which means no page in the cluster is in use, we can optionally discard
>   * the cluster and add it to free cluster list.
>   */
> -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> +static void dec_cluster_info_page(struct swap_info_struct *p,
> +                                 struct swap_cluster_info *ci, int nr_pages)
>  {
>         if (!p->cluster_info)
>                 return;
>
> -       VM_BUG_ON(ci->count == 0);
> +       VM_BUG_ON(ci->count < nr_pages);
>         VM_BUG_ON(cluster_is_free(ci));
>         lockdep_assert_held(&p->lock);
>         lockdep_assert_held(&ci->lock);
> -       ci->count--;
> +       ci->count -= nr_pages;
>
>         if (!ci->count) {
>                 free_cluster(p, ci);
> @@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>         return n_ret;
>  }
>
> -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> -{
> -       unsigned long offset = idx * SWAPFILE_CLUSTER;
> -       struct swap_cluster_info *ci;
> -
> -       ci = lock_cluster(si, offset);
> -       memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> -       ci->count = 0;
> -       free_cluster(si, ci);
> -       unlock_cluster(ci);
> -       swap_range_free(si, offset, SWAPFILE_CLUSTER);
> -}
> -
>  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  {
>         int order = swap_entry_order(entry_order);
> @@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
>         return usage;
>  }
>
> -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> +/*
> + * Drop the last HAS_CACHE flag of swap entries; callers have to
> + * ensure all entries belong to the same cgroup.
> + */
> +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
> +                                 unsigned int nr_pages)
>  {
> -       struct swap_cluster_info *ci;
>         unsigned long offset = swp_offset(entry);
> -       unsigned char count;
> +       unsigned char *map = p->swap_map + offset;
> +       unsigned char *map_end = map + nr_pages;
> +       struct swap_cluster_info *ci;
>
>         ci = lock_cluster(p, offset);
> -       count = p->swap_map[offset];
> -       VM_BUG_ON(count != SWAP_HAS_CACHE);
> -       p->swap_map[offset] = 0;
> -       dec_cluster_info_page(p, ci);
> +       do {
> +               VM_BUG_ON(*map != SWAP_HAS_CACHE);
> +               *map = 0;
> +       } while (++map < map_end);
> +       dec_cluster_info_page(p, ci, nr_pages);
>         unlock_cluster(ci);
>
> -       mem_cgroup_uncharge_swap(entry, 1);
> -       swap_range_free(p, offset, 1);
> +       mem_cgroup_uncharge_swap(entry, nr_pages);
> +       swap_range_free(p, offset, nr_pages);
>  }
>
>  static void cluster_swap_free_nr(struct swap_info_struct *sis,
> @@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
>  void put_swap_folio(struct folio *folio, swp_entry_t entry)
>  {
>         unsigned long offset = swp_offset(entry);
> -       unsigned long idx = offset / SWAPFILE_CLUSTER;
>         struct swap_cluster_info *ci;
>         struct swap_info_struct *si;
>         unsigned char *map;
> @@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>                 return;
>
>         ci = lock_cluster_or_swap_info(si, offset);
> -       if (size == SWAPFILE_CLUSTER) {
> +       if (size > 1) {
>                 map = si->swap_map + offset;
> -               for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> +               for (i = 0; i < size; i++) {
>                         val = map[i];
>                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>                         if (val == SWAP_HAS_CACHE)
>                                 free_entries++;
>                 }
> -               if (free_entries == SWAPFILE_CLUSTER) {
> +               if (free_entries == size) {
>                         unlock_cluster_or_swap_info(si, ci);
>                         spin_lock(&si->lock);
> -                       mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> -                       swap_free_cluster(si, idx);
> +                       swap_entry_range_free(si, entry, size);
>                         spin_unlock(&si->lock);
>                         return;
>                 }
> @@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
>         for (i = 0; i < n; ++i) {
>                 p = swap_info_get_cont(entries[i], prev);
>                 if (p)
> -                       swap_entry_free(p, entries[i]);
> +                       swap_entry_range_free(p, entries[i], 1);
>                 prev = p;
>         }
>         if (p)
>
> --
> 2.46.0.rc1.232.g9752f9e123-goog
>

Thanks
Barry