From: Chris Li <chrisl@kernel.org>
To: Kairui Song <kasong@tencent.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>,
"Huang, Ying" <ying.huang@linux.alibaba.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] mm, swap: don't scan every fragment cluster
Date: Tue, 5 Aug 2025 16:30:02 -0700
Message-ID: <CAF8kJuPY20cybaFqBXk34sEgZ8ydNOk7AoOtmNGLtdO3huzE-Q@mail.gmail.com>
In-Reply-To: <20250804172439.2331-2-ryncsn@gmail.com>

Looks good to me, with minor nitpicks on the commit message and comments.
Let me know whether you plan to send a refreshed version.

Nit: I suggest the patch title use positive terms, something along the lines of:

"Only scan one cluster in fragment list"

"Don't scan" describes what the patch does not do rather than
what the patch does.

On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Fragment clusters were mostly failing high order allocation already.
> The reason we scan it now is that a swap slot may get freed without
> releasing the swap cache, so a swap map entry will end up in HAS_CACHE
> only status, and the cluster won't be moved back to non-full or free
> cluster list.
>
> Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap

Nit: Please clarify what "this" refers to here. I assume it means scanning
the fragment list, but from the context it could also be read as "a swap
map entry ends up in HAS_CACHE only status".

> device usage is low (!vm_swap_full()) since swap will try to lazy free
> the swap cache.
>
> It's unlikely to cause any real issue. Fragmentation is only an issue
> when the device is getting full, and by that time, swap will already
> be releasing the swap cache aggressively. And swap cache reclaim happens
> when the allocator scans a cluster too. Scanning one fragment cluster
> should be good enough to reclaim these pinned slots.
>
> And besides, only high order allocation requires iterating over a
> cluster list, order 0 allocation will succeed on the first attempt.
> And high order allocation failure isn't a serious problem.
>
> So the iteration of fragment clusters is trivial, but it will slow down
> mTHP allocation by a lot when the fragment cluster list is long.
> So it's better to drop this fragment cluster iteration design. Only
> scanning one fragment cluster is good enough in case any cluster is
> stuck in the fragment list; this ensures order 0 allocation never
> fails, and large allocations still have an acceptable success rate.
>
> Test on a 48c96t system, build linux kernel using 10G ZRAM, make -j48,
> defconfig with 768M cgroup memory limit, on top of tmpfs, 4K folio
> only:
>
> Before: sys time: 4407.28s
> After: sys time: 4425.22s
>
> Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
>
> Before: sys time: 10230.22s 64kB/swpout: 1793044 64kB/swpout_fallback: 17653
> After: sys time: 5527.90s 64kB/swpout: 1789358 64kB/swpout_fallback: 17813
>
> Change to 8G ZRAM:
>
> Before: sys time: 21929.17s 64kB/swpout: 1634681 64kB/swpout_fallback: 173056
> After: sys time: 6121.01s 64kB/swpout: 1638155 64kB/swpout_fallback: 189562
>
> Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 7368.41s 64kB/swpout:1787599 swpout_fallback: 0
> After: sys time: 7338.27s 64kB/swpout:1783106 swpout_fallback: 0
>
> Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 28139.60s 64kB/swpout:1645421 swpout_fallback: 148408
> After: sys time: 8941.90s 64kB/swpout:1592973 swpout_fallback: 265010
>
> The performance is a lot better and large order allocation failure rate
> is only very slightly higher or unchanged.
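
To put the quoted numbers side by side, the relative sys time saving and the 64kB fallback share can be recomputed with a small helper. This is only a sketch: the function names are made up for illustration, and treating swpout + swpout_fallback as the total number of 64kB swap-out attempts is my assumption, not something the commit message states.

```c
#include <stdio.h>

/* Relative sys time saving in percent; positive means "after" is faster. */
static double sys_saving_pct(double before_s, double after_s)
{
	return 100.0 * (before_s - after_s) / before_s;
}

/*
 * Share of 64kB swap-out attempts that fell back, in percent.
 * Assumption: attempts = swpout + swpout_fallback.
 */
static double fallback_pct(long swpout, long fallback)
{
	long attempts = swpout + fallback;

	return attempts ? 100.0 * (double)fallback / (double)attempts : 0.0;
}

/* Print one before/after row using the helpers above. */
static void report(const char *name, double sys_before, double sys_after,
		   long out_before, long fb_before,
		   long out_after, long fb_after)
{
	printf("%s: sys time saving %.1f%%, fallback %.2f%% -> %.2f%%\n",
	       name, sys_saving_pct(sys_before, sys_after),
	       fallback_pct(out_before, fb_before),
	       fallback_pct(out_after, fb_after));
}
```

For example, report("8G brd", 28139.60, 8941.90, 1645421, 148408, 1592973, 265010) would cover the last test case quoted above.
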
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/swap.h | 1 -
> mm/swapfile.c | 30 ++++++++----------------------
> 2 files changed, 8 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2fe6ed2cc3fd..a060d102e0d1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -310,7 +310,6 @@ struct swap_info_struct {
> /* list of cluster that contains at least one free slot */
> struct list_head frag_clusters[SWAP_NR_ORDERS];
> /* list of cluster that are fragmented or contented */
> - atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];

Nit: please add a note to the commit log explaining why the
frag_cluster_nr counter is removed.

I feel this change could be split out from the main change of this
patch. The main performance improvement comes from scanning only one
fragment cluster rather than the full list, right? Deleting the counter
helps, but by a much smaller amount.

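
To illustrate why I think the scan limit is the dominant factor, here is a toy model. It is not the mm/swapfile.c code: every name below (toy_cluster, toy_scan_cluster, toy_alloc_from_frag, SLOTS_PER_CLUSTER) is hypothetical, and the cluster is reduced to a small array of used flags. With max_scan equal to the list length it walks every fragment cluster, with max_scan set to 1 it stops after the first one.

```c
#include <stdbool.h>
#include <stddef.h>

#define SLOTS_PER_CLUSTER 8	/* toy size, not the kernel's cluster size */

struct toy_cluster {
	bool used[SLOTS_PER_CLUSTER];
};

/* Find nr contiguous, nr-aligned free slots in one cluster; -1 if none. */
static int toy_scan_cluster(const struct toy_cluster *ci, int nr)
{
	for (int start = 0; start + nr <= SLOTS_PER_CLUSTER; start += nr) {
		int i;

		for (i = 0; i < nr && !ci->used[start + i]; i++)
			;
		if (i == nr)
			return start;
	}
	return -1;
}

/*
 * Scan at most max_scan clusters from the head of a fragment list.
 * Returns the index of the first cluster that can satisfy the request,
 * or -1, and reports the number of clusters visited through *visited.
 */
static int toy_alloc_from_frag(const struct toy_cluster *list, size_t n,
			       size_t max_scan, int nr, size_t *visited)
{
	*visited = 0;
	for (size_t i = 0; i < n && i < max_scan; i++) {
		(*visited)++;
		if (toy_scan_cluster(&list[i], nr) >= 0)
			return (int)i;
	}
	return -1;
}
```

In this model a failing high order request costs one visit per cluster on the list when the whole list is walked, and a single visit when the scan is capped at one, while an order 0 request (nr = 1) still succeeds on the first cluster that has any free slot.
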
Chris