From: Yosry Ahmed <yosryahmed@google.com>
To: Kairui Song <kasong@tencent.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Hugh Dickins <hughd@google.com>,
"Huang, Ying" <ying.huang@intel.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Nhat Pham <nphamcs@gmail.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
Date: Wed, 23 Oct 2024 10:59:42 -0700
Message-ID: <CAJD7tkYzcxNbZLRL4DezWC27RqdEcfqr0Eafr2h0bUbacuRSjw@mail.gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
On Tue, Oct 22, 2024 at 12:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series greatly improves swap allocator performance by reworking
> the locking design and simplifying many code paths.
>
> This is a follow-up to the previous swap cluster allocator series:
> https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> And this series is based on a follow-up fix for the swap cluster
> allocator:
> https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
>
> This is part of the new swap allocator work item discussed in
> Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> "mTHP and swap allocator" discussion at LPC 2024.
>
> The previous series introduced a fully cluster-based allocation
> algorithm; this series completely gets rid of the old allocation path
> and makes the allocator avoid grabbing si->lock unless needed. This
> brings a huge performance gain and removes the slot cache on the
> freeing path.
>
> Currently, swap locking mainly consists of two locks: the cluster lock
> (ci->lock) and the device lock (si->lock). The device lock is widely
> used to protect many things, making it the main bottleneck for swap.
>
> The cluster lock is much more fine-grained, so it is best to use
> ci->lock instead of si->lock as much as possible, as sketched below.
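>
> As a rough sketch of the two granularities (heavily simplified from
> the structs in include/linux/swap.h, with most fields omitted):
>
>     /* One per swap device: guards cluster lists, the plist, etc. */
>     struct swap_info_struct {
>             spinlock_t lock;        /* si->lock: coarse, contended */
>             /* ... */
>     };
>
>     /* One per cluster of SWAPFILE_CLUSTER swap slots. */
>     struct swap_cluster_info {
>             spinlock_t lock;        /* ci->lock: fine-grained */
>             /* ... */
>     };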
>
> `perf lock` shows this issue clearly. Doing a Linux kernel build
> using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg
> and 4K pages), the result of "perf lock contention -ab sleep 3" is:
>
> contended total wait max wait avg wait type caller
>
> 34948 53.63 s 7.11 ms 1.53 ms spinlock free_swap_and_cache_nr+0x350
> 16569 40.05 s 6.45 ms 2.42 ms spinlock get_swap_pages+0x231
> 11191 28.41 s 7.03 ms 2.54 ms spinlock swapcache_free_entries+0x59
> 4147 22.78 s 122.66 ms 5.49 ms spinlock page_vma_mapped_walk+0x6f3
> 4595 7.17 s 6.79 ms 1.56 ms spinlock swapcache_free_entries+0x59
> 406027 2.74 s 2.59 ms 6.74 us spinlock list_lru_add+0x39
> ...snip...
>
> The top 5 callers are all users of si->lock; their total wait time
> sums to several minutes within the 3-second time window.
>
> Following the new allocator design, many operations don't need to
> touch si->lock at all. We only need to take si->lock when doing
> operations across multiple clusters (e.g. changing the cluster lists);
> other operations only need to take ci->lock. So ideally the allocator
> should always take ci->lock first, then, if needed, take si->lock. But
> for historical reasons, ci->lock is by design used inside si->lock,
> causing lock inversion if we simply try to acquire si->lock after
> acquiring ci->lock.
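>
> A minimal sketch of the inversion (illustrative pseudo-code only, not
> the actual kernel functions):
>
>     /* Legacy order: si->lock outside, ci->lock nested inside. */
>     spin_lock(&si->lock);
>     ci = lock_cluster(si, offset);  /* acquires ci->lock */
>     /* ... allocate from the cluster ... */
>
>     /*
>      * Desired order: ci->lock first, si->lock only when the cluster
>      * lists must change.  Mixing this with the legacy order is a
>      * classic ABBA deadlock, hence the full rework of si->lock users:
>      */
>     spin_lock(&ci->lock);
>     spin_lock(&si->lock);           /* inverted vs. the legacy path */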
>
> This series audits all si->lock usage, simplifies legacy code, and
> eliminates usage of si->lock as much as possible by introducing new
> designs based on the new cluster allocator.
>
> The old HDD allocation code is removed, and the cluster allocator is
> adapted with small changes for HDD usage; testing looks OK.
>
> This also removes the slot cache from the freeing path. Performance
> is better without it, and this enables other cleanups and
> optimizations as discussed before:
> https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
>
> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` in the same test as above:
>
> contended total wait max wait avg wait type caller
> ... snip ...
> 91 204.62 us 4.51 us 2.25 us spinlock cluster_move+0x2e
> ... snip ...
> 47 125.62 us 4.47 us 2.67 us spinlock cluster_move+0x2e
> ... snip ...
> 23 63.15 us 3.95 us 2.74 us spinlock cluster_move+0x2e
> ... snip ...
> 17 41.26 us 4.58 us 2.43 us spinlock cluster_isolate_lock+0x1d
> ... snip ...
>
> cluster_move and cluster_isolate_lock are now basically the only
> users of si->lock; the performance gain is huge, with reduced LOC.
>
> Tests
> ===
>
> Build kernel with defconfig on tmpfs with ZRAM as swap:
> ---
>
> Running a test matrix which is scaled up progressively for an
> intuitive result. The tests are run on top of tmpfs, using a memory
> cgroup for memory limitation, on a 48c96t system.
>
> 12 test runs for each case. It can be seen clearly that the
> performance gain grows as the number of concurrent jobs increases,
> and performance is better even with low concurrency.
>
> make -j<NR> | System Time (seconds) | Total Time (seconds)
> (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> With 4k pages only:
> 6 / 192M / 3G | 5258 / 5235 / -0.3% | 1420 / 1414 / -0.3%
> 12 / 256M / 4G | 5518 / 5337 / -3.3% | 758 / 742 / -2.1%
> 24 / 384M / 5G | 7091 / 5766 / -18.7% | 476 / 422 / -11.3%
> 48 / 768M / 7G | 11139 / 5831 / -47.7% | 330 / 221 / -33.0%
> 96 / 1.5G / 10G | 21303 / 11353 / -46.7% | 283 / 180 / -36.4%
> With 64k mTHP:
> 24 / 512M / 5G    | 5104 / 4641 / -9.1%       | 376 / 358 / -4.8%
> 48 / 1G / 7G      | 8693 / 4662 / -46.4%      | 257 / 176 / -31.5%
> 96 / 2G / 10G | 17056 / 10263 / -39.8% | 234 / 169 / -27.8%
>
> With a more aggressive setup, both the performance and the
> fragmentation rate are clearly better:
>
> time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 73578.30, Real time: 864.05
> time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> After: (-54.7% sys time, -49.3% real time)
> Sys time: 33314.76, Real time: 437.67
>
> time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 74044.85, Real time: 846.51
> hugepages-64kB/stats/swpout: 1735216
> hugepages-64kB/stats/swpout_fallback: 430333
> After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> Sys time: 35958.87, Real time: 442.69
> hugepages-64kB/stats/swpout: 1866267
> hugepages-64kB/stats/swpout_fallback: 158330
>
> There is up to a 54.7% improvement for the kernel build test, and a
> lower fragmentation rate. The performance improvement should be even
> larger for micro benchmarks.
>
> Build kernel with tinyconfig on tmpfs with HDD as swap:
> ---
>
> This test is similar to the above, but the HDD test is very noisy and
> slow with huge deviation, so tinyconfig is used instead, taking the
> median result of 3 test runs, which looks OK:
>
> Before this series:
> 114.44user 29.11system 39:42.90elapsed 6%CPU
> 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>
> After this series:
> 113.90user 23.81system 38:11.77elapsed 6%CPU
> 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>
> Single thread SWAP:
> ---
>
> Sequential swap should also be slightly faster, as a lot of
> unnecessary parts were removed. Tested using a micro benchmark that
> swaps out/in 4G of zero-filled memory using ZRAM, 10 test runs:
>
> Swapout Before (avg. 3359304):
> 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>
> Swapin Before (avg. 1928698):
> 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>
> Swapout After (avg. 3347511, -0.4%):
> 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>
> Swapin After (avg. 1922290, -0.3%):
> 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913
Unfortunately I don't have the time to go through this series, but I
just wanted to say that this is awesome work, Kairui.
Selfishly, I especially like cleaning up the swap slot freeing path,
and having a centralized freeing path with a single call to
zswap_invalidate().
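(To illustrate what I mean, a hypothetical sketch; the helper name is
made up, and I'm assuming the zswap_invalidate(swp_entry_t) signature,
which has shifted across kernel versions, so the actual call site in
the series may look different:)

    /* One central place frees a swap entry and drops any compressed
     * copy, instead of scattering zswap_invalidate() over call sites. */
    static void free_swap_entry_example(swp_entry_t entry)
    {
            zswap_invalidate(entry);        /* drop the zswap copy */
            /* ... then return the slot to its cluster ... */
    }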
Thanks for doing this :)