From: Yosry Ahmed <yosryahmed@google.com>
To: Kairui Song <kasong@tencent.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Hugh Dickins <hughd@google.com>,
"Huang, Ying" <ying.huang@intel.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Nhat Pham <nphamcs@gmail.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
Date: Wed, 23 Oct 2024 10:59:42 -0700
Message-ID: <CAJD7tkYzcxNbZLRL4DezWC27RqdEcfqr0Eafr2h0bUbacuRSjw@mail.gmail.com>
In-Reply-To: <20241022192451.38138-1-ryncsn@gmail.com>
On Tue, Oct 22, 2024 at 12:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series greatly improves swap allocator performance by reworking
> the locking design and simplifying many code paths.
>
> This is a follow-up to the previous swap cluster allocator series:
> https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> And this series is based on a follow-up fix for the swap cluster
> allocator:
> https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
>
> This is part of the new swap allocator work item discussed in
> Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> "mTHP and swap allocator" discussion at LPC 2024.
>
> The previous series introduced a fully cluster-based allocation
> algorithm; this series completely gets rid of the old allocation path
> and makes the allocator avoid grabbing si->lock unless needed. This
> brings a huge performance gain and removes the slot cache on the
> freeing path.
>
> Currently, swap locking mainly consists of two locks: the cluster lock
> (ci->lock) and the device lock (si->lock). The device lock is widely
> used to protect many things, making it the main bottleneck for swap.
>
> The cluster lock is much more fine-grained, so it is best to use
> ci->lock instead of si->lock as much as possible, as sketched below.
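>
> As a rough sketch of the two granularities (heavily simplified from
> the structs in include/linux/swap.h, with most fields omitted):
>
>     /* One per swap device: guards cluster lists, the plist, etc. */
>     struct swap_info_struct {
>             spinlock_t lock;        /* si->lock: coarse, contended */
>             /* ... */
>     };
>
>     /* One per cluster of SWAPFILE_CLUSTER swap slots. */
>     struct swap_cluster_info {
>             spinlock_t lock;        /* ci->lock: fine-grained */
>             /* ... */
>     };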
>
> `perf lock` shows this issue clearly. Doing a Linux kernel build
> using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg
> and 4K pages), the result of "perf lock contention -ab sleep 3" is:
>
> contended total wait max wait avg wait type caller
>
> 34948 53.63 s 7.11 ms 1.53 ms spinlock free_swap_and_cache_nr+0x350
> 16569 40.05 s 6.45 ms 2.42 ms spinlock get_swap_pages+0x231
> 11191 28.41 s 7.03 ms 2.54 ms spinlock swapcache_free_entries+0x59
> 4147 22.78 s 122.66 ms 5.49 ms spinlock page_vma_mapped_walk+0x6f3
> 4595 7.17 s 6.79 ms 1.56 ms spinlock swapcache_free_entries+0x59
> 406027 2.74 s 2.59 ms 6.74 us spinlock list_lru_add+0x39
> ...snip...
>
> The top 5 callers are all users of si->lock; their total wait time
> sums to several minutes within the 3-second time window.
>
> Following the new allocator design, many operations don't need to
> touch si->lock at all. We only need to take si->lock when doing
> operations across multiple clusters (e.g. changing the cluster lists);
> other operations only need to take ci->lock. So ideally the allocator
> should always take ci->lock first, then, if needed, take si->lock. But
> for historical reasons, ci->lock is by design used inside si->lock,
> causing lock inversion if we simply try to acquire si->lock after
> acquiring ci->lock.
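>
> A minimal sketch of the inversion (illustrative pseudo-code only, not
> the actual kernel functions):
>
>     /* Legacy order: si->lock outside, ci->lock nested inside. */
>     spin_lock(&si->lock);
>     ci = lock_cluster(si, offset);  /* acquires ci->lock */
>     /* ... allocate from the cluster ... */
>
>     /*
>      * Desired order: ci->lock first, si->lock only when the cluster
>      * lists must change.  Mixing this with the legacy order is a
>      * classic ABBA deadlock, hence the full rework of si->lock users:
>      */
>     spin_lock(&ci->lock);
>     spin_lock(&si->lock);           /* inverted vs. the legacy path */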
>
> This series audits all si->lock usage, simplifies legacy code, and
> eliminates usage of si->lock as much as possible by introducing new
> designs based on the new cluster allocator.
>
> The old HDD allocation code is removed, and the cluster allocator is
> adapted with small changes for HDD usage; testing looks OK.
>
> This also removes the slot cache from the freeing path. Performance
> is better without it, and this enables other cleanups and
> optimizations as discussed before:
> https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
>
> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` in the same test as above:
>
> contended total wait max wait avg wait type caller
> ... snip ...
> 91 204.62 us 4.51 us 2.25 us spinlock cluster_move+0x2e
> ... snip ...
> 47 125.62 us 4.47 us 2.67 us spinlock cluster_move+0x2e
> ... snip ...
> 23 63.15 us 3.95 us 2.74 us spinlock cluster_move+0x2e
> ... snip ...
> 17 41.26 us 4.58 us 2.43 us spinlock cluster_isolate_lock+0x1d
> ... snip ...
>
> cluster_move and cluster_isolate_lock are now basically the only
> users of si->lock; the performance gain is huge, with reduced LOC.
>
> Tests
> ===
>
> Build kernel with defconfig on tmpfs with ZRAM as swap:
> ---
>
> Running a test matrix which is scaled up progressively for an
> intuitive result. The tests are run on top of tmpfs, using a memory
> cgroup for memory limitation, on a 48c96t system.
>
> 12 test runs for each case. It can be seen clearly that the
> performance gain grows as the number of concurrent jobs increases,
> and performance is better even with low concurrency.
>
> make -j<NR> | System Time (seconds) | Total Time (seconds)
> (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> With 4k pages only:
> 6 / 192M / 3G | 5258 / 5235 / -0.3% | 1420 / 1414 / -0.3%
> 12 / 256M / 4G | 5518 / 5337 / -3.3% | 758 / 742 / -2.1%
> 24 / 384M / 5G | 7091 / 5766 / -18.7% | 476 / 422 / -11.3%
> 48 / 768M / 7G | 11139 / 5831 / -47.7% | 330 / 221 / -33.0%
> 96 / 1.5G / 10G | 21303 / 11353 / -46.7% | 283 / 180 / -36.4%
> With 64k mTHP:
> 24 / 512M / 5G    | 5104 / 4641 / -9.1%       | 376 / 358 / -4.8%
> 48 / 1G / 7G      | 8693 / 4662 / -46.4%      | 257 / 176 / -31.5%
> 96 / 2G / 10G | 17056 / 10263 / -39.8% | 234 / 169 / -27.8%
>
> With a more aggressive setup, both the performance and the
> fragmentation rate are clearly better:
>
> time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 73578.30, Real time: 864.05
> time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> After: (-54.7% sys time, -49.3% real time)
> Sys time: 33314.76, Real time: 437.67
>
> time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test runs)
> Before:
> Sys time: 74044.85, Real time: 846.51
> hugepages-64kB/stats/swpout: 1735216
> hugepages-64kB/stats/swpout_fallback: 430333
> After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> Sys time: 35958.87, Real time: 442.69
> hugepages-64kB/stats/swpout: 1866267
> hugepages-64kB/stats/swpout_fallback: 158330
>
> There is up to a 54.7% improvement for the kernel build test, and a
> lower fragmentation rate. The performance improvement should be even
> larger for micro benchmarks.
>
> Build kernel with tinyconfig on tmpfs with HDD as swap:
> ---
>
> This test is similar to the above, but the HDD test is very noisy and
> slow with huge deviation, so tinyconfig is used instead, taking the
> median result of 3 test runs, which looks OK:
>
> Before this series:
> 114.44user 29.11system 39:42.90elapsed 6%CPU
> 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>
> After this series:
> 113.90user 23.81system 38:11.77elapsed 6%CPU
> 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>
> Single thread SWAP:
> ---
>
> Sequential swap should also be slightly faster, as a lot of
> unnecessary parts were removed. Tested using a micro benchmark that
> swaps out/in 4G of zero-filled memory using ZRAM, 10 test runs:
>
> Swapout Before (avg. 3359304):
> 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>
> Swapin Before (avg. 1928698):
> 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>
> Swapout After (avg. 3347511, -0.4%):
> 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>
> Swapin After (avg. 1922290, -0.3%):
> 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913
Unfortunately I don't have the time to go through this series, but I
just wanted to say that this is awesome work, Kairui.
Selfishly, I especially like cleaning up the swap slot freeing path,
and having a centralized freeing path with a single call to
zswap_invalidate().
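(To illustrate what I mean, a hypothetical sketch; the helper name is
made up, and I'm assuming the zswap_invalidate(swp_entry_t) signature,
which has shifted across kernel versions, so the actual call site in
the series may look different:)

    /* One central place frees a swap entry and drops any compressed
     * copy, instead of scattering zswap_invalidate() over call sites. */
    static void free_swap_entry_example(swp_entry_t entry)
    {
            zswap_invalidate(entry);        /* drop the zswap copy */
            /* ... then return the slot to its cluster ... */
    }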
Thanks for doing this :)