From: Chris Li <chrisl@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Kairui Song <ryncsn@gmail.com>,
linux-mm@kvack.org, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>, Nhat Pham <nphamcs@gmail.com>,
Yosry Ahmed <yosry.ahmed@linux.dev>,
David Hildenbrand <david@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Youngjun Park <youngjun.park@lge.com>,
Hugh Dickins <hughd@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Ying Huang <ying.huang@linux.alibaba.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
linux-kernel@vger.kernel.org, Kairui Song <kasong@tencent.com>,
linux-pm@vger.kernel.org
Subject: Re: [PATCH v3 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags
Date: Sat, 29 Nov 2025 21:07:40 +0400
Message-ID: <CACePvbUK6uSDtz0QkBq-eqzQ_Hi9+t1gthGGt+ok7xdZtO1V8Q@mail.gmail.com>
In-Reply-To: <20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com>
Hi Andrew,
Can you add this swap table phase II series to mm-unstable for more
exposure? The series is now at V3 and looks OK overall, but I have not
finished reviewing all of the patches yet. I will keep you posted
once the series is fully reviewed.
Thanks
Chris
On Mon, Nov 24, 2025 at 11:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code and
> the special swap flag bits, including SWAP_HAS_CACHE, along with many
> historical issues. Performance is about 20% better for some workloads, such
> as Redis with persistence. It also cleans up the code to prepare for
> later phases. Some patches are from a previously posted series.
>
> Swap cache bypassing, and swap synchronization in general, had many
> issues. Some were handled with workarounds, and some are still there [1]. To
> resolve them in a clean way, one good solution is to always use the swap
> cache as the synchronization layer [2]. So we have to remove the swap
> cache bypass swap-in path first. That was not very doable before due to
> performance issues, but now, combined with the swap table, removing
> the swap cache bypass path instead improves performance, so
> there is no reason to keep it.
>
> Now we can rework the swap entry and cache synchronization following
> the new design. Swap cache synchronization relied heavily on
> SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> of special swap map bits and the related workarounds, we get a cleaner code
> base and prepare for merging the swap count into the swap table in the
> next step.
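
To make the "swap cache as the synchronization layer" idea above a bit more
concrete, here is a toy user-space model. It is not kernel code, and every
name in it (ToySwapCache, swapin, slow_read) is invented for illustration.
It only shows the general pattern: when every swap-in goes through one
shared cache, concurrent faults on the same entry result in a single device
read, and the other faulters wait on the cached folio.

```python
import threading
import time


class ToySwapCache:
    """Toy stand-in for a swap cache: swap entry -> cached folio."""

    def __init__(self):
        self._lock = threading.Lock()
        self._folios = {}
        self.device_reads = 0

    def swapin(self, entry, read_from_device):
        # Look up or insert under the lock: the first faulter becomes the
        # owner of the new folio, later faulters find it already cached.
        with self._lock:
            folio = self._folios.get(entry)
            owner = folio is None
            if owner:
                folio = {"ready": threading.Event(), "data": None}
                self._folios[entry] = folio

        if owner:
            # Only the owner touches the (slow) device.
            folio["data"] = read_from_device(entry)
            with self._lock:
                self.device_reads += 1
            folio["ready"].set()
        else:
            # Everyone else waits for the owner to finish the read.
            folio["ready"].wait()
        return folio["data"]


def slow_read(entry):
    time.sleep(0.01)  # pretend this is device I/O
    return f"page-for-{entry}"


cache = ToySwapCache()
seen = []
seen_lock = threading.Lock()


def fault(entry):
    data = cache.swapin(entry, slow_read)
    with seen_lock:
        seen.append(data)


# Eight concurrent "page faults" on the same swap entry.
threads = [threading.Thread(target=fault, args=(42,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(cache.device_reads, set(seen))
```

The model leaves out error handling, folio freeing and swap counts on
purpose; it is only meant to show why a single lookup-or-insert point
removes the need for a separate "has cache" flag in this toy setting.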
>
> And swap_map is now only used for swap count, so in the next phase,
> swap_map can be merged into the swap table, which will clean up more
> things and start to reduce the static memory usage. Removal of
> swap_cgroup_ctrl is also doable, but needs to be done after we also
> simplify the allocation of swapin folios: always use the new
> swap_cache_alloc_folio helper so the accounting will also be managed by
> the swap layer by then.
>
> Test results:
>
> Redis / Valkey bench:
> =====================
>
> Testing on an ARM64 VM with 1.5G memory:
> Server: valkey-server --maxmemory 2560M
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
>          no persistence           with BGSAVE
> Before:  460475.84 RPS            311591.19 RPS
> After:   451943.34 RPS (-1.9%)    371379.06 RPS (+19.2%)
>
> Testing on an x86_64 VM with 4G memory (system components take about 2G):
> Server:
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
>          no persistence           with BGSAVE
> Before:  306044.38 RPS            102745.88 RPS
> After:   309645.44 RPS (+1.2%)    125313.28 RPS (+22.0%)
>
> The performance is a lot better when persistence is applied. This should
> apply to many other workloads that involve sharing memory and COW. A
> slight performance drop was observed for the ARM64 Redis test: We are
> still using swap_map to track the swap count, which is causing redundant
> cache and CPU overhead and is not very performance-friendly for some
> arches. This will be improved once we merge the swap map into the swap
> table (as already demonstrated previously [3]).
>
> vm-scalability
> ==============
> usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> simulated PMEM as swap), average of 6 test runs:
>
>                             Before:         After:
> System time:                282.22s         283.47s
> Sum Throughput:             5677.35 MB/s    5688.78 MB/s
> Single process Throughput:  176.41 MB/s     176.23 MB/s
> Free latency:               518477.96 us    521488.06 us
>
> Which is almost identical.
>
> Build kernel test:
> ==================
> Test using ZRAM as SWAP, make -j48, defconfig, on an x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test runs:
>
>               Before:     After:
> System time:  1379.91s    1364.22s (-1.14%)
>
> Test using ZSWAP with NVME SWAP, make -j48, defconfig, on an x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test runs:
>
>               Before:     After:
> System time:  1822.52s    1803.33s (-1.05%)
>
> Which is almost identical.
>
> MySQL:
> ======
> sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> 512M memory cgroup, buffer pool set to 3G, 3 test runs and 180s warm-up).
>
> Before: 318162.18 qps
> After:  318512.01 qps (+0.11%)
>
> In conclusion, the results look better or identical in most cases,
> and are especially better for workloads with swap count > 1 on SYNC_IO
> devices, with about a 20% gain in the tests above. The next phases will
> start to merge the swap count into the swap table and reduce memory usage.
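
The percentages in the tables above are relative changes against the
"Before" column. A minimal sketch for recomputing them from the raw figures
(the helper name relative_change is made up here; the numbers are copied
from the tables in the cover letter):

```python
def relative_change(before, after):
    """Return the change from before to after as a percentage of before."""
    return (after - before) / before * 100.0


# (label, before, after), copied from the benchmark tables
measurements = [
    ("ARM64 Redis, no persistence (RPS)", 460475.84, 451943.34),
    ("ARM64 Redis, with BGSAVE (RPS)", 311591.19, 371379.06),
    ("x86_64 Redis, no persistence (RPS)", 306044.38, 309645.44),
    ("x86_64 Redis, with BGSAVE (RPS)", 102745.88, 125313.28),
    ("Kernel build, ZRAM, system time (s)", 1379.91, 1364.22),
    ("Kernel build, ZSWAP + NVME, system time (s)", 1822.52, 1803.33),
    ("MySQL sysbench (qps)", 318162.18, 318512.01),
]

for label, before, after in measurements:
    print(f"{label}: {relative_change(before, after):+.2f}%")
```

For throughput numbers (RPS, qps) a positive change is an improvement; for
system time a negative change is an improvement.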
>
> One more gain here is that we now have better support for THP swapin.
> Previously, THP swapin was tied to swap cache bypassing, which
> only works for single-mapped folios. Removing the bypassing path also
> enables THP swapin for all folios. THP swapin is still limited to
> SYNC_IO devices; that limitation can be removed later.
>
> This may cause more serious THP thrashing for certain workloads, but that is
> not an issue caused by this series. It is a common THP issue that we should
> resolve separately.
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> Still basically the same as V2: mostly comment updates and build fixes, plus
> a rebase to resolve conflicts and make review and testing easier. Stress
> tests and performance tests look good and are basically the same as before.
>
> Changes in v3:
> - Improve and update comments [ Barry Song, YoungJun Park, Chris Li ]
> - Simplify the changes to cluster_reclaim_range a bit, as YoungJun pointed
> out that the change looked confusing.
> - Fix a few typos I found during self review.
> - Fix a few build errors and warnings.
> - Link to v2: https://lore.kernel.org/r/20251117-swap-table-p2-v2-0-37730e6ea6d5@tencent.com
>
> Changes in v2:
> - Rebased on latest mm-new to resolve conflicts, also applicable to
> mm-unstable.
> - Improve comments and commit messages in multiple commits, many thanks to
> [ Barry Song, YoungJun Park, Yosry Ahmed ]
> - Fix cluster usable check in allocator [ YoungJun Park ]
> - Improve cover letter [ Chris Li ]
> - Collect Reviewed-by [ Yosry Ahmed ]
> - Fix a few build warnings and issues from build bot.
> - Link to v1: https://lore.kernel.org/r/20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com
>
> ---
> Kairui Song (18):
> mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
> mm, swap: split swap cache preparation loop into a standalone helper
> mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
> mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
> mm, swap: simplify the code and reduce indention
> mm, swap: free the swap cache after folio is mapped
> mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
> mm, swap: swap entry of a bad slot should not be considered as swapped out
> mm, swap: consolidate cluster reclaim and usability check
> mm, swap: split locked entry duplicating into a standalone helper
> mm, swap: use swap cache as the swap in synchronize layer
> mm, swap: remove workaround for unsynchronized swap map cache state
> mm, swap: cleanup swap entry management workflow
> mm, swap: add folio to swap cache directly on allocation
> mm, swap: check swap table directly for checking cache
> mm, swap: clean up and improve swap entries freeing
> mm, swap: drop the SWAP_HAS_CACHE flag
> mm, swap: remove no longer needed _swap_info_get
>
> Nhat Pham (1):
> mm/shmem, swap: remove SWAP_MAP_SHMEM
>
> arch/s390/mm/gmap_helpers.c | 2 +-
> arch/s390/mm/pgtable.c | 2 +-
> include/linux/swap.h | 77 ++--
> kernel/power/swap.c | 10 +-
> mm/madvise.c | 2 +-
> mm/memory.c | 276 +++++++-------
> mm/rmap.c | 7 +-
> mm/shmem.c | 75 ++--
> mm/swap.h | 70 +++-
> mm/swap_state.c | 338 +++++++++++------
> mm/swapfile.c | 856 +++++++++++++++++++-------------------------
> mm/userfaultfd.c | 10 +-
> mm/vmscan.c | 1 -
> mm/zswap.c | 4 +-
> 14 files changed, 854 insertions(+), 876 deletions(-)
> ---
> base-commit: 1fa8c5771a65fc5a56f6e39825561cdc8fa91e14
> change-id: 20251007-swap-table-p2-7d3086e5c38a
>
> Best regards,
> --
> Kairui Song <kasong@tencent.com>
>
Thread overview: 28+ messages
2025-11-24 19:13 Kairui Song
2025-11-24 19:13 ` [PATCH v3 01/19] mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
2025-11-24 19:13 ` [PATCH v3 02/19] mm, swap: split swap cache preparation loop into a standalone helper Kairui Song
2025-11-24 19:13 ` [PATCH v3 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
2025-11-24 19:13 ` [PATCH v3 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
2025-11-24 19:13 ` [PATCH v3 05/19] mm, swap: simplify the code and reduce indention Kairui Song
2025-11-24 19:13 ` [PATCH v3 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
2025-11-24 19:13 ` [PATCH v3 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-12-02 7:34 ` Baolin Wang
2025-12-03 5:33 ` Kairui Song
2025-12-04 12:30 ` Baolin Wang
2025-11-24 19:13 ` [PATCH v3 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
2025-12-02 7:04 ` Baolin Wang
2025-11-24 19:13 ` [PATCH v3 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Kairui Song
2025-11-24 19:13 ` [PATCH v3 10/19] mm, swap: consolidate cluster reclaim and usability check Kairui Song
2025-11-24 19:13 ` [PATCH v3 11/19] mm, swap: split locked entry duplicating into a standalone helper Kairui Song
2025-11-24 19:13 ` [PATCH v3 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
2025-11-24 19:13 ` [PATCH v3 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
2025-11-24 19:13 ` [PATCH v3 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
2025-11-25 18:11 ` Rafael J. Wysocki
2025-11-24 19:13 ` [PATCH v3 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-11-24 19:13 ` [PATCH v3 16/19] mm, swap: check swap table directly for checking cache Kairui Song
2025-11-24 19:14 ` [PATCH v3 17/19] mm, swap: clean up and improve swap entries freeing Kairui Song
2025-11-24 19:14 ` [PATCH v3 18/19] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
2025-11-24 19:14 ` [PATCH v3 19/19] mm, swap: remove no longer needed _swap_info_get Kairui Song
2025-11-29 17:07 ` Chris Li [this message]
2025-11-29 18:18 ` [PATCH v3 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Andrew Morton
2025-11-30 20:44 ` Chris Li