From: Kairui Song
Subject: [PATCH v4 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags
Date: Fri, 05 Dec 2025 03:29:08 +0800
Message-Id: <20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
 Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
 Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes,
 "Matthew Wilcox (Oracle)", linux-kernel@vger.kernel.org, Kairui Song,
 linux-pm@vger.kernel.org, "Rafael J. Wysocki (Intel)"

This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code
and special swap flag bits including SWAP_HAS_CACHE, along with many
historical issues. The performance is about 20% better for some
workloads, like Redis with persistence.
This also cleans up the code to prepare for later phases. Some patches
are from a previously posted series.

Swap cache bypassing and swap synchronization in general had many
issues. Some were worked around, and some are still there [1]. To
resolve them cleanly, one good solution is to always use the swap cache
as the synchronization layer [2], so we have to remove the swap cache
bypass swap-in path first. That was not practical before due to
performance issues, but combined with the swap table, removing the
bypass path now improves performance, so there is no reason to keep it.

Now we can rework the swap entry and cache synchronization following
the new design. Swap cache synchronization relied heavily on
SWAP_HAS_CACHE, which is the cause of many issues. By dropping the
special swap map bits and the related workarounds, we get a cleaner code
base and prepare for merging the swap count into the swap table in the
next step.

swap_map is now only used for the swap count, so in the next phase it
can be merged into the swap table, which will clean up more things and
start to reduce the static memory usage. Removal of swap_cgroup_ctrl is
also doable, but needs to be done after we also simplify the allocation
of swapin folios: always use the new swap_cache_alloc_folio helper so
the accounting will also be managed by the swap layer by then.
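
As a rough illustration of "always use the swap cache as the
synchronization layer", here is a toy userspace model in C. It is not
kernel code and does not reflect the actual implementation in this
series: struct toy_swap, struct toy_folio and toy_swapin() are
hypothetical names made up for the sketch, it is single-threaded, and
there is no locking. It only shows the idea that every swap-in goes
through the cache, so whoever installs the folio first does the device
read and later swap-ins of the same slot reuse that folio.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 8

/* Hypothetical stand-in for a folio: a payload and a reference count. */
struct toy_folio {
	int data;
	int refs;
};

/* Toy swap device state: one cache pointer and one swap count per slot. */
struct toy_swap {
	struct toy_folio *cache[NR_SLOTS]; /* swap cache: slot -> folio or NULL */
	unsigned char count[NR_SLOTS];     /* references still swapped out */
	int backing[NR_SLOTS];             /* pretend on-device contents */
	struct toy_folio pool[NR_SLOTS];   /* static storage, one folio per slot */
	int device_reads;                  /* number of simulated device reads */
};

/*
 * Swap in one reference to @slot. Every caller goes through the cache:
 * the first one installs a folio and "reads the device", later ones
 * reuse the cached folio. Returns NULL if nothing is swapped out.
 */
static struct toy_folio *toy_swapin(struct toy_swap *s, size_t slot)
{
	struct toy_folio *folio;

	assert(slot < NR_SLOTS);
	if (s->count[slot] == 0)
		return NULL;

	folio = s->cache[slot];
	if (!folio) {
		/* Cache miss: install the folio first, then fill it. */
		folio = &s->pool[slot];
		folio->data = s->backing[slot];
		folio->refs = 0;
		s->device_reads++;
		s->cache[slot] = folio;
	}

	folio->refs++;
	s->count[slot]--;
	/* Last swapped-out reference is gone: drop the cache entry. */
	if (s->count[slot] == 0)
		s->cache[slot] = NULL;
	return folio;
}
```

In this toy, a slot with a swap count of 2 that is swapped in twice
yields the same folio both times and only one simulated device read.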

Test results:

Redis / Valkey bench:
=====================

Testing on an ARM64 VM with 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 460475.84 RPS               311591.19 RPS
After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)

Testing on an x86_64 VM with 4G memory (system components take about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 306044.38 RPS               102745.88 RPS
After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)

The performance is a lot better when persistence is applied. This should
apply to many other workloads that involve sharing memory and COW. A
slight performance drop was observed for the ARM64 Redis test without
persistence: we are still using swap_map to track the swap count, which
causes redundant cache and CPU overhead and is not very
performance-friendly for some arches. This will be improved once we
merge the swap map into the swap table (as already demonstrated
previously [3]).

vm-scalability
==============
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
simulated PMEM as swap), average result of 6 test runs:

                           Before:         After:
System time:               282.22s         283.47s
Sum Throughput:            5677.35 MB/s    5688.78 MB/s
Single process Throughput: 176.41 MB/s     176.23 MB/s
Free latency:              518477.96 us    521488.06 us

Which is almost identical.

Build kernel test:
==================
Test using ZRAM as SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, avg of 32 test runs:

                Before       After:
System time:    1379.91s     1364.22s (-1.14%)

Test using ZSWAP with NVME SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, avg of 32 test runs:

                Before       After:
System time:    1822.52s     1803.33s (-1.05%)

Which is almost identical.
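
For anyone re-checking the relative changes quoted in these results, a
minimal C helper is enough. pct_change() is a name made up for this
sketch and is not part of any benchmark tool; it just computes
(after - before) / before as a percentage from the raw figures.

```c
#include <assert.h>

/* Relative change from @before to @after, in percent. */
static double pct_change(double before, double after)
{
	/* A zero or negative baseline makes the ratio meaningless. */
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For example, pct_change(311591.19, 371379.06) evaluates to roughly 19.2,
in line with the ARM64 BGSAVE gain reported above.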

MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
--table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
512M memory cgroup, buffer pool set to 3G, 3 test runs and 180s warm up).

Before: 318162.18 qps
After:  318512.01 qps (+0.11%)

In conclusion, the results look better or identical for most cases, and
are especially better for workloads with swap count > 1 on SYNC_IO
devices, with about 20% gain in the above tests. The next phases will
start to merge the swap count into the swap table and reduce memory
usage.

One more gain here is that we now have better support for THP swapin.
Previously, THP swapin was bound to swap cache bypassing, which only
works for single-mapped folios. Removing the bypass path also enables
THP swapin for all folios. THP swapin is still limited to SYNC_IO
devices; the limitation can be removed later. This may cause more
serious THP thrashing for certain workloads, but that is not an issue
caused by this series; it is a common THP issue we should resolve
separately.

Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]

Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
Changes in v4:
Based on mm-unstable, basically the same as v3, mostly comment updates
and more sanity checks. Stress tests and performance tests look good and
basically the same as before.
- Rebase on latest mm-unstable, should also be mergeable with mm-new.
- Update the shmem update commit message as suggested by, and reviewed
  by [ Baolin Wang ].
- Add a WARN_ON to catch more potential issues and update a few comments.
- Link to v3: https://lore.kernel.org/r/20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com

Changes in v3:
- Improve and update comments [ Barry Song, YoungJun Park, Chris Li ]
- Simplify the changes of cluster_reclaim_range a bit, as YoungJun
  points out the change looked confusing.
- Fix a few typos I found during self review.
- Fix a few build errors and warnings.
- Link to v2: https://lore.kernel.org/r/20251117-swap-table-p2-v2-0-37730e6ea6d5@tencent.com

Changes in v2:
- Rebased on latest mm-new to resolve conflicts, also applicable to
  mm-unstable.
- Improve comments and commit messages in multiple commits, many thanks
  to [ Barry Song, YoungJun Park, Yosry Ahmed ]
- Fix cluster usable check in allocator [ YoungJun Park ]
- Improve cover letter [ Chris Li ]
- Collect Reviewed-by [ Yosry Ahmed ]
- Fix a few build warnings and issues from build bot.
- Link to v1: https://lore.kernel.org/r/20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com

---
Kairui Song (18):
      mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
      mm, swap: split swap cache preparation loop into a standalone helper
      mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
      mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
      mm, swap: simplify the code and reduce indention
      mm, swap: free the swap cache after folio is mapped
      mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
      mm, swap: swap entry of a bad slot should not be considered as swapped out
      mm, swap: consolidate cluster reclaim and usability check
      mm, swap: split locked entry duplicating into a standalone helper
      mm, swap: use swap cache as the swap in synchronize layer
      mm, swap: remove workaround for unsynchronized swap map cache state
      mm, swap: cleanup swap entry management workflow
      mm, swap: add folio to swap cache directly on allocation
      mm, swap: check swap table directly for checking cache
      mm, swap: clean up and improve swap entries freeing
      mm, swap: drop the SWAP_HAS_CACHE flag
      mm, swap: remove no longer needed _swap_info_get

Nhat Pham (1):
      mm/shmem, swap: remove SWAP_MAP_SHMEM

 arch/s390/mm/gmap_helpers.c |   2 +-
 arch/s390/mm/pgtable.c      |   2 +-
 include/linux/swap.h        |  77 ++--
 kernel/power/swap.c         |  10 +-
 mm/madvise.c                |   2 +-
 mm/memory.c                 | 276 +++++++-------
 mm/rmap.c                   |   7 +-
 mm/shmem.c                  |  75 ++--
 mm/swap.h                   |  70 +++-
 mm/swap_state.c             | 338 +++++++++++------
 mm/swapfile.c               | 864 ++++++++++++++++++++------------------------
 mm/userfaultfd.c            |  10 +-
 mm/vmscan.c                 |   1 -
 mm/zswap.c                  |   4 +-
 14 files changed, 862 insertions(+), 876 deletions(-)
---
base-commit: 92440888882ad21791a07ff8809807ef1d2c2a42
change-id: 20251007-swap-table-p2-7d3086e5c38a

Best regards,
-- 
Kairui Song