* [PATCH v3 00/20] Virtual Swap Space
@ 2026-02-08 21:58 Nhat Pham
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
` (21 more replies)
0 siblings, 22 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Changelog:
* RFC v2 -> v3:
* Implement a cluster-based allocation algorithm for virtual swap
slots, inspired by Kairui Song and Chris Li's implementation, as
well as Johannes Weiner's suggestions. This eliminates the lock
contention issues on the virtual swap layer.
* Re-use swap table for the reverse mapping.
* Remove CONFIG_VIRTUAL_SWAP.
* Reduce the size of the swap descriptor from 48 bytes to 24
bytes, i.e. another 50% reduction in memory overhead from v2.
* Remove swap cache and zswap tree and use the swap descriptor
for this.
* Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
(one for allocated slots, and one for bad slots).
* Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
* Update cover letter to include new benchmark results and discussion
on overhead in various cases.
* RFC v1 -> RFC v2:
* Use a single atomic type (swap_refs) for reference counting
purpose. This brings the size of the swap descriptor from 64 B
down to 48 B (25% reduction). Suggested by Yosry Ahmed.
* Zeromap bitmap is removed in the virtual swap implementation.
This saves one bit per physical swapfile slot.
* Rearrange the patches and the code change to make things more
reviewable. Suggested by Johannes Weiner.
* Update the cover letter a bit.
This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).
This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on par with
the baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.
I. Motivation
Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
disk space and swapoff is rare.
However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
mobile and embedded devices), users cannot adopt zswap, and are forced
to use zram. This is confusing for users, and creates extra burden
for developers, who have to develop and maintain similar features for
two separate swap backends (writeback, cgroup charging, THP support,
etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
we have swapfiles on the order of tens to hundreds of GBs, which are
mostly unused and only exist to enable zswap and the zero-filled page
swap optimization.
* Tying zswap (and more generally, other in-memory swap backends) to
the current physical swapfile infrastructure makes zswap implicitly
statically sized. This does not make sense, as unlike disk swap, in
which we consume a limited resource (disk space or swapfile space) to
save another resource (memory), zswap consumes the same resource it is
saving (memory). The more we zswap, the more memory we have available,
not less. We are not rationing a limited resource when we limit
the size of the zswap pool, but rather we are capping the resource
(memory) saving potential of zswap. Under memory pressure, using
more zswap is almost always better than the alternative (disk IOs, or
even worse, OOMs), and dynamically sizing the zswap pool on demand
allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
significant challenges, because the sysadmin has to prescribe how
much swap is needed a priori, for each combination of
(memory size x disk space x workload usage). It is even more
complicated when we take into account the variance of memory
compression, which changes the reclaim dynamics (and as a result,
swap space size requirement). The problem is further exacerbated for
users who rely on swap utilization (and exhaustion) as an OOM signal.
All of these factors make it very difficult to configure the swapfile
for zswap: too small of a swapfile and we risk preventable OOMs and
limit the memory saving potential of zswap; too big of a swapfile
and we waste disk space and memory due to swap metadata overhead.
This dilemma becomes more drastic in high memory systems, which can
have up to TBs worth of memory.
Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also achieves dynamic sizing of the
swap space to maximize the memory saving potential while reducing
operational and static memory overhead.
Finally, any swap redesign should support efficient backend transfer,
i.e. without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
Johannes (from [14]): "Combining compression with disk swap is
extremely powerful, because it dramatically reduces the worst aspects
of both: it reduces the memory footprint of compression by shedding
the coldest data to disk; it reduces the IO latencies and flash wear
of disk swap through the writeback cache. In practice, this reduces
*average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
and expensive in the current design, precisely because we are storing
an encoding of the backend positional information in the page table,
which requires a full page table walk to remove these references.
II. High Level Design Overview
To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:
struct swp_desc {
union {
swp_slot_t slot; /* 0 8 */
struct zswap_entry * zswap_entry; /* 0 8 */
}; /* 0 8 */
union {
struct folio * swap_cache; /* 8 8 */
void * shadow; /* 8 8 */
}; /* 8 8 */
unsigned int swap_count; /* 16 4 */
unsigned short memcgid:16; /* 20: 0 2 */
bool in_swapcache:1; /* 22: 0 1 */
/* Bitfield combined with previous fields */
enum swap_type type:2; /* 20:17 4 */
/* size: 24, cachelines: 1, members: 6 */
/* bit_padding: 13 bits */
/* last cacheline: 24 bytes */
};
(output from pahole).
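To make the resolution step concrete, here is a minimal, purely
illustrative sketch of how a swapin path could branch on the descriptor.
The enum values and helpers (VSWAP_*, zswap_load_entry(),
swap_read_slot()) are hypothetical names for this example, not the
actual API introduced by the series:

	/* Illustrative only: hypothetical names, locking omitted. */
	static int vswap_resolve(struct swp_desc *desc, struct folio *folio)
	{
		switch (desc->type) {
		case VSWAP_ZERO:
			/* Zero-filled page: no backend to read from. */
			folio_zero_range(folio, 0, folio_size(folio));
			return 0;
		case VSWAP_ZSWAP:
			/* Decompress the copy held in the zswap pool. */
			return zswap_load_entry(desc->zswap_entry, folio);
		case VSWAP_SWAPFILE:
			/* Read from the physical slot on the swapfile. */
			return swap_read_slot(desc->slot, folio);
		case VSWAP_FOLIO:
			/*
			 * Content is still in memory (e.g. after swapoff);
			 * the caller picks it up via desc->swap_cache.
			 */
			return 0;
		}
		return -EINVAL;
	}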
This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
simply associate the virtual swap slot with one of the supported
backends: a zswap entry, a zero-filled swap page, a slot on the
swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
have the virtual swap slot point to the page instead of the on-disk
physical swap slot. No need to perform any page table walking (see the
sketch below).
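As a rough illustration of this swapoff simplification (again with
hypothetical names; locking, reference counting and error handling
omitted):

	static void vswap_swapoff_one(struct swp_desc *desc)
	{
		if (desc->type != VSWAP_SWAPFILE)
			return;

		/* Fault the data back in; the folio lands in desc->swap_cache. */
		vswap_readin(desc);

		/* Release the physical slot and re-point the descriptor. */
		swap_slot_free(desc->slot);
		desc->type = VSWAP_FOLIO;

		/* The PTEs still hold the same virtual slot: no walk needed. */
	}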
The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
is a massive source of static memory overhead. With the new design,
it is only allocated for used clusters.
* the swap tables, which holds the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots to
indicate whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map, which is now replaced by 2 bitmaps,
one for allocated slots and one for bad slots, representing the 3
possible states of a slot on the swapfile: allocated, free, and bad
(see the sketch below).
* the zswap tree.
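A minimal sketch of how two bitmaps suffice to encode the three slot
states (illustrative only, not the series' actual code):

	enum slot_state { SLOT_FREE, SLOT_ALLOCATED, SLOT_BAD };

	static enum slot_state swap_slot_state(unsigned long *allocated,
					       unsigned long *bad,
					       unsigned int off)
	{
		if (test_bit(off, bad))
			return SLOT_BAD;
		if (test_bit(off, allocated))
			return SLOT_ALLOCATED;
		return SLOT_FREE;
	}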
So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
new indirection pointer neatly replaces the existing zswap tree.
We really only incur less than one word of overhead for the swap count
blow-up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design imposes fewer than 3 words
of memory overhead per entry. However, as noted above, this overhead
is only incurred for actively used swap entries, whereas in the current
design the overhead is static (including the swap cgroup array, for
example).
The primary victim of this overhead will be zram users. However, as
zswap now no longer takes up disk space, zram users can consider
switching to zswap (which, as a bonus, has a lot of useful features
out of the box, such as cgroup tracking, dynamic zswap pool sizing,
LRU-ordering writeback, etc.).
For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB
So even in the worst case scenario for virtual swap, i.e. when we
somehow have an oracle to correctly size the swapfile for the zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)
In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
were previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.
Doing the same math for disk swap, which is the worst case for
virtual swap in terms of swap backends:
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB
The added overhead is 170 MB, which is 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.
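For reference, the figures above are consistent with the following
approximate per-entry costs (a reconstruction from the structures listed
in section II, not an exact accounting):

	Old design: ~3.125 B/slot static (1 B swap_map + 2 B swap_cgroup +
	            1 bit zeromap), plus 8 B per used entry (swap table),
	            plus another 8 B per used zswap entry (zswap tree).
	Vswap:      24 B per used entry (descriptor) + ~1 bit (allocation
	            bitmap); plus, only when a physical swapfile is present,
	            8 B per used entry (physical -> virtual rmap) and 2
	            bits/slot static (allocated + bad bitmaps).

	E.g. at 100% zswap usage:
	    old:    25 MB + 8,388,608 * 16 B     ~= 153 MB
	    vswap:          8,388,608 * 24.125 B ~= 193 MB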
Please see the attached patches for more implementation details.
III. Usage and Benchmarking
This patch series introduces no new syscalls or userspace APIs. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.
To measure the performance of the new implementation, I have run the
following benchmarks:
1. Kernel building: 52 workers (one per processor), memory.max = 3G.
Using zswap as the backend:
Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s
Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s
We actually see a slight improvement in systime (by about 1.3%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and the virtual swap allocator is simpler than that of physical
swap.
Using SSD swap as the backend:
Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s
Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s
The performance is neck and neck.
IV. Future Use Cases
While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:
* Multi-tier swapping (as mentioned in [5]), with transparent
transferring (promotion/demotion) of pages across tiers (see [8] and
[9]). Similar to swapoff, with the old design we would need to
perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
Huang in [6]).
* Mixed backing THP swapin (see [7]): once we have pinned down the
backing stores of a THP, we can dispatch each range of subpages
to the appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
(see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
physical swap space for pages when they enter the zswap pool, giving
the kernel no flexibility at writeback time. With the virtual swap
implementation, the backends are decoupled, and physical swap space
is allocated on-demand at writeback time, at which point we can make
much smarter decisions: we can batch multiple zswap writeback
operations into a single IO request, allocating contiguous physical
swap slots for that request. We can even perform compressed writeback
(i.e. writing these pages without decompressing them) (see [12]).
V. References
[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
Nhat Pham (20):
mm/swap: decouple swap cache from physical swap infrastructure
swap: rearrange the swap header file
mm: swap: add an abstract API for locking out swapoff
zswap: add new helpers for zswap entry operations
mm/swap: add a new function to check if a swap entry is in swap
cached.
mm: swap: add a separate type for physical swap slots
mm: create scaffolds for the new virtual swap implementation
zswap: prepare zswap for swap virtualization
mm: swap: allocate a virtual swap slot for each swapped out page
swap: move swap cache to virtual swap descriptor
zswap: move zswap entry management to the virtual swap descriptor
swap: implement the swap_cgroup API using virtual swap
swap: manage swap entry lifecycle at the virtual swap layer
mm: swap: decouple virtual swap slot from backing store
zswap: do not start zswap shrinker if there is no physical swap slots
swap: do not unnecesarily pin readahead swap entries
swapfile: remove zeromap bitmap
memcg: swap: only charge physical swap slots
swap: simplify swapoff using virtual swap
swapfile: replace the swap map with bitmaps
Documentation/mm/swap-table.rst | 69 --
MAINTAINERS | 2 +
include/linux/cpuhotplug.h | 1 +
include/linux/mm_types.h | 16 +
include/linux/shmem_fs.h | 7 +-
include/linux/swap.h | 135 ++-
include/linux/swap_cgroup.h | 13 -
include/linux/swapops.h | 25 +
include/linux/zswap.h | 17 +-
kernel/power/swap.c | 6 +-
mm/Makefile | 5 +-
mm/huge_memory.c | 11 +-
mm/internal.h | 12 +-
mm/memcontrol-v1.c | 6 +
mm/memcontrol.c | 142 ++-
mm/memory.c | 101 +-
mm/migrate.c | 13 +-
mm/mincore.c | 15 +-
mm/page_io.c | 83 +-
mm/shmem.c | 215 +---
mm/swap.h | 157 +--
mm/swap_cgroup.c | 172 ---
mm/swap_state.c | 306 +----
mm/swap_table.h | 78 +-
mm/swapfile.c | 1518 ++++-------------------
mm/userfaultfd.c | 18 +-
mm/vmscan.c | 28 +-
mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
mm/zswap.c | 142 +--
29 files changed, 2853 insertions(+), 2485 deletions(-)
delete mode 100644 Documentation/mm/swap-table.rst
delete mode 100644 mm/swap_cgroup.c
create mode 100644 mm/vswap.c
base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 22:26 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (3 more replies)
2026-02-08 21:58 ` [PATCH v3 02/20] swap: rearrange the swap header file Nhat Pham
` (20 subsequent siblings)
21 siblings, 4 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
When we virtualize the swap space, we will manage swap cache at the
virtual swap layer. To prepare for this, decouple swap cache from
physical swap infrastructure.
We will also remove all the swap cache-related helpers of the swap
table. We will keep the rest of the swap table infrastructure, which
will be repurposed to serve as the rmap (physical -> virtual swap
mapping) later.
Note that with this patch, we will move to a single global lock to
synchronize swap cache accesses. This is temporary, as the swap cache
will be re-partitioned into (virtual) swap clusters once we move the
swap cache to the soon-to-be-introduced virtual swap layer.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
Documentation/mm/swap-table.rst | 69 -----------
mm/huge_memory.c | 11 +-
mm/migrate.c | 13 +-
mm/shmem.c | 7 +-
mm/swap.h | 26 ++--
mm/swap_state.c | 205 +++++++++++++++++---------------
mm/swap_table.h | 78 +-----------
mm/swapfile.c | 43 ++-----
mm/vmscan.c | 9 +-
9 files changed, 158 insertions(+), 303 deletions(-)
delete mode 100644 Documentation/mm/swap-table.rst
diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
deleted file mode 100644
index da10bb7a0dc37..0000000000000
--- a/Documentation/mm/swap-table.rst
+++ /dev/null
@@ -1,69 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
-
-==========
-Swap Table
-==========
-
-Swap table implements swap cache as a per-cluster swap cache value array.
-
-Swap Entry
-----------
-
-A swap entry contains the information required to serve the anonymous page
-fault.
-
-Swap entry is encoded as two parts: swap type and swap offset.
-
-The swap type indicates which swap device to use.
-The swap offset is the offset of the swap file to read the page data from.
-
-Swap Cache
-----------
-
-Swap cache is a map to look up folios using swap entry as the key. The result
-value can have three possible types depending on which stage of this swap entry
-was in.
-
-1. NULL: This swap entry is not used.
-
-2. folio: A folio has been allocated and bound to this swap entry. This is
- the transient state of swap out or swap in. The folio data can be in
- the folio or swap file, or both.
-
-3. shadow: The shadow contains the working set information of the swapped
- out folio. This is the normal state for a swapped out page.
-
-Swap Table Internals
---------------------
-
-The previous swap cache is implemented by XArray. The XArray is a tree
-structure. Each lookup will go through multiple nodes. Can we do better?
-
-Notice that most of the time when we look up the swap cache, we are either
-in a swap in or swap out path. We should already have the swap cluster,
-which contains the swap entry.
-
-If we have a per-cluster array to store swap cache value in the cluster.
-Swap cache lookup within the cluster can be a very simple array lookup.
-
-We give such a per-cluster swap cache value array a name: the swap table.
-
-A swap table is an array of pointers. Each pointer is the same size as a
-PTE. The size of a swap table for one swap cluster typically matches a PTE
-page table, which is one page on modern 64-bit systems.
-
-With swap table, swap cache lookup can achieve great locality, simpler,
-and faster.
-
-Locking
--------
-
-Swap table modification requires taking the cluster lock. If a folio
-is being added to or removed from the swap table, the folio must be
-locked prior to the cluster lock. After adding or removing is done, the
-folio shall be unlocked.
-
-Swap table lookup is protected by RCU and atomic read. If the lookup
-returns a folio, the user must lock the folio before use.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21a..21215ac870144 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3783,7 +3783,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
/* Prevent deferred_split_scan() touching ->_refcount */
ds_queue = folio_split_queue_lock(folio);
if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
- struct swap_cluster_info *ci = NULL;
struct lruvec *lruvec;
if (old_order > 1) {
@@ -3826,7 +3825,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
return -EINVAL;
}
- ci = swap_cluster_get_and_lock(folio);
+ swap_cache_lock();
}
/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
@@ -3862,8 +3861,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
* Anonymous folio with swap cache.
* NOTE: shmem in swap cache is not supported yet.
*/
- if (ci) {
- __swap_cache_replace_folio(ci, folio, new_folio);
+ if (folio_test_swapcache(folio)) {
+ __swap_cache_replace_folio(folio, new_folio);
continue;
}
@@ -3901,8 +3900,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
if (do_lru)
unlock_page_lruvec(lruvec);
- if (ci)
- swap_cluster_unlock(ci);
+ if (folio_test_swapcache(folio))
+ swap_cache_unlock();
} else {
split_queue_unlock(ds_queue);
return -EAGAIN;
diff --git a/mm/migrate.c b/mm/migrate.c
index 4688b9e38cd2f..11d9b43dff5d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -571,7 +571,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int expected_count)
{
XA_STATE(xas, &mapping->i_pages, folio->index);
- struct swap_cluster_info *ci = NULL;
struct zone *oldzone, *newzone;
int dirty;
long nr = folio_nr_pages(folio);
@@ -601,13 +600,13 @@ static int __folio_migrate_mapping(struct address_space *mapping,
newzone = folio_zone(newfolio);
if (folio_test_swapcache(folio))
- ci = swap_cluster_get_and_lock_irq(folio);
+ swap_cache_lock_irq();
else
xas_lock_irq(&xas);
if (!folio_ref_freeze(folio, expected_count)) {
- if (ci)
- swap_cluster_unlock_irq(ci);
+ if (folio_test_swapcache(folio))
+ swap_cache_unlock_irq();
else
xas_unlock_irq(&xas);
return -EAGAIN;
@@ -640,7 +639,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
}
if (folio_test_swapcache(folio))
- __swap_cache_replace_folio(ci, folio, newfolio);
+ __swap_cache_replace_folio(folio, newfolio);
else
xas_store(&xas, newfolio);
@@ -652,8 +651,8 @@ static int __folio_migrate_mapping(struct address_space *mapping,
folio_ref_unfreeze(folio, expected_count - nr);
/* Leave irq disabled to prevent preemption while updating stats */
- if (ci)
- swap_cluster_unlock(ci);
+ if (folio_test_swapcache(folio))
+ swap_cache_unlock();
else
xas_unlock(&xas);
diff --git a/mm/shmem.c b/mm/shmem.c
index 79af5f9f8b908..1db97ef2d14eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2133,7 +2133,6 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index,
struct vm_area_struct *vma)
{
- struct swap_cluster_info *ci;
struct folio *new, *old = *foliop;
swp_entry_t entry = old->swap;
int nr_pages = folio_nr_pages(old);
@@ -2166,12 +2165,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
new->swap = entry;
folio_set_swapcache(new);
- ci = swap_cluster_get_and_lock_irq(old);
- __swap_cache_replace_folio(ci, old, new);
+ swap_cache_lock_irq();
+ __swap_cache_replace_folio(old, new);
mem_cgroup_replace_folio(old, new);
shmem_update_stats(new, nr_pages);
shmem_update_stats(old, -nr_pages);
- swap_cluster_unlock_irq(ci);
+ swap_cache_unlock_irq();
folio_add_lru(new);
*foliop = new;
diff --git a/mm/swap.h b/mm/swap.h
index 1bd466da30393..8726b587a5b5d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -199,6 +199,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
/* linux/mm/swap_state.c */
extern struct address_space swap_space __read_mostly;
+void swap_cache_lock_irq(void);
+void swap_cache_unlock_irq(void);
+void swap_cache_lock(void);
+void swap_cache_unlock(void);
+
static inline struct address_space *swap_address_space(swp_entry_t entry)
{
return &swap_space;
@@ -247,14 +252,12 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
*/
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow);
void swap_cache_del_folio(struct folio *folio);
-/* Below helpers require the caller to lock and pass in the swap cluster. */
-void __swap_cache_del_folio(struct swap_cluster_info *ci,
- struct folio *folio, swp_entry_t entry, void *shadow);
-void __swap_cache_replace_folio(struct swap_cluster_info *ci,
- struct folio *old, struct folio *new);
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
+/* Below helpers require the caller to lock the swap cache. */
+void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow);
+void __swap_cache_replace_folio(struct folio *old, struct folio *new);
+void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
void show_swap_cache_info(void);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
@@ -411,21 +414,20 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
-static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow)
{
+ return 0;
}
static inline void swap_cache_del_folio(struct folio *folio)
{
}
-static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
- struct folio *folio, swp_entry_t entry, void *shadow)
+static inline void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
{
}
-static inline void __swap_cache_replace_folio(struct swap_cluster_info *ci,
- struct folio *old, struct folio *new)
+static inline void __swap_cache_replace_folio(struct folio *old, struct folio *new)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 44d228982521e..34c9d9b243a74 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -22,8 +22,8 @@
#include <linux/vmalloc.h>
#include <linux/huge_mm.h>
#include <linux/shmem_fs.h>
+#include <linux/xarray.h>
#include "internal.h"
-#include "swap_table.h"
#include "swap.h"
/*
@@ -41,6 +41,28 @@ struct address_space swap_space __read_mostly = {
.a_ops = &swap_aops,
};
+static DEFINE_XARRAY(swap_cache);
+
+void swap_cache_lock_irq(void)
+{
+ xa_lock_irq(&swap_cache);
+}
+
+void swap_cache_unlock_irq(void)
+{
+ xa_unlock_irq(&swap_cache);
+}
+
+void swap_cache_lock(void)
+{
+ xa_lock(&swap_cache);
+}
+
+void swap_cache_unlock(void)
+{
+ xa_unlock(&swap_cache);
+}
+
static bool enable_vma_readahead __read_mostly = true;
#define SWAP_RA_ORDER_CEILING 5
@@ -86,17 +108,22 @@ void show_swap_cache_info(void)
*/
struct folio *swap_cache_get_folio(swp_entry_t entry)
{
- unsigned long swp_tb;
+ void *entry_val;
struct folio *folio;
for (;;) {
- swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
- swp_cluster_offset(entry));
- if (!swp_tb_is_folio(swp_tb))
+ rcu_read_lock();
+ entry_val = xa_load(&swap_cache, entry.val);
+ if (!entry_val || xa_is_value(entry_val)) {
+ rcu_read_unlock();
return NULL;
- folio = swp_tb_to_folio(swp_tb);
- if (likely(folio_try_get(folio)))
+ }
+ folio = entry_val;
+ if (likely(folio_try_get(folio))) {
+ rcu_read_unlock();
return folio;
+ }
+ rcu_read_unlock();
}
return NULL;
@@ -112,12 +139,14 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
*/
void *swap_cache_get_shadow(swp_entry_t entry)
{
- unsigned long swp_tb;
+ void *entry_val;
+
+ rcu_read_lock();
+ entry_val = xa_load(&swap_cache, entry.val);
+ rcu_read_unlock();
- swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
- swp_cluster_offset(entry));
- if (swp_tb_is_shadow(swp_tb))
- return swp_tb_to_shadow(swp_tb);
+ if (xa_is_value(entry_val))
+ return entry_val;
return NULL;
}
@@ -132,46 +161,58 @@ void *swap_cache_get_shadow(swp_entry_t entry)
* with reference count or locks.
* The caller also needs to update the corresponding swap_map slots with
* SWAP_HAS_CACHE bit to avoid race or conflict.
+ *
+ * Return: 0 on success, negative error code on failure.
*/
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadowp)
{
- void *shadow = NULL;
- unsigned long old_tb, new_tb;
- struct swap_cluster_info *ci;
- unsigned int ci_start, ci_off, ci_end;
+ XA_STATE_ORDER(xas, &swap_cache, entry.val, folio_order(folio));
unsigned long nr_pages = folio_nr_pages(folio);
+ unsigned long i;
+ void *old;
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
- new_tb = folio_to_swp_tb(folio);
- ci_start = swp_cluster_offset(entry);
- ci_end = ci_start + nr_pages;
- ci_off = ci_start;
- ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
- do {
- old_tb = __swap_table_xchg(ci, ci_off, new_tb);
- WARN_ON_ONCE(swp_tb_is_folio(old_tb));
- if (swp_tb_is_shadow(old_tb))
- shadow = swp_tb_to_shadow(old_tb);
- } while (++ci_off < ci_end);
-
folio_ref_add(folio, nr_pages);
folio_set_swapcache(folio);
folio->swap = entry;
- swap_cluster_unlock(ci);
- node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
- lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+ do {
+ xas_lock_irq(&xas);
+ xas_create_range(&xas);
+ if (xas_error(&xas))
+ goto unlock;
+ for (i = 0; i < nr_pages; i++) {
+ VM_BUG_ON_FOLIO(xas.xa_index != entry.val + i, folio);
+ old = xas_load(&xas);
+ if (old && !xa_is_value(old)) {
+ VM_WARN_ON_ONCE_FOLIO(1, folio);
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+ if (shadowp && xa_is_value(old) && !*shadowp)
+ *shadowp = old;
+ xas_store(&xas, folio);
+ xas_next(&xas);
+ }
+ node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+unlock:
+ xas_unlock_irq(&xas);
+ } while (xas_nomem(&xas, gfp));
- if (shadowp)
- *shadowp = shadow;
+ if (!xas_error(&xas))
+ return 0;
+
+ folio_clear_swapcache(folio);
+ folio_ref_sub(folio, nr_pages);
+ return xas_error(&xas);
}
/**
* __swap_cache_del_folio - Removes a folio from the swap cache.
- * @ci: The locked swap cluster.
* @folio: The folio.
* @entry: The first swap entry that the folio corresponds to.
* @shadow: shadow value to be filled in the swap cache.
@@ -180,30 +221,23 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
* This won't put the folio's refcount. The caller has to do that.
*
* Context: Caller must ensure the folio is locked and in the swap cache
- * using the index of @entry, and lock the cluster that holds the entries.
+ * using the index of @entry, and lock the swap cache xarray.
*/
-void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
- swp_entry_t entry, void *shadow)
+void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
{
- unsigned long old_tb, new_tb;
- unsigned int ci_start, ci_off, ci_end;
- unsigned long nr_pages = folio_nr_pages(folio);
+ long nr_pages = folio_nr_pages(folio);
+ XA_STATE(xas, &swap_cache, entry.val);
+ int i;
- VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
- new_tb = shadow_swp_to_tb(shadow);
- ci_start = swp_cluster_offset(entry);
- ci_end = ci_start + nr_pages;
- ci_off = ci_start;
- do {
- /* If shadow is NULL, we sets an empty shadow */
- old_tb = __swap_table_xchg(ci, ci_off, new_tb);
- WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
- swp_tb_to_folio(old_tb) != folio);
- } while (++ci_off < ci_end);
+ for (i = 0; i < nr_pages; i++) {
+ void *old = xas_store(&xas, shadow);
+ VM_WARN_ON_FOLIO(old != folio, folio);
+ xas_next(&xas);
+ }
folio->swap.val = 0;
folio_clear_swapcache(folio);
@@ -223,12 +257,11 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
*/
void swap_cache_del_folio(struct folio *folio)
{
- struct swap_cluster_info *ci;
swp_entry_t entry = folio->swap;
- ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
- __swap_cache_del_folio(ci, folio, entry, NULL);
- swap_cluster_unlock(ci);
+ xa_lock_irq(&swap_cache);
+ __swap_cache_del_folio(folio, entry, NULL);
+ xa_unlock_irq(&swap_cache);
put_swap_folio(folio, entry);
folio_ref_sub(folio, folio_nr_pages(folio));
@@ -236,7 +269,6 @@ void swap_cache_del_folio(struct folio *folio)
/**
* __swap_cache_replace_folio - Replace a folio in the swap cache.
- * @ci: The locked swap cluster.
* @old: The old folio to be replaced.
* @new: The new folio.
*
@@ -246,39 +278,23 @@ void swap_cache_del_folio(struct folio *folio)
* the starting offset to override all slots covered by the new folio.
*
* Context: Caller must ensure both folios are locked, and lock the
- * cluster that holds the old folio to be replaced.
+ * swap cache xarray.
*/
-void __swap_cache_replace_folio(struct swap_cluster_info *ci,
- struct folio *old, struct folio *new)
+void __swap_cache_replace_folio(struct folio *old, struct folio *new)
{
swp_entry_t entry = new->swap;
unsigned long nr_pages = folio_nr_pages(new);
- unsigned int ci_off = swp_cluster_offset(entry);
- unsigned int ci_end = ci_off + nr_pages;
- unsigned long old_tb, new_tb;
+ XA_STATE(xas, &swap_cache, entry.val);
+ int i;
VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
VM_WARN_ON_ONCE(!entry.val);
- /* Swap cache still stores N entries instead of a high-order entry */
- new_tb = folio_to_swp_tb(new);
- do {
- old_tb = __swap_table_xchg(ci, ci_off, new_tb);
- WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
- } while (++ci_off < ci_end);
-
- /*
- * If the old folio is partially replaced (e.g., splitting a large
- * folio, the old folio is shrunk, and new split sub folios replace
- * the shrunk part), ensure the new folio doesn't overlap it.
- */
- if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- folio_order(old) != folio_order(new)) {
- ci_off = swp_cluster_offset(old->swap);
- ci_end = ci_off + folio_nr_pages(old);
- while (ci_off++ < ci_end)
- WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
+ for (i = 0; i < nr_pages; i++) {
+ void *old_entry = xas_store(&xas, new);
+ WARN_ON_ONCE(!old_entry || xa_is_value(old_entry) || old_entry != old);
+ xas_next(&xas);
}
}
@@ -287,20 +303,20 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
* @entry: The starting index entry.
* @nr_ents: How many slots need to be cleared.
*
- * Context: Caller must ensure the range is valid, all in one single cluster,
- * not occupied by any folio, and lock the cluster.
+ * Context: Caller must ensure the range is valid and all in one single cluster,
+ * not occupied by any folio.
*/
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
+void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
{
- struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
- unsigned int ci_off = swp_cluster_offset(entry), ci_end;
- unsigned long old;
+ XA_STATE(xas, &swap_cache, entry.val);
+ int i;
- ci_end = ci_off + nr_ents;
- do {
- old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
- WARN_ON_ONCE(swp_tb_is_folio(old));
- } while (++ci_off < ci_end);
+ xas_lock(&xas);
+ for (i = 0; i < nr_ents; i++) {
+ xas_store(&xas, NULL);
+ xas_next(&xas);
+ }
+ xas_unlock(&xas);
}
/*
@@ -480,7 +496,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
goto fail_unlock;
- swap_cache_add_folio(new_folio, entry, &shadow);
+ /* May fail (-ENOMEM) if XArray node allocation failed. */
+ if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+ goto fail_unlock;
+
memcg1_swapin(entry, 1);
if (shadow)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index ea244a57a5b7a..ad2cb2ef46903 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -13,71 +13,6 @@ struct swap_table {
#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-/*
- * A swap table entry represents the status of a swap slot on a swap
- * (physical or virtual) device. The swap table in each cluster is a
- * 1:1 map of the swap slots in this cluster.
- *
- * Each swap table entry could be a pointer (folio), a XA_VALUE
- * (shadow), or NULL.
- */
-
-/*
- * Helpers for casting one type of info into a swap table entry.
- */
-static inline unsigned long null_to_swp_tb(void)
-{
- BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
- return 0;
-}
-
-static inline unsigned long folio_to_swp_tb(struct folio *folio)
-{
- BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
- return (unsigned long)folio;
-}
-
-static inline unsigned long shadow_swp_to_tb(void *shadow)
-{
- BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
- BITS_PER_BYTE * sizeof(unsigned long));
- VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
- return (unsigned long)shadow;
-}
-
-/*
- * Helpers for swap table entry type checking.
- */
-static inline bool swp_tb_is_null(unsigned long swp_tb)
-{
- return !swp_tb;
-}
-
-static inline bool swp_tb_is_folio(unsigned long swp_tb)
-{
- return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
-}
-
-static inline bool swp_tb_is_shadow(unsigned long swp_tb)
-{
- return xa_is_value((void *)swp_tb);
-}
-
-/*
- * Helpers for retrieving info from swap table.
- */
-static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
-{
- VM_WARN_ON(!swp_tb_is_folio(swp_tb));
- return (void *)swp_tb;
-}
-
-static inline void *swp_tb_to_shadow(unsigned long swp_tb)
-{
- VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
- return (void *)swp_tb;
-}
-
/*
* Helpers for accessing or modifying the swap table of a cluster,
* the swap cluster must be locked.
@@ -92,17 +27,6 @@ static inline void __swap_table_set(struct swap_cluster_info *ci,
atomic_long_set(&table[off], swp_tb);
}
-static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
- unsigned int off, unsigned long swp_tb)
-{
- atomic_long_t *table = rcu_dereference_protected(ci->table, true);
-
- lockdep_assert_held(&ci->lock);
- VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
- /* Ordering is guaranteed by cluster lock, relax */
- return atomic_long_xchg_relaxed(&table[off], swp_tb);
-}
-
static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
unsigned int off)
{
@@ -122,7 +46,7 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
rcu_read_lock();
table = rcu_dereference(ci->table);
- swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
+ swp_tb = table ? atomic_long_read(&table[off]) : 0;
rcu_read_unlock();
return swp_tb;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46d2008e4b996..cacfafa9a540d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -474,7 +474,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
lockdep_assert_held(&ci->lock);
VM_WARN_ON_ONCE(!cluster_is_empty(ci));
for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
- VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
+ VM_WARN_ON_ONCE(__swap_table_get(ci, ci_off));
table = (void *)rcu_dereference_protected(ci->table, true);
rcu_assign_pointer(ci->table, NULL);
@@ -843,26 +843,6 @@ static bool cluster_scan_range(struct swap_info_struct *si,
return true;
}
-/*
- * Currently, the swap table is not used for count tracking, just
- * do a sanity check here to ensure nothing leaked, so the swap
- * table should be empty upon freeing.
- */
-static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
- unsigned int start, unsigned int nr)
-{
- unsigned int ci_off = start % SWAPFILE_CLUSTER;
- unsigned int ci_end = ci_off + nr;
- unsigned long swp_tb;
-
- if (IS_ENABLED(CONFIG_DEBUG_VM)) {
- do {
- swp_tb = __swap_table_get(ci, ci_off);
- VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
- } while (++ci_off < ci_end);
- }
-}
-
static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
unsigned int start, unsigned char usage,
unsigned int order)
@@ -882,7 +862,6 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
ci->order = order;
memset(si->swap_map + start, usage, nr_pages);
- swap_cluster_assert_table_empty(ci, start, nr_pages);
swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
@@ -1275,7 +1254,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
swap_slot_free_notify(si->bdev, offset);
offset++;
}
- __swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
+ swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
/*
* Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1423,6 +1402,7 @@ int folio_alloc_swap(struct folio *folio)
unsigned int order = folio_order(folio);
unsigned int size = 1 << order;
swp_entry_t entry = {};
+ int err;
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -1457,19 +1437,23 @@ int folio_alloc_swap(struct folio *folio)
}
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
- if (mem_cgroup_try_charge_swap(folio, entry))
+ if (mem_cgroup_try_charge_swap(folio, entry)) {
+ err = -ENOMEM;
goto out_free;
+ }
if (!entry.val)
return -ENOMEM;
- swap_cache_add_folio(folio, entry, NULL);
+ err = swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL);
+ if (err)
+ goto out_free;
return 0;
out_free:
put_swap_folio(folio, entry);
- return -ENOMEM;
+ return err;
}
static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
@@ -1729,7 +1713,6 @@ static void swap_entries_free(struct swap_info_struct *si,
mem_cgroup_uncharge_swap(entry, nr_pages);
swap_range_free(si, offset, nr_pages);
- swap_cluster_assert_table_empty(ci, offset, nr_pages);
if (!ci->count)
free_cluster(si, ci);
@@ -4057,9 +4040,9 @@ static int __init swapfile_init(void)
swapfile_maximum_size = arch_max_swapfile_size();
/*
- * Once a cluster is freed, it's swap table content is read
- * only, and all swap cache readers (swap_cache_*) verifies
- * the content before use. So it's safe to use RCU slab here.
+ * Once a cluster is freed, its swap table content is read only, and
+ * all swap table readers verify the content before use. So it's safe to
+ * use RCU slab here.
*/
if (!SWP_TABLE_USE_PAGE)
swap_table_cachep = kmem_cache_create("swap_table",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3fa..558ff7f413786 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -707,13 +707,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
{
int refcount;
void *shadow = NULL;
- struct swap_cluster_info *ci;
BUG_ON(!folio_test_locked(folio));
BUG_ON(mapping != folio_mapping(folio));
if (folio_test_swapcache(folio)) {
- ci = swap_cluster_get_and_lock_irq(folio);
+ swap_cache_lock_irq();
} else {
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
@@ -758,9 +757,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
if (reclaimed && !mapping_exiting(mapping))
shadow = workingset_eviction(folio, target_memcg);
- __swap_cache_del_folio(ci, folio, swap, shadow);
+ __swap_cache_del_folio(folio, swap, shadow);
memcg1_swapout(folio, swap);
- swap_cluster_unlock_irq(ci);
+ swap_cache_unlock_irq();
put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
@@ -799,7 +798,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
cannot_free:
if (folio_test_swapcache(folio)) {
- swap_cluster_unlock_irq(ci);
+ swap_cache_unlock_irq();
} else {
xa_unlock_irq(&mapping->i_pages);
spin_unlock(&mapping->host->i_lock);
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 02/20] swap: rearrange the swap header file
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 03/20] mm: swap: add an abstract API for locking out swapoff Nhat Pham
` (19 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
In the swap header file (include/linux/swap.h), group the swap API into
the following categories:
1. Lifecycle swap functions (i.e the function that changes the reference
count of the swap entry).
2. Swap cache API.
3. Physical swapfile allocator and swap device API.
Also remove extern in the functions that are rearranged.
This is purely a cleanup. No functional change intended.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 53 +++++++++++++++++++++++---------------------
1 file changed, 28 insertions(+), 25 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38ca3df687160..aa29d8ac542d1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -423,20 +423,34 @@ extern void __meminit kswapd_stop(int nid);
#ifdef CONFIG_SWAP
-int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
- unsigned long nr_pages, sector_t start_block);
-int generic_swapfile_activate(struct swap_info_struct *, struct file *,
- sector_t *);
-
+/* Lifecycle swap API (mm/swapfile.c) */
+int folio_alloc_swap(struct folio *folio);
+bool folio_free_swap(struct folio *folio);
+void put_swap_folio(struct folio *folio, swp_entry_t entry);
+void swap_shmem_alloc(swp_entry_t, int);
+int swap_duplicate(swp_entry_t);
+int swapcache_prepare(swp_entry_t entry, int nr);
+void swap_free_nr(swp_entry_t entry, int nr_pages);
+void free_swap_and_cache_nr(swp_entry_t entry, int nr);
+int __swap_count(swp_entry_t entry);
+bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
+int swp_swapcount(swp_entry_t entry);
+
+/* Swap cache API (mm/swap_state.c) */
static inline unsigned long total_swapcache_pages(void)
{
return global_node_page_state(NR_SWAPCACHE);
}
-
-void free_swap_cache(struct folio *folio);
void free_folio_and_swap_cache(struct folio *folio);
void free_pages_and_swap_cache(struct encoded_page **, int);
-/* linux/mm/swapfile.c */
+void free_swap_cache(struct folio *folio);
+
+/* Physical swap allocator and swap device API (mm/swapfile.c) */
+int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
+ unsigned long nr_pages, sector_t start_block);
+int generic_swapfile_activate(struct swap_info_struct *, struct file *,
+ sector_t *);
+
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
extern atomic_t nr_rotate_swap;
@@ -452,26 +466,15 @@ static inline long get_nr_swap_pages(void)
return atomic_long_read(&nr_swap_pages);
}
-extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio);
-bool folio_free_swap(struct folio *folio);
-void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
-extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t, int);
-extern int swap_duplicate(swp_entry_t);
-extern int swapcache_prepare(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
+void si_swapinfo(struct sysinfo *);
+swp_entry_t get_swap_page_of_type(int);
+int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
-extern unsigned int count_swap_pages(int, int);
-extern sector_t swapdev_block(int, pgoff_t);
-extern int __swap_count(swp_entry_t entry);
-extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
-extern int swp_swapcount(swp_entry_t entry);
+unsigned int count_swap_pages(int, int);
+sector_t swapdev_block(int, pgoff_t);
struct backing_dev_info;
-extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
static inline void put_swap_device(struct swap_info_struct *si)
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 03/20] mm: swap: add an abstract API for locking out swapoff
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
2026-02-08 21:58 ` [PATCH v3 02/20] swap: rearrange the swap header file Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 04/20] zswap: add new helpers for zswap entry operations Nhat Pham
` (18 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Currently, we get a reference to the backing swap device in order to
prevent swapoff from freeing the metadata of a swap entry. This does not
make sense in the new virtual swap design, especially after the swap
backends are decoupled - a swap entry might not have any backing swap
device at all, and its backend might change at any time during its
lifetime.
In preparation for this, abstract away the swapoff locking out behavior
into a generic API.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 17 +++++++++++++++++
mm/memory.c | 13 +++++++------
mm/mincore.c | 15 +++------------
mm/shmem.c | 12 ++++++------
mm/swap_state.c | 14 +++++++-------
mm/userfaultfd.c | 15 +++++++++------
mm/zswap.c | 5 ++---
7 files changed, 51 insertions(+), 40 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index aa29d8ac542d1..3da637b218baf 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -659,5 +659,22 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+static inline bool tryget_swap_entry(swp_entry_t entry,
+ struct swap_info_struct **sip)
+{
+ struct swap_info_struct *si = get_swap_device(entry);
+
+ if (sip)
+ *sip = si;
+
+ return si;
+}
+
+static inline void put_swap_entry(swp_entry_t entry,
+ struct swap_info_struct *si)
+{
+ put_swap_device(si);
+}
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a48..90031f833f52e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4630,6 +4630,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
bool need_clear_cache = false;
+ bool swapoff_locked = false;
bool exclusive = false;
softleaf_t entry;
pte_t pte;
@@ -4698,8 +4699,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
/* Prevent swapoff from happening to us. */
- si = get_swap_device(entry);
- if (unlikely(!si))
+ swapoff_locked = tryget_swap_entry(entry, &si);
+ if (unlikely(!swapoff_locked))
goto out;
folio = swap_cache_get_folio(entry);
@@ -5047,8 +5048,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
- if (si)
- put_swap_device(si);
+ if (swapoff_locked)
+ put_swap_entry(entry, si);
return ret;
out_nomap:
if (vmf->pte)
@@ -5066,8 +5067,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
- if (si)
- put_swap_device(si);
+ if (swapoff_locked)
+ put_swap_entry(entry, si);
return ret;
}
diff --git a/mm/mincore.c b/mm/mincore.c
index e5d13eea92347..f3eb771249d67 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -77,19 +77,10 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
if (!softleaf_is_swap(entry))
return !shmem;
- /*
- * Shmem mapping lookup is lockless, so we need to grab the swap
- * device. mincore page table walk locks the PTL, and the swap
- * device is stable, avoid touching the si for better performance.
- */
- if (shmem) {
- si = get_swap_device(entry);
- if (!si)
- return 0;
- }
+ if (!tryget_swap_entry(entry, &si))
+ return 0;
folio = swap_cache_get_folio(entry);
- if (shmem)
- put_swap_device(si);
+ put_swap_entry(entry, si);
/* The swap cache space contains either folio, shadow or NULL */
if (folio && !xa_is_value(folio)) {
present = folio_test_uptodate(folio);
diff --git a/mm/shmem.c b/mm/shmem.c
index 1db97ef2d14eb..b40be22fa5f09 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2307,7 +2307,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
softleaf_t index_entry;
struct swap_info_struct *si;
struct folio *folio = NULL;
- bool skip_swapcache = false;
+ bool swapoff_locked, skip_swapcache = false;
int error, nr_pages, order;
pgoff_t offset;
@@ -2319,16 +2319,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (softleaf_is_poison_marker(index_entry))
return -EIO;
- si = get_swap_device(index_entry);
+ swapoff_locked = tryget_swap_entry(index_entry, &si);
order = shmem_confirm_swap(mapping, index, index_entry);
- if (unlikely(!si)) {
+ if (unlikely(!swapoff_locked)) {
if (order < 0)
return -EEXIST;
else
return -EINVAL;
}
if (unlikely(order < 0)) {
- put_swap_device(si);
+ put_swap_entry(index_entry, si);
return -EEXIST;
}
@@ -2448,7 +2448,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
folio_mark_dirty(folio);
swap_free_nr(swap, nr_pages);
- put_swap_device(si);
+ put_swap_entry(swap, si);
*foliop = folio;
return 0;
@@ -2466,7 +2466,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swapcache_clear(si, folio->swap, folio_nr_pages(folio));
if (folio)
folio_put(folio);
- put_swap_device(si);
+ put_swap_entry(swap, si);
return error;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 34c9d9b243a74..bece18eb540fa 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -538,8 +538,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
pgoff_t ilx;
struct folio *folio;
- si = get_swap_device(entry);
- if (!si)
+ if (!tryget_swap_entry(entry, &si))
return NULL;
mpol = get_vma_policy(vma, addr, 0, &ilx);
@@ -550,7 +549,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (page_allocated)
swap_read_folio(folio, plug);
- put_swap_device(si);
+ put_swap_entry(entry, si);
return folio;
}
@@ -763,6 +762,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
for (addr = start; addr < end; ilx++, addr += PAGE_SIZE) {
struct swap_info_struct *si = NULL;
softleaf_t entry;
+ bool swapoff_locked = false;
if (!pte++) {
pte = pte_offset_map(vmf->pmd, addr);
@@ -781,14 +781,14 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
* holding a reference to, try to grab a reference, or skip.
*/
if (swp_type(entry) != swp_type(targ_entry)) {
- si = get_swap_device(entry);
- if (!si)
+ swapoff_locked = tryget_swap_entry(entry, &si);
+ if (!swapoff_locked)
continue;
}
folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
- if (si)
- put_swap_device(si);
+ if (swapoff_locked)
+ put_swap_entry(entry, si);
if (!folio)
continue;
if (page_allocated) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e6dfd5f28acd7..25f89eba0438c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1262,9 +1262,11 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
pte_t *dst_pte = NULL;
pmd_t dummy_pmdval;
pmd_t dst_pmdval;
+ softleaf_t entry;
struct folio *src_folio = NULL;
struct mmu_notifier_range range;
long ret = 0;
+ bool swapoff_locked = false;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
src_addr, src_addr + len);
@@ -1429,7 +1431,7 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
len);
} else { /* !pte_present() */
struct folio *folio = NULL;
- const softleaf_t entry = softleaf_from_pte(orig_src_pte);
+ entry = softleaf_from_pte(orig_src_pte);
if (softleaf_is_migration(entry)) {
pte_unmap(src_pte);
@@ -1449,8 +1451,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
goto out;
}
- si = get_swap_device(entry);
- if (unlikely(!si)) {
+ swapoff_locked = tryget_swap_entry(entry, &si);
+ if (unlikely(!swapoff_locked)) {
ret = -EAGAIN;
goto out;
}
@@ -1480,8 +1482,9 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
pte_unmap(src_pte);
pte_unmap(dst_pte);
src_pte = dst_pte = NULL;
- put_swap_device(si);
+ put_swap_entry(entry, si);
si = NULL;
+ swapoff_locked = false;
/* now we can block and wait */
folio_lock(src_folio);
goto retry;
@@ -1507,8 +1510,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
if (dst_pte)
pte_unmap(dst_pte);
mmu_notifier_invalidate_range_end(&range);
- if (si)
- put_swap_device(si);
+ if (swapoff_locked)
+ put_swap_entry(entry, si);
return ret;
}
diff --git a/mm/zswap.c b/mm/zswap.c
index ac9b7a60736bc..315e4d0d08311 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1009,14 +1009,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
int ret = 0;
/* try to allocate swap cache folio */
- si = get_swap_device(swpentry);
- if (!si)
+ if (!tryget_swap_entry(swpentry, &si))
return -EEXIST;
mpol = get_task_policy(current);
folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
- put_swap_device(si);
+ put_swap_entry(swpentry, si);
if (!folio)
return -ENOMEM;
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 04/20] zswap: add new helpers for zswap entry operations
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (2 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 03/20] mm: swap: add an abstract API for locking out swapoff Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 05/20] mm/swap: add a new function to check if a swap entry is in swap cache Nhat Pham
` (17 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Add new helper functions that abstract away zswap entry operations, in
order to facilitate re-implementing these operations when swap is
virtualized.
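As a sketch rather than part of the patch, the standalone userspace C
model below shows the shape of these helpers - callers pass only a
swp_entry_t, and the store/load/erase helpers hide how that entry is
mapped to a tree and an index. The flat array below is a made-up
stand-in for the xarray used in zswap.

/*
 * Standalone userspace sketch: callers hand over only a swp_entry_t and
 * the helpers hide how it is turned into a tree plus index.
 */
#include <stddef.h>
#include <stdio.h>

typedef struct { unsigned long val; } swp_entry_t;

#define NSLOTS 64
static void *tree[NSLOTS];              /* stand-in for the zswap xarray */

static size_t entry_to_index(swp_entry_t entry)
{
        return entry.val % NSLOTS;      /* stand-in for swp_offset() */
}

static void *entry_store(swp_entry_t entry, void *new)
{
        size_t i = entry_to_index(entry);
        void *old = tree[i];

        tree[i] = new;
        return old;                     /* like xa_store(): return the old value */
}

static void *entry_load(swp_entry_t entry)
{
        return tree[entry_to_index(entry)];
}

static void *entry_erase(swp_entry_t entry)
{
        return entry_store(entry, NULL);
}

int main(void)
{
        swp_entry_t entry = { .val = 42 };
        int payload = 1234;             /* stands in for a struct zswap_entry */

        entry_store(entry, &payload);
        printf("loaded:      %p\n", entry_load(entry));
        entry_erase(entry);
        printf("after erase: %p\n", entry_load(entry));
        return 0;
}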
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/zswap.c | 59 ++++++++++++++++++++++++++++++++++++------------------
1 file changed, 40 insertions(+), 19 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 315e4d0d08311..a5a3f068bd1a6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -234,6 +234,38 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>> ZSWAP_ADDRESS_SPACE_SHIFT];
}
+static inline void *zswap_entry_store(swp_entry_t swpentry,
+ struct zswap_entry *entry)
+{
+ struct xarray *tree = swap_zswap_tree(swpentry);
+ pgoff_t offset = swp_offset(swpentry);
+
+ return xa_store(tree, offset, entry, GFP_KERNEL);
+}
+
+static inline void *zswap_entry_load(swp_entry_t swpentry)
+{
+ struct xarray *tree = swap_zswap_tree(swpentry);
+ pgoff_t offset = swp_offset(swpentry);
+
+ return xa_load(tree, offset);
+}
+
+static inline void *zswap_entry_erase(swp_entry_t swpentry)
+{
+ struct xarray *tree = swap_zswap_tree(swpentry);
+ pgoff_t offset = swp_offset(swpentry);
+
+ return xa_erase(tree, offset);
+}
+
+static inline bool zswap_empty(swp_entry_t swpentry)
+{
+ struct xarray *tree = swap_zswap_tree(swpentry);
+
+ return xa_empty(tree);
+}
+
#define zswap_pool_debug(msg, p) \
pr_debug("%s pool %s\n", msg, (p)->tfm_name)
@@ -1000,8 +1032,6 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
static int zswap_writeback_entry(struct zswap_entry *entry,
swp_entry_t swpentry)
{
- struct xarray *tree;
- pgoff_t offset = swp_offset(swpentry);
struct folio *folio;
struct mempolicy *mpol;
bool folio_was_allocated;
@@ -1040,8 +1070,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* old compressed data. Only when this is successful can the entry
* be dereferenced.
*/
- tree = swap_zswap_tree(swpentry);
- if (entry != xa_load(tree, offset)) {
+ if (entry != zswap_entry_load(swpentry)) {
ret = -ENOMEM;
goto out;
}
@@ -1051,7 +1080,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
goto out;
}
- xa_erase(tree, offset);
+ zswap_entry_erase(swpentry);
count_vm_event(ZSWPWB);
if (entry->objcg)
@@ -1427,9 +1456,7 @@ static bool zswap_store_page(struct page *page,
if (!zswap_compress(page, entry, pool))
goto compress_failed;
- old = xa_store(swap_zswap_tree(page_swpentry),
- swp_offset(page_swpentry),
- entry, GFP_KERNEL);
+ old = zswap_entry_store(page_swpentry, entry);
if (xa_is_err(old)) {
int err = xa_err(old);
@@ -1563,11 +1590,9 @@ bool zswap_store(struct folio *folio)
unsigned type = swp_type(swp);
pgoff_t offset = swp_offset(swp);
struct zswap_entry *entry;
- struct xarray *tree;
for (index = 0; index < nr_pages; ++index) {
- tree = swap_zswap_tree(swp_entry(type, offset + index));
- entry = xa_erase(tree, offset + index);
+ entry = zswap_entry_erase(swp_entry(type, offset + index));
if (entry)
zswap_entry_free(entry);
}
@@ -1599,9 +1624,7 @@ bool zswap_store(struct folio *folio)
int zswap_load(struct folio *folio)
{
swp_entry_t swp = folio->swap;
- pgoff_t offset = swp_offset(swp);
bool swapcache = folio_test_swapcache(folio);
- struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
VM_WARN_ON_ONCE(!folio_test_locked(folio));
@@ -1619,7 +1642,7 @@ int zswap_load(struct folio *folio)
return -EINVAL;
}
- entry = xa_load(tree, offset);
+ entry = zswap_entry_load(swp);
if (!entry)
return -ENOENT;
@@ -1648,7 +1671,7 @@ int zswap_load(struct folio *folio)
*/
if (swapcache) {
folio_mark_dirty(folio);
- xa_erase(tree, offset);
+ zswap_entry_erase(swp);
zswap_entry_free(entry);
}
@@ -1658,14 +1681,12 @@ int zswap_load(struct folio *folio)
void zswap_invalidate(swp_entry_t swp)
{
- pgoff_t offset = swp_offset(swp);
- struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
- if (xa_empty(tree))
+ if (zswap_empty(swp))
return;
- entry = xa_erase(tree, offset);
+ entry = zswap_entry_erase(swp);
if (entry)
zswap_entry_free(entry);
}
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 05/20] mm/swap: add a new function to check if a swap entry is in swap cache
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (3 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 04/20] zswap: add new helpers for zswap entry operations Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 06/20] mm: swap: add a separate type for physical swap slots Nhat Pham
` (16 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Userfaultfd checks whether a swap entry is in the swap cache. This is
currently done by looking directly at the swapfile's swap map - however,
the swap cache state will soon be managed at the virtual swap layer.
Abstract this check away into a helper function.
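Purely illustrative (not part of the patch): a minimal, standalone
userspace C sketch of the check that the new helper wraps - each swap
slot has a usage byte in swap_map, and one flag bit in that byte records
whether the slot currently has a swap cache folio. The array size and
setup values below are made up; the flag value mirrors the kernel's
SWAP_HAS_CACHE.

/*
 * Standalone userspace sketch: one usage byte per slot, one bit of it
 * marking "has a swap cache folio".
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE  0x40            /* mirrors the kernel flag, illustrative here */

static unsigned char swap_map[8];       /* one usage byte per swap slot */

static bool is_swap_cached(unsigned long offset)
{
        return swap_map[offset] & SWAP_HAS_CACHE;
}

int main(void)
{
        swap_map[3] = 1 | SWAP_HAS_CACHE;       /* one map count + cached */
        printf("slot 3 cached: %d\n", is_swap_cached(3));
        printf("slot 4 cached: %d\n", is_swap_cached(4));
        return 0;
}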
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 6 ++++++
mm/swapfile.c | 15 +++++++++++++++
mm/userfaultfd.c | 3 +--
3 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3da637b218baf..f91a442ac0e82 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -435,6 +435,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int __swap_count(swp_entry_t entry);
bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
int swp_swapcount(swp_entry_t entry);
+bool is_swap_cached(swp_entry_t entry);
/* Swap cache API (mm/swap_state.c) */
static inline unsigned long total_swapcache_pages(void)
@@ -554,6 +555,11 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}
+static inline bool is_swap_cached(swp_entry_t entry)
+{
+ return false;
+}
+
static inline int folio_alloc_swap(struct folio *folio)
{
return -EINVAL;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cacfafa9a540d..3c89dedbd5718 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -194,6 +194,21 @@ static bool swap_only_has_cache(struct swap_info_struct *si,
return true;
}
+/**
+ * is_swap_cached - check if the swap entry is cached
+ * @entry: swap entry to check
+ *
+ * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
+ *
+ * Returns true if the swap entry is cached, false otherwise.
+ */
+bool is_swap_cached(swp_entry_t entry)
+{
+ struct swap_info_struct *si = __swap_entry_to_info(entry);
+
+ return READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE;
+}
+
static bool swap_is_last_map(struct swap_info_struct *si,
unsigned long offset, int nr_pages, bool *has_cache)
{
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 25f89eba0438c..98be764fb3ecd 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1190,7 +1190,6 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
* Check if the swap entry is cached after acquiring the src_pte
* lock. Otherwise, we might miss a newly loaded swap cache folio.
*
- * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
* We are trying to catch newly added swap cache, the only possible case is
* when a folio is swapped in and out again staying in swap cache, using the
* same entry before the PTE check above. The PTL is acquired and released
@@ -1200,7 +1199,7 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
* cache, or during the tiny synchronization window between swap cache and
* swap_map, but it will be gone very quickly, worst result is retry jitters.
*/
- if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+ if (is_swap_cached(entry)) {
double_pt_unlock(dst_ptl, src_ptl);
return -EAGAIN;
}
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 06/20] mm: swap: add a separate type for physical swap slots
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (4 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 05/20] mm/swap: add a new function to check if a swap entry is in swap cache Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 07/20] mm: create scaffolds for the new virtual swap implementation Nhat Pham
` (15 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
In preparation for swap virtualization, add a new type to represent the
physical swap slots of a swapfile. This allows us to separate:
1. The logical view of the swap entry (i.e., what is stored in page table
entries and used to index into the swap cache), represented by the
old swp_entry_t type.
from:
2. Its physical backing state (i.e., the actual backing slot on the swap
device), represented by the new swp_slot_t type.
The functions that operate at the physical level (i.e., on swp_slot_t
values) are also renamed where appropriate (e.g., prefixed with
swp_slot_*).
Note that there is no behavioral change yet - the mapping between the
two types is the identity mapping. In later patches, we shall
dynamically allocate a virtual swap slot (of type swp_entry_t) for each
swapped-out page, store it in the page table entry, and associate it
with a backing store. A physical swap slot (i.e., a slot on a physical
swap device) is one of the backing options.
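One more illustrative sketch (not part of the patch): a standalone
userspace C model of the two handle types and the helpers this patch
introduces - swp_slot() packs a device type and an offset,
swp_slot_type()/swp_slot_offset() unpack them, and the entry <-> slot
conversions are the identity for now. The shift and mask values below
are made up for the example; the kernel derives the real ones from the
PTE encoding.

/*
 * Standalone userspace sketch of the two handle types and their helpers.
 * Shift/mask values are illustrative only.
 */
#include <assert.h>
#include <stdio.h>

typedef struct { unsigned long val; } swp_entry_t;      /* virtual, PTE-visible */
typedef struct { unsigned long val; } swp_slot_t;       /* physical device slot */

#define SWP_TYPE_SHIFT  24UL                             /* illustrative layout */
#define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)

static swp_slot_t swp_slot(unsigned long type, unsigned long offset)
{
        return (swp_slot_t){ (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK) };
}

static unsigned long swp_slot_type(swp_slot_t slot)
{
        return slot.val >> SWP_TYPE_SHIFT;
}

static unsigned long swp_slot_offset(swp_slot_t slot)
{
        return slot.val & SWP_OFFSET_MASK;
}

/* Identity mapping for now; later patches turn this into a real lookup. */
static swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
{
        return (swp_slot_t){ entry.val };
}

static swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
{
        return (swp_entry_t){ slot.val };
}

int main(void)
{
        swp_slot_t slot = swp_slot(1, 4096);    /* device 1, slot offset 4096 */
        swp_entry_t entry = swp_slot_to_swp_entry(slot);

        assert(swp_slot_type(swp_entry_to_swp_slot(entry)) == 1);
        assert(swp_slot_offset(swp_entry_to_swp_slot(entry)) == 4096);
        printf("type=%lu offset=%lu\n", swp_slot_type(slot), swp_slot_offset(slot));
        return 0;
}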
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/mm_types.h | 16 +++
include/linux/swap.h | 47 ++++--
include/linux/swapops.h | 25 ++++
kernel/power/swap.c | 6 +-
mm/internal.h | 10 +-
mm/page_io.c | 33 +++--
mm/shmem.c | 19 ++-
mm/swap.h | 52 +++----
mm/swap_cgroup.c | 18 +--
mm/swap_state.c | 32 +++--
mm/swapfile.c | 300 ++++++++++++++++++++++-----------------
11 files changed, 352 insertions(+), 206 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 78950eb8926dc..bffde812decc5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -279,6 +279,13 @@ static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page)
}
/*
+ * Virtual swap slot.
+ *
+ * This type is used to represent a virtual swap slot, i.e an identifier of
+ * a swap entry. This is stored in PTEs that originally refer to the swapped
+ * out page, and is used to index into various swap architectures (swap cache,
+ * zswap tree, swap cgroup array, etc.).
+ *
* A swap entry has to fit into a "unsigned long", as the entry is hidden
* in the "index" field of the swapper address space.
*/
@@ -286,6 +293,15 @@ typedef struct {
unsigned long val;
} swp_entry_t;
+/*
+ * Physical swap slot.
+ *
+ * This type is used to represent a PAGE_SIZED slot on a swapfile.
+ */
+typedef struct {
+ unsigned long val;
+} swp_slot_t;
+
/**
* typedef softleaf_t - Describes a page table software leaf entry, abstracted
* from its architecture-specific encoding.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f91a442ac0e82..918b47da55f44 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -241,7 +241,7 @@ enum {
* cluster to which it belongs being marked free. Therefore 0 is safe to use as
* a sentinel to indicate an entry is not valid.
*/
-#define SWAP_ENTRY_INVALID 0
+#define SWAP_SLOT_INVALID 0
#ifdef CONFIG_THP_SWAP
#define SWAP_NR_ORDERS (PMD_ORDER + 1)
@@ -442,11 +442,14 @@ static inline unsigned long total_swapcache_pages(void)
{
return global_node_page_state(NR_SWAPCACHE);
}
+
void free_folio_and_swap_cache(struct folio *folio);
void free_pages_and_swap_cache(struct encoded_page **, int);
void free_swap_cache(struct folio *folio);
/* Physical swap allocator and swap device API (mm/swapfile.c) */
+void swap_slot_free_nr(swp_slot_t slot, int nr_pages);
+
int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
unsigned long nr_pages, sector_t start_block);
int generic_swapfile_activate(struct swap_info_struct *, struct file *,
@@ -468,28 +471,28 @@ static inline long get_nr_swap_pages(void)
}
void si_swapinfo(struct sysinfo *);
-swp_entry_t get_swap_page_of_type(int);
+swp_slot_t swap_slot_alloc_of_type(int);
int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
unsigned int count_swap_pages(int, int);
sector_t swapdev_block(int, pgoff_t);
struct backing_dev_info;
-struct swap_info_struct *get_swap_device(swp_entry_t entry);
+struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot);
sector_t swap_folio_sector(struct folio *folio);
-static inline void put_swap_device(struct swap_info_struct *si)
+static inline void swap_slot_put_swap_info(struct swap_info_struct *si)
{
percpu_ref_put(&si->users);
}
#else /* CONFIG_SWAP */
-static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
+static inline struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot)
{
return NULL;
}
-static inline void put_swap_device(struct swap_info_struct *si)
+static inline void swap_slot_put_swap_info(struct swap_info_struct *si)
{
}
@@ -536,7 +539,7 @@ static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}
-static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
+static inline void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
}
@@ -576,6 +579,7 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
{
return -EINVAL;
}
+
#endif /* CONFIG_SWAP */
static inline void free_swap_and_cache(swp_entry_t entry)
@@ -665,10 +669,35 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+/**
+ * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
+ * virtual swap slot.
+ * @entry: the virtual swap slot.
+ *
+ * Return: the physical swap slot corresponding to the virtual swap slot.
+ */
+static inline swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
+{
+ return (swp_slot_t) { entry.val };
+}
+
+/**
+ * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a
+ * physical swap slot.
+ * @slot: the physical swap slot.
+ *
+ * Return: the virtual swap slot corresponding to the physical swap slot.
+ */
+static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
+{
+ return (swp_entry_t) { slot.val };
+}
+
static inline bool tryget_swap_entry(swp_entry_t entry,
struct swap_info_struct **sip)
{
- struct swap_info_struct *si = get_swap_device(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = swap_slot_tryget_swap_info(slot);
if (sip)
*sip = si;
@@ -679,7 +708,7 @@ static inline bool tryget_swap_entry(swp_entry_t entry,
static inline void put_swap_entry(swp_entry_t entry,
struct swap_info_struct *si)
{
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
}
#endif /* __KERNEL__*/
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 8cfc966eae48e..9e41c35664a95 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -360,5 +360,30 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+/* Physical swap slots operations */
+
+/*
+ * Store a swap device type + offset into a swp_slot_t handle.
+ */
+static inline swp_slot_t swp_slot(unsigned long type, pgoff_t offset)
+{
+ swp_slot_t ret;
+
+ ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK);
+ return ret;
+}
+
+/* Extract the `type' field from a swp_slot_t. */
+static inline unsigned swp_slot_type(swp_slot_t slot)
+{
+ return (slot.val >> SWP_TYPE_SHIFT);
+}
+
+/* Extract the `offset' field from a swp_slot_t. */
+static inline pgoff_t swp_slot_offset(swp_slot_t slot)
+{
+ return slot.val & SWP_OFFSET_MASK;
+}
+
#endif /* CONFIG_MMU */
#endif /* _LINUX_SWAPOPS_H */
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 8050e51828351..0129c5ffa649d 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -174,10 +174,10 @@ sector_t alloc_swapdev_block(int swap)
* Allocate a swap page and register that it has been allocated, so that
* it can be freed in case of an error.
*/
- offset = swp_offset(get_swap_page_of_type(swap));
+ offset = swp_slot_offset(swap_slot_alloc_of_type(swap));
if (offset) {
if (swsusp_extents_insert(offset))
- swap_free(swp_entry(swap, offset));
+ swap_slot_free_nr(swp_slot(swap, offset), 1);
else
return swapdev_block(swap, offset);
}
@@ -197,7 +197,7 @@ void free_all_swap_pages(int swap)
ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
- swap_free_nr(swp_entry(swap, ext->start),
+ swap_slot_free_nr(swp_slot(swap, ext->start),
ext->end - ext->start + 1);
kfree(ext);
diff --git a/mm/internal.h b/mm/internal.h
index f35dbcf99a86b..e739e8cac5b55 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -334,9 +334,13 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
*/
static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
- const softleaf_t entry = softleaf_from_pte(pte);
- pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
- (swp_offset(entry) + delta)));
+ softleaf_t entry = softleaf_from_pte(pte), new_entry;
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ pte_t new;
+
+ new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot),
+ swp_slot_offset(slot) + delta));
+ new = swp_entry_to_pte(new_entry);
if (pte_swp_soft_dirty(pte))
new = pte_swp_mksoft_dirty(new);
diff --git a/mm/page_io.c b/mm/page_io.c
index 3c342db77ce38..0b02bcc85e2a8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,14 +204,17 @@ static bool is_folio_zero_filled(struct folio *folio)
static void swap_zeromap_folio_set(struct folio *folio)
{
struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ struct swap_info_struct *sis =
+ __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
int nr_pages = folio_nr_pages(folio);
swp_entry_t entry;
+ swp_slot_t slot;
unsigned int i;
for (i = 0; i < folio_nr_pages(folio); i++) {
entry = page_swap_entry(folio_page(folio, i));
- set_bit(swp_offset(entry), sis->zeromap);
+ slot = swp_entry_to_swp_slot(entry);
+ set_bit(swp_slot_offset(slot), sis->zeromap);
}
count_vm_events(SWPOUT_ZERO, nr_pages);
@@ -223,13 +226,16 @@ static void swap_zeromap_folio_set(struct folio *folio)
static void swap_zeromap_folio_clear(struct folio *folio)
{
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ struct swap_info_struct *sis =
+ __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
swp_entry_t entry;
+ swp_slot_t slot;
unsigned int i;
for (i = 0; i < folio_nr_pages(folio); i++) {
entry = page_swap_entry(folio_page(folio, i));
- clear_bit(swp_offset(entry), sis->zeromap);
+ slot = swp_entry_to_swp_slot(entry);
+ clear_bit(swp_slot_offset(slot), sis->zeromap);
}
}
@@ -357,7 +363,8 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
* messages.
*/
pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
- ret, swap_dev_pos(page_swap_entry(page)));
+ ret,
+ swap_slot_pos(swp_entry_to_swp_slot(page_swap_entry(page))));
for (p = 0; p < sio->pages; p++) {
page = sio->bvec[p].bv_page;
set_page_dirty(page);
@@ -374,9 +381,10 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap_plug)
{
struct swap_iocb *sio = swap_plug ? *swap_plug : NULL;
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
+ struct swap_info_struct *sis = __swap_slot_to_info(slot);
struct file *swap_file = sis->swap_file;
- loff_t pos = swap_dev_pos(folio->swap);
+ loff_t pos = swap_slot_pos(slot);
count_swpout_vm_event(folio);
folio_start_writeback(folio);
@@ -446,7 +454,8 @@ static void swap_writepage_bdev_async(struct folio *folio,
void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
{
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ struct swap_info_struct *sis =
+ __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
/*
@@ -537,9 +546,10 @@ static bool swap_read_folio_zeromap(struct folio *folio)
static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
+ struct swap_info_struct *sis = __swap_slot_to_info(slot);
struct swap_iocb *sio = NULL;
- loff_t pos = swap_dev_pos(folio->swap);
+ loff_t pos = swap_slot_pos(slot);
if (plug)
sio = *plug;
@@ -608,7 +618,8 @@ static void swap_read_folio_bdev_async(struct folio *folio,
void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ struct swap_info_struct *sis =
+ __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
bool workingset = folio_test_workingset(folio);
unsigned long pflags;
diff --git a/mm/shmem.c b/mm/shmem.c
index b40be22fa5f09..400e2fa8e77cb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1442,6 +1442,7 @@ static unsigned int shmem_find_swap_entries(struct address_space *mapping,
XA_STATE(xas, &mapping->i_pages, start);
struct folio *folio;
swp_entry_t entry;
+ swp_slot_t slot;
rcu_read_lock();
xas_for_each(&xas, folio, ULONG_MAX) {
@@ -1452,11 +1453,13 @@ static unsigned int shmem_find_swap_entries(struct address_space *mapping,
continue;
entry = radix_to_swp_entry(folio);
+ slot = swp_entry_to_swp_slot(entry);
+
/*
* swapin error entries can be found in the mapping. But they're
* deliberately ignored here as we've done everything we can do.
*/
- if (swp_type(entry) != type)
+ if (swp_slot_type(slot) != type)
continue;
indices[folio_batch_count(fbatch)] = xas.xa_index;
@@ -2224,6 +2227,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
int split_order = 0;
int i;
+ swp_slot_t slot = swp_entry_to_swp_slot(swap);
/* Convert user data gfp flags to xarray node gfp flags */
gfp &= GFP_RECLAIM_MASK;
@@ -2264,13 +2268,16 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
*/
for (i = 0; i < 1 << cur_order;
i += (1 << split_order)) {
- swp_entry_t tmp;
+ swp_entry_t tmp_entry;
+ swp_slot_t tmp_slot;
+
+ tmp_slot =
+ swp_slot(swp_slot_type(slot),
+ swp_slot_offset(slot) + swap_offset + i);
+ tmp_entry = swp_slot_to_swp_entry(tmp_slot);
- tmp = swp_entry(swp_type(swap),
- swp_offset(swap) + swap_offset +
- i);
__xa_store(&mapping->i_pages, aligned_index + i,
- swp_to_radix_entry(tmp), 0);
+ swp_to_radix_entry(tmp_entry), 0);
}
cur_order = split_order;
split_order = xas_try_split_min_order(split_order);
diff --git a/mm/swap.h b/mm/swap.h
index 8726b587a5b5d..bdf7aca146643 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -10,10 +10,10 @@ extern int page_cluster;
#ifdef CONFIG_THP_SWAP
#define SWAPFILE_CLUSTER HPAGE_PMD_NR
-#define swap_entry_order(order) (order)
+#define swap_slot_order(order) (order)
#else
#define SWAPFILE_CLUSTER 256
-#define swap_entry_order(order) 0
+#define swap_slot_order(order) 0
#endif
extern struct swap_info_struct *swap_info[];
@@ -57,9 +57,9 @@ enum swap_cluster_flags {
#include <linux/swapops.h> /* for swp_offset */
#include <linux/blk_types.h> /* for bio_end_io_t */
-static inline unsigned int swp_cluster_offset(swp_entry_t entry)
+static inline unsigned int swp_cluster_offset(swp_slot_t slot)
{
- return swp_offset(entry) % SWAPFILE_CLUSTER;
+ return swp_slot_offset(slot) % SWAPFILE_CLUSTER;
}
/*
@@ -75,9 +75,9 @@ static inline struct swap_info_struct *__swap_type_to_info(int type)
return si;
}
-static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
+static inline struct swap_info_struct *__swap_slot_to_info(swp_slot_t slot)
{
- return __swap_type_to_info(swp_type(entry));
+ return __swap_type_to_info(swp_slot_type(slot));
}
static inline struct swap_cluster_info *__swap_offset_to_cluster(
@@ -88,10 +88,10 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}
-static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
+static inline struct swap_cluster_info *__swap_slot_to_cluster(swp_slot_t slot)
{
- return __swap_offset_to_cluster(__swap_entry_to_info(entry),
- swp_offset(entry));
+ return __swap_offset_to_cluster(__swap_slot_to_info(slot),
+ swp_slot_offset(slot));
}
static __always_inline struct swap_cluster_info *__swap_cluster_lock(
@@ -120,7 +120,7 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
/**
* swap_cluster_lock - Lock and return the swap cluster of given offset.
* @si: swap device the cluster belongs to.
- * @offset: the swap entry offset, pointing to a valid slot.
+ * @offset: the swap slot offset, pointing to a valid slot.
*
* Context: The caller must ensure the offset is in the valid range and
* protect the swap device with reference count or locks.
@@ -134,10 +134,12 @@ static inline struct swap_cluster_info *swap_cluster_lock(
static inline struct swap_cluster_info *__swap_cluster_get_and_lock(
const struct folio *folio, bool irq)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
+
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
- return __swap_cluster_lock(__swap_entry_to_info(folio->swap),
- swp_offset(folio->swap), irq);
+ return __swap_cluster_lock(__swap_slot_to_info(slot),
+ swp_slot_offset(slot), irq);
}
/*
@@ -209,12 +211,10 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
return &swap_space;
}
-/*
- * Return the swap device position of the swap entry.
- */
-static inline loff_t swap_dev_pos(swp_entry_t entry)
+/* Return the swap device position of the swap slot. */
+static inline loff_t swap_slot_pos(swp_slot_t slot)
{
- return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
+ return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
}
/**
@@ -276,7 +276,9 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
static inline unsigned int folio_swap_flags(struct folio *folio)
{
- return __swap_entry_to_info(folio->swap)->flags;
+ swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap);
+
+ return __swap_slot_to_info(swp_slot)->flags;
}
/*
@@ -287,8 +289,9 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
bool *is_zeromap)
{
- struct swap_info_struct *sis = __swap_entry_to_info(entry);
- unsigned long start = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *sis = __swap_slot_to_info(slot);
+ unsigned long start = swp_slot_offset(slot);
unsigned long end = start + max_nr;
bool first_bit;
@@ -306,8 +309,9 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
- struct swap_info_struct *si = __swap_entry_to_info(entry);
- pgoff_t offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = __swap_slot_to_info(slot);
+ pgoff_t offset = swp_slot_offset(slot);
int i;
/*
@@ -326,7 +330,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
#else /* CONFIG_SWAP */
struct swap_iocb;
static inline struct swap_cluster_info *swap_cluster_lock(
- struct swap_info_struct *si, pgoff_t offset, bool irq)
+ struct swap_info_struct *si, unsigned long offset)
{
return NULL;
}
@@ -351,7 +355,7 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
{
}
-static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
+static inline struct swap_info_struct *__swap_slot_to_info(swp_slot_t slot)
{
return NULL;
}
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index de779fed8c210..77ce1d66c318d 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -65,13 +65,14 @@ void swap_cgroup_record(struct folio *folio, unsigned short id,
swp_entry_t ent)
{
unsigned int nr_ents = folio_nr_pages(folio);
+ swp_slot_t slot = swp_entry_to_swp_slot(ent);
struct swap_cgroup *map;
pgoff_t offset, end;
unsigned short old;
- offset = swp_offset(ent);
+ offset = swp_slot_offset(slot);
end = offset + nr_ents;
- map = swap_cgroup_ctrl[swp_type(ent)].map;
+ map = swap_cgroup_ctrl[swp_slot_type(slot)].map;
do {
old = __swap_cgroup_id_xchg(map, offset, id);
@@ -92,13 +93,13 @@ void swap_cgroup_record(struct folio *folio, unsigned short id,
*/
unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
{
- pgoff_t offset, end;
+ swp_slot_t slot = swp_entry_to_swp_slot(ent);
+ pgoff_t offset = swp_slot_offset(slot);
+ pgoff_t end = offset + nr_ents;
struct swap_cgroup *map;
unsigned short old, iter = 0;
- offset = swp_offset(ent);
- end = offset + nr_ents;
- map = swap_cgroup_ctrl[swp_type(ent)].map;
+ map = swap_cgroup_ctrl[swp_slot_type(slot)].map;
do {
old = __swap_cgroup_id_xchg(map, offset, 0);
@@ -119,12 +120,13 @@ unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
{
struct swap_cgroup_ctrl *ctrl;
+ swp_slot_t slot = swp_entry_to_swp_slot(ent);
if (mem_cgroup_disabled())
return 0;
- ctrl = &swap_cgroup_ctrl[swp_type(ent)];
- return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
+ ctrl = &swap_cgroup_ctrl[swp_slot_type(slot)];
+ return __swap_cgroup_id_lookup(ctrl->map, swp_slot_offset(slot));
}
int swap_cgroup_swapon(int type, unsigned long max_pages)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bece18eb540fa..e2e9f55bea3bb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -421,7 +421,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists)
{
- struct swap_info_struct *si = __swap_entry_to_info(entry);
+ struct swap_info_struct *si =
+ __swap_slot_to_info(swp_entry_to_swp_slot(entry));
struct folio *folio;
struct folio *new_folio = NULL;
struct folio *result = NULL;
@@ -636,11 +637,12 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx)
{
struct folio *folio;
- unsigned long entry_offset = swp_offset(entry);
- unsigned long offset = entry_offset;
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ unsigned long slot_offset = swp_slot_offset(slot);
+ unsigned long offset = slot_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
- struct swap_info_struct *si = __swap_entry_to_info(entry);
+ struct swap_info_struct *si = __swap_slot_to_info(slot);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
bool page_allocated;
@@ -661,13 +663,13 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
folio = __read_swap_cache_async(
- swp_entry(swp_type(entry), offset),
+ swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), offset)),
gfp_mask, mpol, ilx, &page_allocated, false);
if (!folio)
continue;
if (page_allocated) {
swap_read_folio(folio, &splug);
- if (offset != entry_offset) {
+ if (offset != slot_offset) {
folio_set_readahead(folio);
count_vm_event(SWAP_RA);
}
@@ -779,16 +781,20 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
/*
* Readahead entry may come from a device that we are not
* holding a reference to, try to grab a reference, or skip.
+ *
+ * XXX: for now, always try to pin the swap entries in the
+ * readahead window to avoid the annoying conversion to physical
+ * swap slots. Once we move all swap metadata to virtual swap
+ * layer, we can simply compare the clusters of the target
+ * swap entry and the current swap entry, and pin the latter
+ * swap entry's cluster if it differ from the former's.
*/
- if (swp_type(entry) != swp_type(targ_entry)) {
- swapoff_locked = tryget_swap_entry(entry, &si);
- if (!swapoff_locked)
- continue;
- }
+ swapoff_locked = tryget_swap_entry(entry, &si);
+ if (!swapoff_locked)
+ continue;
folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
- if (swapoff_locked)
- put_swap_entry(entry, si);
+ put_swap_entry(entry, si);
if (!folio)
continue;
if (page_allocated) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3c89dedbd5718..4b4126d4e2769 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,9 +53,9 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entries_free(struct swap_info_struct *si,
+static void swap_slots_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages);
+ swp_slot_t slot, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static bool folio_swapcache_freeable(struct folio *folio);
@@ -126,7 +126,7 @@ struct percpu_swap_cluster {
static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
.si = { NULL },
- .offset = { SWAP_ENTRY_INVALID },
+ .offset = { SWAP_SLOT_INVALID },
.lock = INIT_LOCAL_LOCK(),
};
@@ -139,9 +139,9 @@ static struct swap_info_struct *swap_type_to_info(int type)
}
/* May return NULL on invalid entry, caller must check for NULL return */
-static struct swap_info_struct *swap_entry_to_info(swp_entry_t entry)
+static struct swap_info_struct *swap_slot_to_info(swp_slot_t slot)
{
- return swap_type_to_info(swp_type(entry));
+ return swap_type_to_info(swp_slot_type(slot));
}
static inline unsigned char swap_count(unsigned char ent)
@@ -204,9 +204,11 @@ static bool swap_only_has_cache(struct swap_info_struct *si,
*/
bool is_swap_cached(swp_entry_t entry)
{
- struct swap_info_struct *si = __swap_entry_to_info(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = swap_slot_to_info(slot);
+ unsigned long offset = swp_slot_offset(slot);
- return READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE;
+ return READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE;
}
static bool swap_is_last_map(struct swap_info_struct *si,
@@ -236,7 +238,9 @@ static bool swap_is_last_map(struct swap_info_struct *si,
static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
- const swp_entry_t entry = swp_entry(si->type, offset);
+ const swp_entry_t entry =
+ swp_slot_to_swp_entry(swp_slot(si->type, offset));
+ swp_slot_t slot;
struct swap_cluster_info *ci;
struct folio *folio;
int ret, nr_pages;
@@ -268,7 +272,8 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
folio_put(folio);
goto again;
}
- offset = swp_offset(folio->swap);
+ slot = swp_entry_to_swp_slot(folio->swap);
+ offset = swp_slot_offset(slot);
need_reclaim = ((flags & TTRS_ANYWAY) ||
((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
@@ -368,12 +373,12 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
sector_t swap_folio_sector(struct folio *folio)
{
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
+ struct swap_info_struct *sis = __swap_slot_to_info(slot);
struct swap_extent *se;
sector_t sector;
- pgoff_t offset;
+ pgoff_t offset = swp_slot_offset(slot);
- offset = swp_offset(folio->swap);
se = offset_to_swap_extent(sis, offset);
sector = se->start_block + (offset - se->start_page);
return sector << (PAGE_SHIFT - 9);
@@ -890,7 +895,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
unsigned int order,
unsigned char usage)
{
- unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
+ unsigned int next = SWAP_SLOT_INVALID, found = SWAP_SLOT_INVALID;
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
unsigned int nr_pages = 1 << order;
@@ -947,7 +952,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
unsigned char usage,
bool scan_all)
{
- unsigned int found = SWAP_ENTRY_INVALID;
+ unsigned int found = SWAP_SLOT_INVALID;
do {
struct swap_cluster_info *ci = isolate_lock_cluster(si, list);
@@ -1017,11 +1022,11 @@ static void swap_reclaim_work(struct work_struct *work)
* Try to allocate swap entries with specified order and try set a new
* cluster for current CPU too.
*/
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+static unsigned long cluster_alloc_swap_slot(struct swap_info_struct *si, int order,
unsigned char usage)
{
struct swap_cluster_info *ci;
- unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
+ unsigned int offset = SWAP_SLOT_INVALID, found = SWAP_SLOT_INVALID;
/*
* Swapfile is not block device so unable
@@ -1034,7 +1039,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
/* Serialize HDD SWAP allocation for each device. */
spin_lock(&si->global_cluster_lock);
offset = si->global_cluster->next[order];
- if (offset == SWAP_ENTRY_INVALID)
+ if (offset == SWAP_SLOT_INVALID)
goto new_cluster;
ci = swap_cluster_lock(si, offset);
@@ -1255,7 +1260,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
*/
for (i = 0; i < nr_entries; i++) {
clear_bit(offset + i, si->zeromap);
- zswap_invalidate(swp_entry(si->type, offset + i));
+ zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i)));
}
if (si->flags & SWP_BLKDEV)
@@ -1300,12 +1305,11 @@ static bool get_swap_device_info(struct swap_info_struct *si)
* Fast path try to get swap entries with specified order from current
* CPU's swap entry pool (a cluster).
*/
-static bool swap_alloc_fast(swp_entry_t *entry,
- int order)
+static bool swap_alloc_fast(swp_slot_t *slot, int order)
{
struct swap_cluster_info *ci;
struct swap_info_struct *si;
- unsigned int offset, found = SWAP_ENTRY_INVALID;
+ unsigned int offset, found = SWAP_SLOT_INVALID;
/*
* Once allocated, swap_info_struct will never be completely freed,
@@ -1322,18 +1326,17 @@ static bool swap_alloc_fast(swp_entry_t *entry,
offset = cluster_offset(si, ci);
found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
if (found)
- *entry = swp_entry(si->type, found);
+ *slot = swp_slot(si->type, found);
} else {
swap_cluster_unlock(ci);
}
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
return !!found;
}
/* Rotate the device and switch to a new cluster */
-static void swap_alloc_slow(swp_entry_t *entry,
- int order)
+static void swap_alloc_slow(swp_slot_t *slot, int order)
{
unsigned long offset;
struct swap_info_struct *si, *next;
@@ -1345,10 +1348,10 @@ static void swap_alloc_slow(swp_entry_t *entry,
plist_requeue(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
- offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
- put_swap_device(si);
+ offset = cluster_alloc_swap_slot(si, order, SWAP_HAS_CACHE);
+ swap_slot_put_swap_info(si);
if (offset) {
- *entry = swp_entry(si->type, offset);
+ *slot = swp_slot(si->type, offset);
return;
}
if (order)
@@ -1388,7 +1391,7 @@ static bool swap_sync_discard(void)
if (get_swap_device_info(si)) {
if (si->flags & SWP_PAGE_DISCARD)
ret = swap_do_scheduled_discard(si);
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
}
if (ret)
return true;
@@ -1402,25 +1405,9 @@ static bool swap_sync_discard(void)
return false;
}
-/**
- * folio_alloc_swap - allocate swap space for a folio
- * @folio: folio we want to move to swap
- *
- * Allocate swap space for the folio and add the folio to the
- * swap cache.
- *
- * Context: Caller needs to hold the folio lock.
- * Return: Whether the folio was added to the swap cache.
- */
-int folio_alloc_swap(struct folio *folio)
+static int swap_slot_alloc(swp_slot_t *slot, unsigned int order)
{
- unsigned int order = folio_order(folio);
unsigned int size = 1 << order;
- swp_entry_t entry = {};
- int err;
-
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
if (order) {
/*
@@ -1442,22 +1429,52 @@ int folio_alloc_swap(struct folio *folio)
again:
local_lock(&percpu_swap_cluster.lock);
- if (!swap_alloc_fast(&entry, order))
- swap_alloc_slow(&entry, order);
+ if (!swap_alloc_fast(slot, order))
+ swap_alloc_slow(slot, order);
local_unlock(&percpu_swap_cluster.lock);
- if (unlikely(!order && !entry.val)) {
+ if (unlikely(!order && !slot->val)) {
if (swap_sync_discard())
goto again;
}
+ return 0;
+}
+
+/**
+ * folio_alloc_swap - allocate swap space for a folio
+ * @folio: folio we want to move to swap
+ *
+ * Allocate swap space for the folio and add the folio to the
+ * swap cache.
+ *
+ * Context: Caller needs to hold the folio lock.
+ * Return: Whether the folio was added to the swap cache.
+ */
+int folio_alloc_swap(struct folio *folio)
+{
+ unsigned int order = folio_order(folio);
+ swp_slot_t slot = { 0 };
+ swp_entry_t entry = {};
+ int err = 0, ret;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
+
+ ret = swap_slot_alloc(&slot, order);
+ if (ret)
+ return ret;
+
+ /* XXX: for now, physical and virtual swap slots are identical */
+ entry.val = slot.val;
+
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
if (mem_cgroup_try_charge_swap(folio, entry)) {
err = -ENOMEM;
goto out_free;
}
- if (!entry.val)
+ if (!slot.val)
return -ENOMEM;
err = swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL);
@@ -1471,46 +1488,46 @@ int folio_alloc_swap(struct folio *folio)
return err;
}
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *_swap_info_get(swp_slot_t slot)
{
struct swap_info_struct *si;
unsigned long offset;
- if (!entry.val)
+ if (!slot.val)
goto out;
- si = swap_entry_to_info(entry);
+ si = swap_slot_to_info(slot);
if (!si)
goto bad_nofile;
if (data_race(!(si->flags & SWP_USED)))
goto bad_device;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
if (offset >= si->max)
goto bad_offset;
- if (data_race(!si->swap_map[swp_offset(entry)]))
+ if (data_race(!si->swap_map[swp_slot_offset(slot)]))
goto bad_free;
return si;
bad_free:
- pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Unused_offset, slot.val);
goto out;
bad_offset:
- pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val);
goto out;
bad_device:
- pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Unused_file, slot.val);
goto out;
bad_nofile:
- pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val);
out:
return NULL;
}
-static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
+static unsigned char swap_slot_put_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry,
+ swp_slot_t slot,
unsigned char usage)
{
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
unsigned char count;
unsigned char has_cache;
@@ -1542,7 +1559,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
if (usage)
WRITE_ONCE(si->swap_map[offset], usage);
else
- swap_entries_free(si, ci, entry, 1);
+ swap_slots_free(si, ci, slot, 1);
return usage;
}
@@ -1552,8 +1569,9 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* prevent swapoff, such as the folio in swap cache is locked, RCU
* reader side is locked, etc., the swap entry may become invalid
* because of swapoff. Then, we need to enclose all swap related
- * functions with get_swap_device() and put_swap_device(), unless the
- * swap functions call get/put_swap_device() by themselves.
+ * functions with swap_slot_tryget_swap_info() and
+ * swap_slot_put_swap_info(), unless the swap functions call
+ * swap_slot_(tryget|put)_swap_info by themselves.
*
* RCU reader side lock (including any spinlock) is sufficient to
* prevent swapoff, because synchronize_rcu() is called in swapoff()
@@ -1562,11 +1580,11 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* Check whether swap entry is valid in the swap device. If so,
* return pointer to swap_info_struct, and keep the swap entry valid
* via preventing the swap device from being swapoff, until
- * put_swap_device() is called. Otherwise return NULL.
+ * swap_slot_put_swap_info() is called. Otherwise return NULL.
*
* Notice that swapoff or swapoff+swapon can still happen before the
- * percpu_ref_tryget_live() in get_swap_device() or after the
- * percpu_ref_put() in put_swap_device() if there isn't any other way
+ * percpu_ref_tryget_live() in swap_slot_tryget_swap_info() or after the
+ * percpu_ref_put() in swap_slot_put_swap_info() if there isn't any other way
* to prevent swapoff. The caller must be prepared for that. For
* example, the following situation is possible.
*
@@ -1586,53 +1604,53 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* changed with the page table locked to check whether the swap device
* has been swapoff or swapoff+swapon.
*/
-struct swap_info_struct *get_swap_device(swp_entry_t entry)
+struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot)
{
struct swap_info_struct *si;
unsigned long offset;
- if (!entry.val)
+ if (!slot.val)
goto out;
- si = swap_entry_to_info(entry);
+ si = swap_slot_to_info(slot);
if (!si)
goto bad_nofile;
if (!get_swap_device_info(si))
goto out;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
if (offset >= si->max)
goto put_out;
return si;
bad_nofile:
- pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val);
out:
return NULL;
put_out:
- pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val);
percpu_ref_put(&si->users);
return NULL;
}
-static void swap_entries_put_cache(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
+static void swap_slots_put_cache(struct swap_info_struct *si,
+ swp_slot_t slot, int nr)
{
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
ci = swap_cluster_lock(si, offset);
if (swap_only_has_cache(si, offset, nr)) {
- swap_entries_free(si, ci, entry, nr);
+ swap_slots_free(si, ci, slot, nr);
} else {
- for (int i = 0; i < nr; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ for (int i = 0; i < nr; i++, slot.val++)
+ swap_slot_put_locked(si, ci, slot, SWAP_HAS_CACHE);
}
swap_cluster_unlock(ci);
}
-static bool swap_entries_put_map(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
+static bool swap_slots_put_map(struct swap_info_struct *si,
+ swp_slot_t slot, int nr)
{
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
bool has_cache = false;
unsigned char count;
@@ -1649,7 +1667,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
goto locked_fallback;
}
if (!has_cache)
- swap_entries_free(si, ci, entry, nr);
+ swap_slots_free(si, ci, slot, nr);
else
for (i = 0; i < nr; i++)
WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
@@ -1660,8 +1678,8 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
fallback:
ci = swap_cluster_lock(si, offset);
locked_fallback:
- for (i = 0; i < nr; i++, entry.val++) {
- count = swap_entry_put_locked(si, ci, entry, 1);
+ for (i = 0; i < nr; i++, slot.val++) {
+ count = swap_slot_put_locked(si, ci, slot, 1);
if (count == SWAP_HAS_CACHE)
has_cache = true;
}
@@ -1674,20 +1692,20 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
* cross multi clusters, so ensure the range is within a single cluster
* when freeing entries with functions without "_nr" suffix.
*/
-static bool swap_entries_put_map_nr(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
+static bool swap_slots_put_map_nr(struct swap_info_struct *si,
+ swp_slot_t slot, int nr)
{
int cluster_nr, cluster_rest;
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
bool has_cache = false;
cluster_rest = SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER;
while (nr) {
cluster_nr = min(nr, cluster_rest);
- has_cache |= swap_entries_put_map(si, entry, cluster_nr);
+ has_cache |= swap_slots_put_map(si, slot, cluster_nr);
cluster_rest = SWAPFILE_CLUSTER;
nr -= cluster_nr;
- entry.val += cluster_nr;
+ slot.val += cluster_nr;
}
return has_cache;
@@ -1707,13 +1725,14 @@ static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
* Drop the last ref of swap entries, caller have to ensure all entries
* belong to the same cgroup and cluster.
*/
-static void swap_entries_free(struct swap_info_struct *si,
+static void swap_slots_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages)
+ swp_slot_t slot, unsigned int nr_pages)
{
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
+ swp_entry_t entry = swp_slot_to_swp_entry(slot);
/* It should never free entries across different clusters */
VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
@@ -1739,43 +1758,54 @@ static void swap_entries_free(struct swap_info_struct *si,
* Caller has made sure that the swap device corresponding to entry
* is still around or has not been recycled.
*/
-void swap_free_nr(swp_entry_t entry, int nr_pages)
+void swap_slot_free_nr(swp_slot_t slot, int nr_pages)
{
int nr;
struct swap_info_struct *sis;
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
- sis = _swap_info_get(entry);
+ sis = _swap_info_get(slot);
if (!sis)
return;
while (nr_pages) {
nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- swap_entries_put_map(sis, swp_entry(sis->type, offset), nr);
+ swap_slots_put_map(sis, swp_slot(sis->type, offset), nr);
offset += nr;
nr_pages -= nr;
}
}
+/*
+ * Caller has made sure that the swap device corresponding to entry
+ * is still around or has not been recycled.
+ */
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+ swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages);
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
struct swap_info_struct *si;
- int size = 1 << swap_entry_order(folio_order(folio));
+ int size = 1 << swap_slot_order(folio_order(folio));
- si = _swap_info_get(entry);
+ si = _swap_info_get(slot);
if (!si)
return;
- swap_entries_put_cache(si, entry, size);
+ swap_slots_put_cache(si, slot, size);
}
int __swap_count(swp_entry_t entry)
{
- struct swap_info_struct *si = __swap_entry_to_info(entry);
- pgoff_t offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = __swap_slot_to_info(slot);
+ pgoff_t offset = swp_slot_offset(slot);
return swap_count(si->swap_map[offset]);
}
@@ -1787,7 +1817,8 @@ int __swap_count(swp_entry_t entry)
*/
bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
{
- pgoff_t offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ pgoff_t offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
int count;
@@ -1803,6 +1834,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
*/
int swp_swapcount(swp_entry_t entry)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
int count, tmp_count, n;
struct swap_info_struct *si;
struct swap_cluster_info *ci;
@@ -1810,11 +1842,11 @@ int swp_swapcount(swp_entry_t entry)
pgoff_t offset;
unsigned char *map;
- si = _swap_info_get(entry);
+ si = _swap_info_get(slot);
if (!si)
return 0;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
ci = swap_cluster_lock(si, offset);
@@ -1846,10 +1878,11 @@ int swp_swapcount(swp_entry_t entry)
static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
swp_entry_t entry, int order)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
struct swap_cluster_info *ci;
unsigned char *map = si->swap_map;
unsigned int nr_pages = 1 << order;
- unsigned long roffset = swp_offset(entry);
+ unsigned long roffset = swp_slot_offset(slot);
unsigned long offset = round_down(roffset, nr_pages);
int i;
bool ret = false;
@@ -1874,7 +1907,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
static bool folio_swapped(struct folio *folio)
{
swp_entry_t entry = folio->swap;
- struct swap_info_struct *si = _swap_info_get(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = _swap_info_get(slot);
if (!si)
return false;
@@ -1948,13 +1982,14 @@ bool folio_free_swap(struct folio *folio)
*/
void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
- const unsigned long start_offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ const unsigned long start_offset = swp_slot_offset(slot);
const unsigned long end_offset = start_offset + nr;
struct swap_info_struct *si;
bool any_only_cache = false;
unsigned long offset;
- si = get_swap_device(entry);
+ si = swap_slot_tryget_swap_info(slot);
if (!si)
return;
@@ -1964,7 +1999,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
/*
* First free all entries in the range.
*/
- any_only_cache = swap_entries_put_map_nr(si, entry, nr);
+ any_only_cache = swap_slots_put_map_nr(si, slot, nr);
/*
* Short-circuit the below loop if none of the entries had their
@@ -1998,16 +2033,16 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
}
out:
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
}
#ifdef CONFIG_HIBERNATION
-swp_entry_t get_swap_page_of_type(int type)
+swp_slot_t swap_slot_alloc_of_type(int type)
{
struct swap_info_struct *si = swap_type_to_info(type);
unsigned long offset;
- swp_entry_t entry = {0};
+ swp_slot_t slot = {0};
if (!si)
goto fail;
@@ -2020,15 +2055,15 @@ swp_entry_t get_swap_page_of_type(int type)
* with swap table allocation.
*/
local_lock(&percpu_swap_cluster.lock);
- offset = cluster_alloc_swap_entry(si, 0, 1);
+ offset = cluster_alloc_swap_slot(si, 0, 1);
local_unlock(&percpu_swap_cluster.lock);
if (offset)
- entry = swp_entry(si->type, offset);
+ slot = swp_slot(si->type, offset);
}
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
}
fail:
- return entry;
+ return slot;
}
/*
@@ -2257,6 +2292,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long offset;
unsigned char swp_count;
softleaf_t entry;
+ swp_slot_t slot;
int ret;
pte_t ptent;
@@ -2271,10 +2307,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (!softleaf_is_swap(entry))
continue;
- if (swp_type(entry) != type)
+
+ slot = swp_entry_to_swp_slot(entry);
+ if (swp_slot_type(slot) != type)
continue;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
pte_unmap(pte);
pte = NULL;
@@ -2459,6 +2497,7 @@ static int try_to_unuse(unsigned int type)
struct swap_info_struct *si = swap_info[type];
struct folio *folio;
swp_entry_t entry;
+ swp_slot_t slot;
unsigned int i;
if (!swap_usage_in_pages(si))
@@ -2506,7 +2545,8 @@ static int try_to_unuse(unsigned int type)
!signal_pending(current) &&
(i = find_next_to_unuse(si, i)) != 0) {
- entry = swp_entry(type, i);
+ slot = swp_slot(type, i);
+ entry = swp_slot_to_swp_entry(slot);
folio = swap_cache_get_folio(entry);
if (!folio)
continue;
@@ -2890,7 +2930,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
}
/*
- * Wait for swap operations protected by get/put_swap_device()
+ * Wait for swap operations protected by swap_slot_(tryget|put)_swap_info()
* to complete. Because of synchronize_rcu() here, all swap
* operations protected by RCU reader side lock (including any
* spinlock) will be waited too. This makes it easy to
@@ -3331,7 +3371,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
if (!si->global_cluster)
goto err;
for (i = 0; i < SWAP_NR_ORDERS; i++)
- si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
+ si->global_cluster->next[i] = SWAP_SLOT_INVALID;
spin_lock_init(&si->global_cluster_lock);
}
@@ -3669,6 +3709,7 @@ void si_swapinfo(struct sysinfo *val)
*/
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
struct swap_info_struct *si;
struct swap_cluster_info *ci;
unsigned long offset;
@@ -3676,13 +3717,13 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
unsigned char has_cache;
int err, i;
- si = swap_entry_to_info(entry);
+ si = swap_slot_to_info(slot);
if (WARN_ON_ONCE(!si)) {
pr_err("%s%08lx\n", Bad_file, entry.val);
return -EINVAL;
}
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
VM_WARN_ON(usage == 1 && nr > 1);
ci = swap_cluster_lock(si, offset);
@@ -3788,7 +3829,7 @@ int swapcache_prepare(swp_entry_t entry, int nr)
*/
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
{
- swap_entries_put_cache(si, entry, nr);
+ swap_slots_put_cache(si, swp_entry_to_swp_slot(entry), nr);
}
/*
@@ -3815,6 +3856,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
struct page *list_page;
pgoff_t offset;
unsigned char count;
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
int ret = 0;
/*
@@ -3823,7 +3865,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
*/
page = alloc_page(gfp_mask | __GFP_HIGHMEM);
- si = get_swap_device(entry);
+ si = swap_slot_tryget_swap_info(slot);
if (!si) {
/*
* An acceptable race has occurred since the failing
@@ -3832,7 +3874,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
goto outer;
}
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
ci = swap_cluster_lock(si, offset);
@@ -3895,7 +3937,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
spin_unlock(&si->cont_lock);
out:
swap_cluster_unlock(ci);
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
outer:
if (page)
__free_page(page);
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 07/20] mm: create scaffolds for the new virtual swap implementation
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (5 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 06/20] mm: swap: add a separate type for physical swap slots Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 08/20] zswap: prepare zswap for swap virtualization Nhat Pham
` (14 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
In preparation for the implementation of swap virtualization, add new
scaffolds for the new code: a new mm/vswap.c source file, which
currently only holds the logic to set up the (for now, empty) vswap
debugfs directory. Hook this up in the swap setup step in
mm/swap_state.c, and set up vswap compilation in the Makefile.
Other than the debugfs directory, no behavioral change intended.
Finally, add Johannes as a swap reviewer, given that he has contributed
significantly to the development of virtual swap.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
MAINTAINERS | 2 ++
include/linux/swap.h | 2 ++
mm/Makefile | 2 +-
mm/swap_state.c | 6 ++++++
mm/vswap.c | 35 +++++++++++++++++++++++++++++++++++
5 files changed, 46 insertions(+), 1 deletion(-)
create mode 100644 mm/vswap.c
diff --git a/MAINTAINERS b/MAINTAINERS
index e087673237636..b21038b160a07 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16728,6 +16728,7 @@ R: Kemeng Shi <shikemeng@huaweicloud.com>
R: Nhat Pham <nphamcs@gmail.com>
R: Baoquan He <bhe@redhat.com>
R: Barry Song <baohua@kernel.org>
+R: Johannes Weiner <hannes@cmpxchg.org>
L: linux-mm@kvack.org
S: Maintained
F: Documentation/mm/swap-table.rst
@@ -16740,6 +16741,7 @@ F: mm/swap.h
F: mm/swap_table.h
F: mm/swap_state.c
F: mm/swapfile.c
+F: mm/vswap.c
MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
M: Andrew Morton <akpm@linux-foundation.org>
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 918b47da55f44..1ff463fb3a966 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -669,6 +669,8 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+int vswap_init(void);
+
/**
* swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
* virtual swap slot.
diff --git a/mm/Makefile b/mm/Makefile
index 2d0570a16e5be..67fa4586e7e18 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -75,7 +75,7 @@ ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o
endif
-obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o vswap.o
obj-$(CONFIG_ZSWAP) += zswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e2e9f55bea3bb..29ec666be4204 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -882,6 +882,12 @@ static int __init swap_init(void)
int err;
struct kobject *swap_kobj;
+ err = vswap_init();
+ if (err) {
+ pr_err("failed to initialize virtual swap space\n");
+ return err;
+ }
+
swap_kobj = kobject_create_and_add("swap", mm_kobj);
if (!swap_kobj) {
pr_err("failed to create swap kobject\n");
diff --git a/mm/vswap.c b/mm/vswap.c
new file mode 100644
index 0000000000000..e68234f053fc9
--- /dev/null
+++ b/mm/vswap.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual swap space
+ *
+ * Copyright (C) 2024 Meta Platforms, Inc., Nhat Pham
+ */
+#include <linux/swap.h>
+
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+
+static struct dentry *vswap_debugfs_root;
+
+static int vswap_debug_fs_init(void)
+{
+ if (!debugfs_initialized())
+ return -ENODEV;
+
+ vswap_debugfs_root = debugfs_create_dir("vswap", NULL);
+ return 0;
+}
+#else
+static int vswap_debug_fs_init(void)
+{
+ return 0;
+}
+#endif
+
+int vswap_init(void)
+{
+ if (vswap_debug_fs_init())
+ pr_warn("Failed to initialize vswap debugfs\n");
+
+ return 0;
+}
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 08/20] zswap: prepare zswap for swap virtualization
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (6 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 07/20] mm: create scaffolds for the new virtual swap implementation Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
` (13 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
The zswap tree code, specifically its range-partitioning logic, can no
longer be easily reused for the new virtual swap space design. For now,
use a single unified zswap tree in the new implementation.
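As a rough illustration (not part of the patch, and using hypothetical
values), the change in how an entry is keyed; the user-space sketch below
only mimics the arithmetic of the old per-range trees versus the single
tree keyed by the entry value:
#include <stdio.h>

/* old code: one tree per 64M of swap space per swap device */
#define ZSWAP_ADDRESS_SPACE_SHIFT 14

int main(void)
{
	unsigned long offset = 123456;		/* hypothetical physical slot offset */
	unsigned long entry_val = 987654;	/* hypothetical virtual entry.val */

	/* before: pick zswap_trees[type][offset >> SHIFT], then index by offset */
	printf("old: tree %lu, index %lu\n",
	       offset >> ZSWAP_ADDRESS_SPACE_SHIFT, offset);

	/* after: one global xarray, with zswap_tree_index(entry) == entry.val */
	printf("new: index %lu\n", entry_val);
	return 0;
}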
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/zswap.h | 7 -----
mm/swapfile.c | 9 +-----
mm/zswap.c | 69 +++++++------------------------------------
3 files changed, 11 insertions(+), 74 deletions(-)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e1..1a04caf283dc8 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -28,8 +28,6 @@ unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
int zswap_load(struct folio *folio);
void zswap_invalidate(swp_entry_t swp);
-int zswap_swapon(int type, unsigned long nr_pages);
-void zswap_swapoff(int type);
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
@@ -50,11 +48,6 @@ static inline int zswap_load(struct folio *folio)
}
static inline void zswap_invalidate(swp_entry_t swp) {}
-static inline int zswap_swapon(int type, unsigned long nr_pages)
-{
- return 0;
-}
-static inline void zswap_swapoff(int type) {}
static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
static inline void zswap_folio_swapin(struct folio *folio) {}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4b4126d4e2769..3f70df488c1da 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2970,7 +2970,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
spin_unlock(&p->lock);
spin_unlock(&swap_lock);
arch_swap_invalidate_area(p->type);
- zswap_swapoff(p->type);
mutex_unlock(&swapon_mutex);
kfree(p->global_cluster);
p->global_cluster = NULL;
@@ -3613,10 +3612,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
}
}
- error = zswap_swapon(si->type, maxpages);
- if (error)
- goto bad_swap_unlock_inode;
-
/*
* Flush any pending IO and dirty mappings before we start using this
* swap device.
@@ -3625,7 +3620,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
error = inode_drain_writes(inode);
if (error) {
inode->i_flags &= ~S_SWAPFILE;
- goto free_swap_zswap;
+ goto bad_swap_unlock_inode;
}
mutex_lock(&swapon_mutex);
@@ -3648,8 +3643,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
error = 0;
goto out;
-free_swap_zswap:
- zswap_swapoff(si->type);
bad_swap_unlock_inode:
inode_unlock(inode);
bad_swap:
diff --git a/mm/zswap.c b/mm/zswap.c
index a5a3f068bd1a6..f7313261673ff 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -197,8 +197,6 @@ struct zswap_entry {
struct list_head lru;
};
-static struct xarray *zswap_trees[MAX_SWAPFILES];
-static unsigned int nr_zswap_trees[MAX_SWAPFILES];
/* RCU-protected iteration */
static LIST_HEAD(zswap_pools);
@@ -225,45 +223,35 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/
-/* One swap address space for each 64M swap space */
-#define ZSWAP_ADDRESS_SPACE_SHIFT 14
-#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
-static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
-{
- return &zswap_trees[swp_type(swp)][swp_offset(swp)
- >> ZSWAP_ADDRESS_SPACE_SHIFT];
-}
+static DEFINE_XARRAY(zswap_tree);
+
+#define zswap_tree_index(entry) (entry.val)
static inline void *zswap_entry_store(swp_entry_t swpentry,
struct zswap_entry *entry)
{
- struct xarray *tree = swap_zswap_tree(swpentry);
- pgoff_t offset = swp_offset(swpentry);
+ pgoff_t offset = zswap_tree_index(swpentry);
- return xa_store(tree, offset, entry, GFP_KERNEL);
+ return xa_store(&zswap_tree, offset, entry, GFP_KERNEL);
}
static inline void *zswap_entry_load(swp_entry_t swpentry)
{
- struct xarray *tree = swap_zswap_tree(swpentry);
- pgoff_t offset = swp_offset(swpentry);
+ pgoff_t offset = zswap_tree_index(swpentry);
- return xa_load(tree, offset);
+ return xa_load(&zswap_tree, offset);
}
static inline void *zswap_entry_erase(swp_entry_t swpentry)
{
- struct xarray *tree = swap_zswap_tree(swpentry);
- pgoff_t offset = swp_offset(swpentry);
+ pgoff_t offset = zswap_tree_index(swpentry);
- return xa_erase(tree, offset);
+ return xa_erase(&zswap_tree, offset);
}
static inline bool zswap_empty(swp_entry_t swpentry)
{
- struct xarray *tree = swap_zswap_tree(swpentry);
-
- return xa_empty(tree);
+ return xa_empty(&zswap_tree);
}
#define zswap_pool_debug(msg, p) \
@@ -1691,43 +1679,6 @@ void zswap_invalidate(swp_entry_t swp)
zswap_entry_free(entry);
}
-int zswap_swapon(int type, unsigned long nr_pages)
-{
- struct xarray *trees, *tree;
- unsigned int nr, i;
-
- nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
- trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
- if (!trees) {
- pr_err("alloc failed, zswap disabled for swap type %d\n", type);
- return -ENOMEM;
- }
-
- for (i = 0; i < nr; i++)
- xa_init(trees + i);
-
- nr_zswap_trees[type] = nr;
- zswap_trees[type] = trees;
- return 0;
-}
-
-void zswap_swapoff(int type)
-{
- struct xarray *trees = zswap_trees[type];
- unsigned int i;
-
- if (!trees)
- return;
-
- /* try_to_unuse() invalidated all the entries already */
- for (i = 0; i < nr_zswap_trees[type]; i++)
- WARN_ON_ONCE(!xa_empty(trees + i));
-
- kvfree(trees);
- nr_zswap_trees[type] = 0;
- zswap_trees[type] = NULL;
-}
-
/*********************************
* debugfs functions
**********************************/
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (7 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 08/20] zswap: prepare zswap for swap virtualization Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-09 17:12 ` kernel test robot
2026-02-11 13:42 ` kernel test robot
2026-02-08 21:58 ` [PATCH v3 10/20] swap: move swap cache to virtual swap descriptor Nhat Pham
` (12 subsequent siblings)
21 siblings, 2 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
For the new virtual swap space design, dynamically allocate a virtual
slot (as well as an associated metadata structure) for each swapped-out
page, and associate it with the (physical) swap slot on the swapfile/swap
partition. This virtual swap slot is now stored in page table entries
and used to index into swap data structures (swap cache, zswap tree,
swap cgroup array), in place of the old physical swap slot.
For now, there is always a physical slot in the swapfile associated with
each virtual swap slot (except those about to be freed). The virtual
swap slot's lifetime is still tied to the lifetime of its physical swap
slot. We do change the freeing order a bit: we clear the shadow,
invalidate the zswap entry, and uncharge the swap cgroup when we release
the virtual swap slot, since we now use virtual swap slots to index into
these swap data structures.
We also repurpose the swap table infrastructure as a reverse map to look
up the virtual swap slot from its associated physical swap slot on the
swapfile. This is used in cluster readahead, as well as several swapfile
operations, such as the swap slot reclamation that happens when the
swapfile is almost full. It will also be used in a future patch that
simplifies swapoff.
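For intuition, here is a minimal user-space sketch (not kernel code) of
the two mappings described above; the array names are hypothetical
stand-ins for the swap descriptor (forward map) and the repurposed swap
table (reverse map):
#include <stdio.h>

#define NR_SLOTS 64

/* forward map: virtual slot -> physical slot (held in the swap descriptor) */
static unsigned long forward[NR_SLOTS];
/* reverse map: physical slot -> virtual slot (held in the swap table) */
static unsigned long reverse[NR_SLOTS];

int main(void)
{
	unsigned long ventry = 5, pslot = 42;

	/* established at swapout time, torn down when the slots are freed */
	forward[ventry] = pslot;	/* what swp_entry_to_swp_slot() resolves */
	reverse[pslot] = ventry;	/* what swp_slot_to_swp_entry() resolves */

	printf("virtual %lu -> physical %lu -> virtual %lu\n",
	       ventry, forward[ventry], reverse[forward[ventry]]);
	return 0;
}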
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/cpuhotplug.h | 1 +
include/linux/swap.h | 49 +--
mm/internal.h | 28 +-
mm/page_io.c | 6 +-
mm/shmem.c | 9 +-
mm/swap.h | 8 +-
mm/swap_state.c | 5 +-
mm/swapfile.c | 63 +---
mm/vswap.c | 658 +++++++++++++++++++++++++++++++++++++
9 files changed, 710 insertions(+), 117 deletions(-)
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 62cd7b35a29c9..85cb45022e796 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -86,6 +86,7 @@ enum cpuhp_state {
CPUHP_FS_BUFF_DEAD,
CPUHP_PRINTK_DEAD,
CPUHP_MM_MEMCQ_DEAD,
+ CPUHP_MM_VSWAP_DEAD,
CPUHP_PERCPU_CNT_DEAD,
CPUHP_RADIX_DEAD,
CPUHP_PAGE_ALLOC,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1ff463fb3a966..0410a00fd353c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -471,6 +471,7 @@ static inline long get_nr_swap_pages(void)
}
void si_swapinfo(struct sysinfo *);
+int swap_slot_alloc(swp_slot_t *slot, unsigned int order);
swp_slot_t swap_slot_alloc_of_type(int);
int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
@@ -670,48 +671,12 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
#endif
int vswap_init(void);
-
-/**
- * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
- * virtual swap slot.
- * @entry: the virtual swap slot.
- *
- * Return: the physical swap slot corresponding to the virtual swap slot.
- */
-static inline swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
-{
- return (swp_slot_t) { entry.val };
-}
-
-/**
- * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a
- * physical swap slot.
- * @slot: the physical swap slot.
- *
- * Return: the virtual swap slot corresponding to the physical swap slot.
- */
-static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
-{
- return (swp_entry_t) { slot.val };
-}
-
-static inline bool tryget_swap_entry(swp_entry_t entry,
- struct swap_info_struct **sip)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *si = swap_slot_tryget_swap_info(slot);
-
- if (sip)
- *sip = si;
-
- return si;
-}
-
-static inline void put_swap_entry(swp_entry_t entry,
- struct swap_info_struct *si)
-{
- swap_slot_put_swap_info(si);
-}
+void vswap_exit(void);
+void vswap_free(swp_entry_t entry, struct swap_cluster_info *ci);
+swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry);
+swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot);
+bool tryget_swap_entry(swp_entry_t entry, struct swap_info_struct **si);
+void put_swap_entry(swp_entry_t entry, struct swap_info_struct *si);
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/internal.h b/mm/internal.h
index e739e8cac5b55..7ced0def684ca 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -322,6 +322,25 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio,
unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
unsigned int max_nr);
+static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
+{
+ return (swp_entry_t) { entry.val + n };
+}
+
+/* similar to swap_nth, but check the backing physical slots as well. */
+static inline swp_entry_t swap_move(swp_entry_t entry, long delta)
+{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot;
+ swp_entry_t next_entry = swap_nth(entry, delta);
+
+ next_slot = swp_entry_to_swp_slot(next_entry);
+ if (swp_slot_type(slot) != swp_slot_type(next_slot) ||
+ swp_slot_offset(slot) + delta != swp_slot_offset(next_slot))
+ next_entry.val = 0;
+
+ return next_entry;
+}
+
/**
* pte_move_swp_offset - Move the swap entry offset field of a swap pte
* forward or backward by delta
@@ -334,13 +353,8 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
*/
static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
- softleaf_t entry = softleaf_from_pte(pte), new_entry;
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- pte_t new;
-
- new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot),
- swp_slot_offset(slot) + delta));
- new = swp_entry_to_pte(new_entry);
+ softleaf_t entry = softleaf_from_pte(pte);
+ pte_t new = swp_entry_to_pte(swap_move(entry, delta));
if (pte_swp_soft_dirty(pte))
new = pte_swp_mksoft_dirty(new);
diff --git a/mm/page_io.c b/mm/page_io.c
index 0b02bcc85e2a8..5de3705572955 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -364,7 +364,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
*/
pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
ret,
- swap_slot_pos(swp_entry_to_swp_slot(page_swap_entry(page))));
+ swap_slot_dev_pos(swp_entry_to_swp_slot(page_swap_entry(page))));
for (p = 0; p < sio->pages; p++) {
page = sio->bvec[p].bv_page;
set_page_dirty(page);
@@ -384,7 +384,7 @@ static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap_plug)
swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
struct swap_info_struct *sis = __swap_slot_to_info(slot);
struct file *swap_file = sis->swap_file;
- loff_t pos = swap_slot_pos(slot);
+ loff_t pos = swap_slot_dev_pos(slot);
count_swpout_vm_event(folio);
folio_start_writeback(folio);
@@ -549,7 +549,7 @@ static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
struct swap_info_struct *sis = __swap_slot_to_info(slot);
struct swap_iocb *sio = NULL;
- loff_t pos = swap_slot_pos(slot);
+ loff_t pos = swap_slot_dev_pos(slot);
if (plug)
sio = *plug;
diff --git a/mm/shmem.c b/mm/shmem.c
index 400e2fa8e77cb..13f7469a04c8a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2227,7 +2227,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
int split_order = 0;
int i;
- swp_slot_t slot = swp_entry_to_swp_slot(swap);
/* Convert user data gfp flags to xarray node gfp flags */
gfp &= GFP_RECLAIM_MASK;
@@ -2268,13 +2267,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
*/
for (i = 0; i < 1 << cur_order;
i += (1 << split_order)) {
- swp_entry_t tmp_entry;
- swp_slot_t tmp_slot;
-
- tmp_slot =
- swp_slot(swp_slot_type(slot),
- swp_slot_offset(slot) + swap_offset + i);
- tmp_entry = swp_slot_to_swp_entry(tmp_slot);
+ swp_entry_t tmp_entry = swap_nth(swap, swap_offset + i);
__xa_store(&mapping->i_pages, aligned_index + i,
swp_to_radix_entry(tmp_entry), 0);
diff --git a/mm/swap.h b/mm/swap.h
index bdf7aca146643..5eb53758bbd5d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -36,7 +36,11 @@ struct swap_cluster_info {
u16 count;
u8 flags;
u8 order;
- atomic_long_t __rcu *table; /* Swap table entries, see mm/swap_table.h */
+ /*
+ * Reverse map, to look up the virtual swap slot backed by a given physical
+ * swap slot.
+ */
+ atomic_long_t __rcu *table;
struct list_head list;
};
@@ -212,7 +216,7 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
}
/* Return the swap device position of the swap slot. */
-static inline loff_t swap_slot_pos(swp_slot_t slot)
+static inline loff_t swap_slot_dev_pos(swp_slot_t slot)
{
return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 29ec666be4204..c5ceccd756699 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -891,7 +891,8 @@ static int __init swap_init(void)
swap_kobj = kobject_create_and_add("swap", mm_kobj);
if (!swap_kobj) {
pr_err("failed to create swap kobject\n");
- return -ENOMEM;
+ err = -ENOMEM;
+ goto vswap_exit;
}
err = sysfs_create_group(swap_kobj, &swap_attr_group);
if (err) {
@@ -904,6 +905,8 @@ static int __init swap_init(void)
delete_obj:
kobject_put(swap_kobj);
+vswap_exit:
+ vswap_exit();
return err;
}
subsys_initcall(swap_init);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3f70df488c1da..68ec5d9f05848 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1249,7 +1249,6 @@ static void swap_range_alloc(struct swap_info_struct *si,
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries)
{
- unsigned long begin = offset;
unsigned long end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
unsigned int i;
@@ -1258,10 +1257,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
* Use atomic clear_bit operations only on zeromap instead of non-atomic
* bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
*/
- for (i = 0; i < nr_entries; i++) {
+ for (i = 0; i < nr_entries; i++)
clear_bit(offset + i, si->zeromap);
- zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i)));
- }
if (si->flags & SWP_BLKDEV)
swap_slot_free_notify =
@@ -1274,7 +1271,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
swap_slot_free_notify(si->bdev, offset);
offset++;
}
- swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
/*
* Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1405,7 +1401,7 @@ static bool swap_sync_discard(void)
return false;
}
-static int swap_slot_alloc(swp_slot_t *slot, unsigned int order)
+int swap_slot_alloc(swp_slot_t *slot, unsigned int order)
{
unsigned int size = 1 << order;
@@ -1441,53 +1437,6 @@ static int swap_slot_alloc(swp_slot_t *slot, unsigned int order)
return 0;
}
-/**
- * folio_alloc_swap - allocate swap space for a folio
- * @folio: folio we want to move to swap
- *
- * Allocate swap space for the folio and add the folio to the
- * swap cache.
- *
- * Context: Caller needs to hold the folio lock.
- * Return: Whether the folio was added to the swap cache.
- */
-int folio_alloc_swap(struct folio *folio)
-{
- unsigned int order = folio_order(folio);
- swp_slot_t slot = { 0 };
- swp_entry_t entry = {};
- int err = 0, ret;
-
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
-
- ret = swap_slot_alloc(&slot, order);
- if (ret)
- return ret;
-
- /* XXX: for now, physical and virtual swap slots are identical */
- entry.val = slot.val;
-
- /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
- if (mem_cgroup_try_charge_swap(folio, entry)) {
- err = -ENOMEM;
- goto out_free;
- }
-
- if (!slot.val)
- return -ENOMEM;
-
- err = swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL);
- if (err)
- goto out_free;
-
- return 0;
-
-out_free:
- put_swap_folio(folio, entry);
- return err;
-}
-
static struct swap_info_struct *_swap_info_get(swp_slot_t slot)
{
struct swap_info_struct *si;
@@ -1733,6 +1682,13 @@ static void swap_slots_free(struct swap_info_struct *si,
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
swp_entry_t entry = swp_slot_to_swp_entry(slot);
+ int i;
+
+ /* release all the associated (virtual) swap slots */
+ for (i = 0; i < nr_pages; i++) {
+ vswap_free(entry, ci);
+ entry.val++;
+ }
/* It should never free entries across different clusters */
VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
@@ -1745,7 +1701,6 @@ static void swap_slots_free(struct swap_info_struct *si,
*map = 0;
} while (++map < map_end);
- mem_cgroup_uncharge_swap(entry, nr_pages);
swap_range_free(si, offset, nr_pages);
if (!ci->count)
diff --git a/mm/vswap.c b/mm/vswap.c
index e68234f053fc9..9aa95558f320a 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -4,7 +4,147 @@
*
* Copyright (C) 2024 Meta Platforms, Inc., Nhat Pham
*/
+#include <linux/mm.h>
+#include <linux/gfp.h>
#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/swap_cgroup.h>
+#include <linux/cpuhotplug.h>
+#include "swap.h"
+#include "swap_table.h"
+
+/*
+ * Virtual Swap Space
+ *
+ * We associate with each swapped out page a virtual swap slot. This will allow
+ * us to change the backing state of a swapped out page without having to
+ * update every single page table entry referring to it.
+ *
+ * For now, there is a one-to-one correspondence between a virtual swap slot
+ * and its associated physical swap slot.
+ *
+ * Virtual swap slots are organized into PMD-sized clusters, analogous to
+ * the physical swap allocator. However, unlike the physical swap allocator,
+ * the clusters are dynamically allocated and freed on-demand. There is no
+ * "free list" of virtual swap clusters - new free clusters are allocated
+ * directly from the cluster map xarray.
+ *
+ * This allows us to avoid the overhead of pre-allocating a large number of
+ * virtual swap clusters.
+ */
+
+/**
+ * Swap descriptor - metadata of a swapped out page.
+ *
+ * @slot: The handle to the physical swap slot backing this page.
+ */
+struct swp_desc {
+ swp_slot_t slot;
+};
+
+#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER
+#define VSWAP_CLUSTER_SIZE (1UL << VSWAP_CLUSTER_SHIFT)
+#define VSWAP_CLUSTER_MASK (VSWAP_CLUSTER_SIZE - 1)
+
+/*
+ * Map from a cluster id to the cluster structure, which tracks the number of
+ * allocated virtual swap slots in the (PMD-sized) cluster. This allows us to
+ * quickly allocate an empty cluster for a large folio being swapped out.
+ *
+ * This xarray's lock is also used as the "global" allocator lock (for e.g, to
+ * synchronize global cluster lists manipulation).
+ */
+static DEFINE_XARRAY_FLAGS(vswap_cluster_map, XA_FLAGS_TRACK_FREE);
+
+#if SWP_TYPE_SHIFT > 32
+/*
+ * On 64-bit architectures, the maximum number of virtual swap slots is capped
+ * by the number of clusters (as the vswap_cluster_map xarray can only allocate
+ * up to U32_MAX cluster ids).
+ */
+#define MAX_VSWAP \
+ (((unsigned long)U32_MAX << VSWAP_CLUSTER_SHIFT) + (VSWAP_CLUSTER_SIZE - 1))
+#else
+/*
+ * On 32-bit architectures, just make sure the range of virtual swap slots is
+ * the same as the range of physical swap slots.
+ */
+#define MAX_VSWAP (((MAX_SWAPFILES - 1) << SWP_TYPE_SHIFT) | SWP_OFFSET_MASK)
+#endif
+
+static const struct xa_limit vswap_cluster_map_limit = {
+ .max = MAX_VSWAP >> VSWAP_CLUSTER_SHIFT,
+ .min = 0,
+};
+
+static struct list_head partial_clusters_lists[SWAP_NR_ORDERS];
+
+/**
+ * struct vswap_cluster
+ *
+ * @lock: Spinlock protecting the cluster's data
+ * @rcu: RCU head for deferred freeing when the cluster is no longer in use
+ * @list: List entry for tracking in partial_clusters_lists when not fully allocated
+ * @id: Unique identifier for this cluster, used to calculate swap slot values
+ * @count: Number of allocated virtual swap slots in this cluster
+ * @order: Order of allocation (0 for single pages, higher for contiguous ranges)
+ * @cached: Whether this cluster is cached in a per-CPU variable for fast allocation
+ * @full: Whether this cluster is considered full (no more allocations possible)
+ * @refcnt: Reference count tracking usage of slots in this cluster
+ * @bitmap: Bitmap tracking which slots in the cluster are allocated
+ * @descriptors: Pointer to array of swap descriptors for each slot in the cluster
+ *
+ * A vswap_cluster manages a PMD-sized group of contiguous virtual swap slots.
+ * It tracks which slots are allocated using a bitmap and maintains the
+ * swap descriptors in an array. The cluster is reference-counted and freed when
+ * all of its slots are released and the cluster is not cached. Each cluster
+ * only allocates aligned slots of a single order, determined when the cluster is
+ * allocated (and never changes for the entire lifetime of the cluster).
+ *
+ * Clusters can be in the following states:
+ * - Cached in per-CPU variables for fast allocation.
+ * - In partial_clusters_lists when partially allocated but not cached.
+ * - Marked as full when no more allocations are possible.
+ */
+struct vswap_cluster {
+ spinlock_t lock;
+ union {
+ struct rcu_head rcu;
+ struct list_head list;
+ };
+ unsigned long id;
+ unsigned int count:VSWAP_CLUSTER_SHIFT + 1;
+ unsigned int order:4;
+ bool cached:1;
+ bool full:1;
+ refcount_t refcnt;
+ DECLARE_BITMAP(bitmap, VSWAP_CLUSTER_SIZE);
+ struct swp_desc descriptors[VSWAP_CLUSTER_SIZE];
+};
+
+#define VSWAP_VAL_CLUSTER_IDX(val) ((val) >> VSWAP_CLUSTER_SHIFT)
+#define VSWAP_CLUSTER_IDX(entry) VSWAP_VAL_CLUSTER_IDX(entry.val)
+#define VSWAP_IDX_WITHIN_CLUSTER_VAL(val) ((val) & VSWAP_CLUSTER_MASK)
+#define VSWAP_IDX_WITHIN_CLUSTER(entry) VSWAP_IDX_WITHIN_CLUSTER_VAL(entry.val)
+
+struct percpu_vswap_cluster {
+ struct vswap_cluster *clusters[SWAP_NR_ORDERS];
+ local_lock_t lock;
+};
+
+/*
+ * Per-CPU cache of the last allocated cluster for each order. This allows
+ * allocation fast path to skip the global vswap_cluster_map's spinlock, if
+ * the locally cached cluster still has free slots. Note that caching a cluster
+ * also increments its reference count.
+ */
+static DEFINE_PER_CPU(struct percpu_vswap_cluster, percpu_vswap_cluster) = {
+ .clusters = { NULL, },
+ .lock = INIT_LOCAL_LOCK(),
+};
+
+static atomic_t vswap_alloc_reject;
+static atomic_t vswap_used;
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
@@ -17,6 +157,10 @@ static int vswap_debug_fs_init(void)
return -ENODEV;
vswap_debugfs_root = debugfs_create_dir("vswap", NULL);
+ debugfs_create_atomic_t("alloc_reject", 0444,
+ vswap_debugfs_root, &vswap_alloc_reject);
+ debugfs_create_atomic_t("used", 0444, vswap_debugfs_root, &vswap_used);
+
return 0;
}
#else
@@ -26,10 +170,524 @@ static int vswap_debug_fs_init(void)
}
#endif
+static struct swp_desc *vswap_iter(struct vswap_cluster **clusterp, unsigned long i)
+{
+ unsigned long cluster_id = VSWAP_VAL_CLUSTER_IDX(i);
+ struct vswap_cluster *cluster = *clusterp;
+ struct swp_desc *desc = NULL;
+ unsigned long slot_index;
+
+ if (!cluster || cluster_id != cluster->id) {
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ if (!cluster)
+ goto done;
+ VM_WARN_ON(cluster->id != cluster_id);
+ spin_lock(&cluster->lock);
+ }
+
+ slot_index = VSWAP_IDX_WITHIN_CLUSTER_VAL(i);
+ if (test_bit(slot_index, cluster->bitmap))
+ desc = &cluster->descriptors[slot_index];
+
+ if (!desc) {
+ spin_unlock(&cluster->lock);
+ cluster = NULL;
+ }
+
+done:
+ *clusterp = cluster;
+ return desc;
+}
+
+static bool cluster_is_alloc_candidate(struct vswap_cluster *cluster)
+{
+ return cluster->count + (1 << (cluster->order)) <= VSWAP_CLUSTER_SIZE;
+}
+
+static void __vswap_alloc_from_cluster(struct vswap_cluster *cluster, int start)
+{
+ int i, nr = 1 << cluster->order;
+ struct swp_desc *desc;
+
+ for (i = 0; i < nr; i++) {
+ desc = &cluster->descriptors[start + i];
+ desc->slot.val = 0;
+ }
+ cluster->count += nr;
+}
+
+static unsigned long vswap_alloc_from_cluster(struct vswap_cluster *cluster)
+{
+ int nr = 1 << cluster->order;
+ unsigned long i = cluster->id ? 0 : nr;
+
+ VM_WARN_ON(!spin_is_locked(&cluster->lock));
+ if (!cluster_is_alloc_candidate(cluster))
+ return 0;
+
+ /* Find the first free range of nr contiguous aligned slots */
+ i = bitmap_find_next_zero_area(cluster->bitmap,
+ VSWAP_CLUSTER_SIZE, i, nr, nr - 1);
+ if (i >= VSWAP_CLUSTER_SIZE)
+ return 0;
+
+ /* Mark the range as allocated in the bitmap */
+ bitmap_set(cluster->bitmap, i, nr);
+
+ refcount_add(nr, &cluster->refcnt);
+ __vswap_alloc_from_cluster(cluster, i);
+ return i + (cluster->id << VSWAP_CLUSTER_SHIFT);
+}
+
+/* Allocate a contiguous range of virtual swap slots */
+static swp_entry_t vswap_alloc(int order)
+{
+ struct xa_limit limit = vswap_cluster_map_limit;
+ struct vswap_cluster *local, *cluster;
+ int nr = 1 << order;
+ bool need_caching = true;
+ u32 cluster_id;
+ swp_entry_t entry;
+
+ entry.val = 0;
+
+ /* first, let's try the locally cached cluster */
+ rcu_read_lock();
+ local_lock(&percpu_vswap_cluster.lock);
+ cluster = this_cpu_read(percpu_vswap_cluster.clusters[order]);
+ if (cluster) {
+ spin_lock(&cluster->lock);
+ entry.val = vswap_alloc_from_cluster(cluster);
+ need_caching = !entry.val;
+
+ if (!entry.val || !cluster_is_alloc_candidate(cluster)) {
+ this_cpu_write(percpu_vswap_cluster.clusters[order], NULL);
+ cluster->cached = false;
+ refcount_dec(&cluster->refcnt);
+ cluster->full = true;
+ }
+ spin_unlock(&cluster->lock);
+ }
+ local_unlock(&percpu_vswap_cluster.lock);
+ rcu_read_unlock();
+
+ /*
+ * Local cluster does not have space. Let's try the uncached partial
+ * clusters before acquiring a new free cluster to reduce fragmentation,
+ * and avoid having to allocate a new cluster structure.
+ */
+ if (!entry.val) {
+ cluster = NULL;
+ xa_lock(&vswap_cluster_map);
+ list_for_each_entry_safe(cluster, local,
+ &partial_clusters_lists[order], list) {
+ if (!spin_trylock(&cluster->lock))
+ continue;
+
+ entry.val = vswap_alloc_from_cluster(cluster);
+ list_del_init(&cluster->list);
+ cluster->full = !entry.val || !cluster_is_alloc_candidate(cluster);
+ need_caching = !cluster->full;
+ spin_unlock(&cluster->lock);
+ if (entry.val)
+ break;
+ }
+ xa_unlock(&vswap_cluster_map);
+ }
+
+ /* try a new free cluster */
+ if (!entry.val) {
+ cluster = kvzalloc(sizeof(*cluster), GFP_KERNEL);
+ if (cluster) {
+ /* first cluster cannot allocate a PMD-sized THP */
+ if (order == SWAP_NR_ORDERS - 1)
+ limit.min = 1;
+
+ if (!xa_alloc(&vswap_cluster_map, &cluster_id, cluster, limit,
+ GFP_KERNEL)) {
+ spin_lock_init(&cluster->lock);
+ cluster->id = cluster_id;
+ cluster->order = order;
+ INIT_LIST_HEAD(&cluster->list);
+ /* Initialize bitmap to all zeros (all slots free) */
+ bitmap_zero(cluster->bitmap, VSWAP_CLUSTER_SIZE);
+ entry.val = cluster->id << VSWAP_CLUSTER_SHIFT;
+ refcount_set(&cluster->refcnt, nr);
+ if (!cluster_id)
+ entry.val += nr;
+ __vswap_alloc_from_cluster(cluster,
+ (entry.val & VSWAP_CLUSTER_MASK));
+ /* Mark the allocated range in the bitmap */
+ bitmap_set(cluster->bitmap, (entry.val & VSWAP_CLUSTER_MASK), nr);
+ need_caching = cluster_is_alloc_candidate(cluster);
+ } else {
+ /* Failed to insert into cluster map, free the cluster */
+ kvfree(cluster);
+ cluster = NULL;
+ }
+ }
+ }
+
+ if (need_caching && entry.val) {
+ local_lock(&percpu_vswap_cluster.lock);
+ local = this_cpu_read(percpu_vswap_cluster.clusters[order]);
+ if (local != cluster) {
+ if (local) {
+ spin_lock(&local->lock);
+ /* only update the local cache if cached cluster is full */
+ need_caching = !cluster_is_alloc_candidate(local);
+ if (need_caching) {
+ this_cpu_write(percpu_vswap_cluster.clusters[order], NULL);
+ local->cached = false;
+ refcount_dec(&local->refcnt);
+ }
+ spin_unlock(&local->lock);
+ }
+
+ VM_WARN_ON(!cluster);
+ spin_lock(&cluster->lock);
+ if (cluster_is_alloc_candidate(cluster)) {
+ if (need_caching) {
+ this_cpu_write(percpu_vswap_cluster.clusters[order], cluster);
+ refcount_inc(&cluster->refcnt);
+ cluster->cached = true;
+ } else {
+ xa_lock(&vswap_cluster_map);
+ VM_WARN_ON(!list_empty(&cluster->list));
+ list_add(&cluster->list, &partial_clusters_lists[order]);
+ xa_unlock(&vswap_cluster_map);
+ }
+ }
+ spin_unlock(&cluster->lock);
+ }
+ local_unlock(&percpu_vswap_cluster.lock);
+ }
+
+ if (entry.val) {
+ VM_WARN_ON(entry.val + nr - 1 > MAX_VSWAP);
+ atomic_add(nr, &vswap_used);
+ } else {
+ atomic_add(nr, &vswap_alloc_reject);
+ }
+ return entry;
+}
+
+static void vswap_cluster_free(struct vswap_cluster *cluster)
+{
+ VM_WARN_ON(cluster->count || cluster->cached);
+ VM_WARN_ON(!spin_is_locked(&cluster->lock));
+ xa_lock(&vswap_cluster_map);
+ list_del_init(&cluster->list);
+ __xa_erase(&vswap_cluster_map, cluster->id);
+ xa_unlock(&vswap_cluster_map);
+ rcu_head_init(&cluster->rcu);
+ kvfree_rcu(cluster, rcu);
+}
+
+static inline void release_vswap_slot(struct vswap_cluster *cluster,
+ unsigned long index)
+{
+ unsigned long slot_index = VSWAP_IDX_WITHIN_CLUSTER_VAL(index);
+
+ VM_WARN_ON(!spin_is_locked(&cluster->lock));
+ cluster->count--;
+
+ bitmap_clear(cluster->bitmap, slot_index, 1);
+
+ /* we only free uncached empty clusters */
+ if (refcount_dec_and_test(&cluster->refcnt))
+ vswap_cluster_free(cluster);
+ else if (cluster->full && cluster_is_alloc_candidate(cluster)) {
+ cluster->full = false;
+ if (!cluster->cached) {
+ xa_lock(&vswap_cluster_map);
+ VM_WARN_ON(!list_empty(&cluster->list));
+ list_add_tail(&cluster->list,
+ &partial_clusters_lists[cluster->order]);
+ xa_unlock(&vswap_cluster_map);
+ }
+ }
+
+ atomic_dec(&vswap_used);
+}
+
+/*
+ * Update the physical-to-virtual swap slot mapping.
+ * Caller must ensure the physical swap slot's cluster is locked.
+ */
+static void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
+ unsigned long vswap, int nr)
+{
+ atomic_long_t *table;
+ unsigned long slot_offset = swp_slot_offset(slot);
+ unsigned int ci_off = slot_offset % SWAPFILE_CLUSTER;
+ int i;
+
+ table = rcu_dereference_protected(ci->table, lockdep_is_held(&ci->lock));
+ VM_WARN_ON(!table);
+ for (i = 0; i < nr; i++)
+ __swap_table_set(ci, ci_off + i, vswap ? vswap + i : 0);
+}
+
+/**
+ * vswap_free - free a virtual swap slot.
+ * @entry: the virtual swap slot to free
+ * @ci: the physical swap slot's cluster (optional, can be NULL)
+ *
+ * If @ci is NULL, this function is called to clean up a virtual swap entry
+ * when no linkage has been established between physical and virtual swap slots.
+ * If @ci is provided, the caller must ensure it is locked.
+ */
+void vswap_free(swp_entry_t entry, struct swap_cluster_info *ci)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+
+ if (!entry.val)
+ return;
+
+ swap_cache_clear_shadow(entry, 1);
+ zswap_invalidate(entry);
+ mem_cgroup_uncharge_swap(entry, 1);
+
+ /* do not immediately erase the virtual slot to prevent its reuse */
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return;
+ }
+
+ if (desc->slot.val)
+ vswap_rmap_set(ci, desc->slot, 0, 1);
+
+ /* erase forward mapping and release the virtual slot for reallocation */
+ release_vswap_slot(cluster, entry.val);
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+}
+
+/**
+ * folio_alloc_swap - allocate swap space for a folio.
+ * @folio: the folio.
+ *
+ * Return: 0 if the allocation succeeded, -ENOMEM if the allocation failed.
+ */
+int folio_alloc_swap(struct folio *folio)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ int i, err, nr = folio_nr_pages(folio), order = folio_order(folio);
+ struct swp_desc *desc;
+ swp_entry_t entry;
+ swp_slot_t slot;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
+
+ entry = vswap_alloc(folio_order(folio));
+ if (!entry.val)
+ return -ENOMEM;
+
+ /*
+ * XXX: for now, we always allocate a physical swap slot for each virtual
+ * swap slot, and their lifetimes are coupled. This will change once we
+ * decouple virtual swap slots from their backing states, and only allocate
+ * physical swap slots for them on demand (i.e., on zswap writeback, or
+ * fallback from zswap store failure).
+ */
+ if (swap_slot_alloc(&slot, order)) {
+ for (i = 0; i < nr; i++)
+ vswap_free((swp_entry_t){entry.val + i}, NULL);
+ entry.val = 0;
+ return -ENOMEM;
+ }
+
+ /* establish the virtual <-> physical swap slot linkages. */
+ si = __swap_slot_to_info(slot);
+ ci = swap_cluster_lock(si, swp_slot_offset(slot));
+ vswap_rmap_set(ci, slot, entry.val, nr);
+ swap_cluster_unlock(ci);
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+
+ desc->slot.val = slot.val + i;
+ }
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ /*
+ * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
+ * swap slot allocation. This is acceptable because, as noted above, each
+ * virtual swap slot corresponds to a physical swap slot. Once we have
+ * decoupled virtual and physical swap slots, we will only charge when we
+ * actually allocate a physical swap slot.
+ */
+ if (mem_cgroup_try_charge_swap(folio, entry))
+ goto out_free;
+
+ err = swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL);
+ if (err)
+ goto out_free;
+
+ return 0;
+
+out_free:
+ put_swap_folio(folio, entry);
+ return -ENOMEM;
+}
+
+/**
+ * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
+ * virtual swap slot.
+ * @entry: the virtual swap slot.
+ *
+ * Return: the physical swap slot corresponding to the virtual swap slot.
+ */
+swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ swp_slot_t slot;
+
+ slot.val = 0;
+ if (!entry.val)
+ return slot;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return (swp_slot_t){0};
+ }
+ slot = desc->slot;
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return slot;
+}
+
+/**
+ * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a
+ * physical swap slot.
+ * @slot: the physical swap slot.
+ *
+ * Return: the virtual swap slot corresponding to the physical swap slot.
+ */
+swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
+{
+ swp_entry_t ret;
+ struct swap_cluster_info *ci;
+ unsigned long offset;
+ unsigned int ci_off;
+
+ ret.val = 0;
+ if (!slot.val)
+ return ret;
+
+ offset = swp_slot_offset(slot);
+ ci_off = offset % SWAPFILE_CLUSTER;
+ ci = __swap_slot_to_cluster(slot);
+
+ ret.val = swap_table_get(ci, ci_off);
+ return ret;
+}
+
+bool tryget_swap_entry(swp_entry_t entry, struct swap_info_struct **si)
+{
+ struct vswap_cluster *cluster;
+ swp_slot_t slot;
+
+ slot = swp_entry_to_swp_slot(entry);
+ *si = swap_slot_tryget_swap_info(slot);
+ if (!*si)
+ return false;
+
+ /*
+ * Ensure the cluster and its associated data structures (swap cache etc.)
+ * remain valid.
+ */
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, VSWAP_CLUSTER_IDX(entry));
+ if (!cluster || !refcount_inc_not_zero(&cluster->refcnt)) {
+ rcu_read_unlock();
+ swap_slot_put_swap_info(*si);
+ *si = NULL;
+ return false;
+ }
+ rcu_read_unlock();
+ return true;
+}
+
+void put_swap_entry(swp_entry_t entry, struct swap_info_struct *si)
+{
+ struct vswap_cluster *cluster;
+
+ if (si)
+ swap_slot_put_swap_info(si);
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, VSWAP_CLUSTER_IDX(entry));
+ spin_lock(&cluster->lock);
+ if (refcount_dec_and_test(&cluster->refcnt))
+ vswap_cluster_free(cluster);
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+}
+
+static int vswap_cpu_dead(unsigned int cpu)
+{
+ struct percpu_vswap_cluster *percpu_cluster;
+ struct vswap_cluster *cluster;
+ int order;
+
+ percpu_cluster = per_cpu_ptr(&percpu_vswap_cluster, cpu);
+
+ rcu_read_lock();
+ local_lock(&percpu_cluster->lock);
+ for (order = 0; order < SWAP_NR_ORDERS; order++) {
+ cluster = percpu_cluster->clusters[order];
+ if (cluster) {
+ percpu_cluster->clusters[order] = NULL;
+ spin_lock(&cluster->lock);
+ cluster->cached = false;
+ if (refcount_dec_and_test(&cluster->refcnt))
+ vswap_cluster_free(cluster);
+ spin_unlock(&cluster->lock);
+ }
+ }
+ local_unlock(&percpu_cluster->lock);
+ rcu_read_unlock();
+
+ return 0;
+}
+
+
int vswap_init(void)
{
+ int i;
+
+ if (cpuhp_setup_state_nocalls(CPUHP_MM_VSWAP_DEAD, "mm/vswap:dead", NULL,
+ vswap_cpu_dead)) {
+ pr_err("Failed to register vswap CPU hotplug callback\n");
+ return -ENOMEM;
+ }
+
if (vswap_debug_fs_init())
pr_warn("Failed to initialize vswap debugfs\n");
+ for (i = 0; i < SWAP_NR_ORDERS; i++)
+ INIT_LIST_HEAD(&partial_clusters_lists[i]);
+
return 0;
}
+
+void vswap_exit(void)
+{
+}
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 10/20] swap: move swap cache to virtual swap descriptor
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (8 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 11/20] zswap: move zswap entry management to the " Nhat Pham
` (11 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Move the swap cache (and the workingset shadows for anonymous pages) to
the virtual swap descriptor. This effectively range-partitions the swap
cache by PMD-sized virtual swap clusters, eliminating swap cache lock
contention.
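As a rough sketch of the locking consequence (a user-space analogy, not
the kernel implementation; the cluster shift and table size below are
assumptions), range-partitioning means swap_cache_lock(entry) only
serializes entries that fall in the same virtual cluster:
#include <stdio.h>
#include <pthread.h>

#define VSWAP_CLUSTER_SHIFT 9	/* PMD order on x86-64; an assumption */
#define NR_CLUSTERS 16		/* toy table size for this sketch */

static pthread_mutex_t cluster_lock[NR_CLUSTERS];

static pthread_mutex_t *cluster_lock_of(unsigned long entry_val)
{
	return &cluster_lock[(entry_val >> VSWAP_CLUSTER_SHIFT) % NR_CLUSTERS];
}

int main(void)
{
	unsigned long a = 4242, b = 4243, c = 1UL << 20;
	int i;

	for (i = 0; i < NR_CLUSTERS; i++)
		pthread_mutex_init(&cluster_lock[i], NULL);

	/* a and b fall in the same virtual cluster and share a lock; c does not */
	printf("a,b share a lock: %d; a,c share a lock: %d\n",
	       cluster_lock_of(a) == cluster_lock_of(b),
	       cluster_lock_of(a) == cluster_lock_of(c));

	pthread_mutex_lock(cluster_lock_of(a));
	/* ... a swap cache insert/delete for entry a would go here ... */
	pthread_mutex_unlock(cluster_lock_of(a));
	return 0;
}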
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/huge_memory.c | 4 +-
mm/migrate.c | 6 +-
mm/shmem.c | 4 +-
mm/swap.h | 16 +--
mm/swap_state.c | 251 +--------------------------------
mm/vmscan.c | 6 +-
mm/vswap.c | 350 ++++++++++++++++++++++++++++++++++++++++++++++-
7 files changed, 364 insertions(+), 273 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 21215ac870144..dcbd3821d6178 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3825,7 +3825,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
return -EINVAL;
}
- swap_cache_lock();
+ swap_cache_lock(folio->swap);
}
/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
@@ -3901,7 +3901,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
unlock_page_lruvec(lruvec);
if (folio_test_swapcache(folio))
- swap_cache_unlock();
+ swap_cache_unlock(folio->swap);
} else {
split_queue_unlock(ds_queue);
return -EAGAIN;
diff --git a/mm/migrate.c b/mm/migrate.c
index 11d9b43dff5d8..e850b05a232de 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -600,13 +600,13 @@ static int __folio_migrate_mapping(struct address_space *mapping,
newzone = folio_zone(newfolio);
if (folio_test_swapcache(folio))
- swap_cache_lock_irq();
+ swap_cache_lock_irq(folio->swap);
else
xas_lock_irq(&xas);
if (!folio_ref_freeze(folio, expected_count)) {
if (folio_test_swapcache(folio))
- swap_cache_unlock_irq();
+ swap_cache_unlock_irq(folio->swap);
else
xas_unlock_irq(&xas);
return -EAGAIN;
@@ -652,7 +652,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
/* Leave irq disabled to prevent preemption while updating stats */
if (folio_test_swapcache(folio))
- swap_cache_unlock();
+ swap_cache_unlock(folio->swap);
else
xas_unlock(&xas);
diff --git a/mm/shmem.c b/mm/shmem.c
index 13f7469a04c8a..66cf8af6779ca 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2168,12 +2168,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
new->swap = entry;
folio_set_swapcache(new);
- swap_cache_lock_irq();
+ swap_cache_lock_irq(entry);
__swap_cache_replace_folio(old, new);
mem_cgroup_replace_folio(old, new);
shmem_update_stats(new, nr_pages);
shmem_update_stats(old, -nr_pages);
- swap_cache_unlock_irq();
+ swap_cache_unlock_irq(entry);
folio_add_lru(new);
*foliop = new;
diff --git a/mm/swap.h b/mm/swap.h
index 5eb53758bbd5d..57ed24a2d6356 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -205,10 +205,12 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
/* linux/mm/swap_state.c */
extern struct address_space swap_space __read_mostly;
-void swap_cache_lock_irq(void);
-void swap_cache_unlock_irq(void);
-void swap_cache_lock(void);
-void swap_cache_unlock(void);
+
+/* linux/mm/vswap.c */
+void swap_cache_lock_irq(swp_entry_t entry);
+void swap_cache_unlock_irq(swp_entry_t entry);
+void swap_cache_lock(swp_entry_t entry);
+void swap_cache_unlock(swp_entry_t entry);
static inline struct address_space *swap_address_space(swp_entry_t entry)
{
@@ -256,12 +258,11 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
*/
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow);
+void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
void swap_cache_del_folio(struct folio *folio);
/* Below helpers require the caller to lock the swap cache. */
void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow);
void __swap_cache_replace_folio(struct folio *old, struct folio *new);
-void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
void show_swap_cache_info(void);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
@@ -422,9 +423,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
-static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadow)
+static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
{
- return 0;
}
static inline void swap_cache_del_folio(struct folio *folio)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c5ceccd756699..00fa3e76a5c19 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -41,28 +41,6 @@ struct address_space swap_space __read_mostly = {
.a_ops = &swap_aops,
};
-static DEFINE_XARRAY(swap_cache);
-
-void swap_cache_lock_irq(void)
-{
- xa_lock_irq(&swap_cache);
-}
-
-void swap_cache_unlock_irq(void)
-{
- xa_unlock_irq(&swap_cache);
-}
-
-void swap_cache_lock(void)
-{
- xa_lock(&swap_cache);
-}
-
-void swap_cache_unlock(void)
-{
- xa_unlock(&swap_cache);
-}
-
static bool enable_vma_readahead __read_mostly = true;
#define SWAP_RA_ORDER_CEILING 5
@@ -94,231 +72,6 @@ void show_swap_cache_info(void)
printk("Total swap = %lukB\n", K(total_swap_pages));
}
-/**
- * swap_cache_get_folio - Looks up a folio in the swap cache.
- * @entry: swap entry used for the lookup.
- *
- * A found folio will be returned unlocked and with its refcount increased.
- *
- * Context: Caller must ensure @entry is valid and protect the swap device
- * with reference count or locks.
- * Return: Returns the found folio on success, NULL otherwise. The caller
- * must lock nd check if the folio still matches the swap entry before
- * use (e.g., folio_matches_swap_entry).
- */
-struct folio *swap_cache_get_folio(swp_entry_t entry)
-{
- void *entry_val;
- struct folio *folio;
-
- for (;;) {
- rcu_read_lock();
- entry_val = xa_load(&swap_cache, entry.val);
- if (!entry_val || xa_is_value(entry_val)) {
- rcu_read_unlock();
- return NULL;
- }
- folio = entry_val;
- if (likely(folio_try_get(folio))) {
- rcu_read_unlock();
- return folio;
- }
- rcu_read_unlock();
- }
-
- return NULL;
-}
-
-/**
- * swap_cache_get_shadow - Looks up a shadow in the swap cache.
- * @entry: swap entry used for the lookup.
- *
- * Context: Caller must ensure @entry is valid and protect the swap device
- * with reference count or locks.
- * Return: Returns either NULL or an XA_VALUE (shadow).
- */
-void *swap_cache_get_shadow(swp_entry_t entry)
-{
- void *entry_val;
-
- rcu_read_lock();
- entry_val = xa_load(&swap_cache, entry.val);
- rcu_read_unlock();
-
- if (xa_is_value(entry_val))
- return entry_val;
- return NULL;
-}
-
-/**
- * swap_cache_add_folio - Add a folio into the swap cache.
- * @folio: The folio to be added.
- * @entry: The swap entry corresponding to the folio.
- * @gfp: gfp_mask for XArray node allocation.
- * @shadowp: If a shadow is found, return the shadow.
- *
- * Context: Caller must ensure @entry is valid and protect the swap device
- * with reference count or locks.
- * The caller also needs to update the corresponding swap_map slots with
- * SWAP_HAS_CACHE bit to avoid race or conflict.
- *
- * Return: 0 on success, negative error code on failure.
- */
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, gfp_t gfp, void **shadowp)
-{
- XA_STATE_ORDER(xas, &swap_cache, entry.val, folio_order(folio));
- unsigned long nr_pages = folio_nr_pages(folio);
- unsigned long i;
- void *old;
-
- VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
-
- folio_ref_add(folio, nr_pages);
- folio_set_swapcache(folio);
- folio->swap = entry;
-
- do {
- xas_lock_irq(&xas);
- xas_create_range(&xas);
- if (xas_error(&xas))
- goto unlock;
- for (i = 0; i < nr_pages; i++) {
- VM_BUG_ON_FOLIO(xas.xa_index != entry.val + i, folio);
- old = xas_load(&xas);
- if (old && !xa_is_value(old)) {
- VM_WARN_ON_ONCE_FOLIO(1, folio);
- xas_set_err(&xas, -EEXIST);
- goto unlock;
- }
- if (shadowp && xa_is_value(old) && !*shadowp)
- *shadowp = old;
- xas_store(&xas, folio);
- xas_next(&xas);
- }
- node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
- lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
-unlock:
- xas_unlock_irq(&xas);
- } while (xas_nomem(&xas, gfp));
-
- if (!xas_error(&xas))
- return 0;
-
- folio_clear_swapcache(folio);
- folio_ref_sub(folio, nr_pages);
- return xas_error(&xas);
-}
-
-/**
- * __swap_cache_del_folio - Removes a folio from the swap cache.
- * @folio: The folio.
- * @entry: The first swap entry that the folio corresponds to.
- * @shadow: shadow value to be filled in the swap cache.
- *
- * Removes a folio from the swap cache and fills a shadow in place.
- * This won't put the folio's refcount. The caller has to do that.
- *
- * Context: Caller must ensure the folio is locked and in the swap cache
- * using the index of @entry, and lock the swap cache xarray.
- */
-void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
-{
- long nr_pages = folio_nr_pages(folio);
- XA_STATE(xas, &swap_cache, entry.val);
- int i;
-
- VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
-
- for (i = 0; i < nr_pages; i++) {
- void *old = xas_store(&xas, shadow);
- VM_WARN_ON_FOLIO(old != folio, folio);
- xas_next(&xas);
- }
-
- folio->swap.val = 0;
- folio_clear_swapcache(folio);
- node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
- lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
-}
-
-/**
- * swap_cache_del_folio - Removes a folio from the swap cache.
- * @folio: The folio.
- *
- * Same as __swap_cache_del_folio, but handles lock and refcount. The
- * caller must ensure the folio is either clean or has a swap count
- * equal to zero, or it may cause data loss.
- *
- * Context: Caller must ensure the folio is locked and in the swap cache.
- */
-void swap_cache_del_folio(struct folio *folio)
-{
- swp_entry_t entry = folio->swap;
-
- xa_lock_irq(&swap_cache);
- __swap_cache_del_folio(folio, entry, NULL);
- xa_unlock_irq(&swap_cache);
-
- put_swap_folio(folio, entry);
- folio_ref_sub(folio, folio_nr_pages(folio));
-}
-
-/**
- * __swap_cache_replace_folio - Replace a folio in the swap cache.
- * @old: The old folio to be replaced.
- * @new: The new folio.
- *
- * Replace an existing folio in the swap cache with a new folio. The
- * caller is responsible for setting up the new folio's flag and swap
- * entries. Replacement will take the new folio's swap entry value as
- * the starting offset to override all slots covered by the new folio.
- *
- * Context: Caller must ensure both folios are locked, and lock the
- * swap cache xarray.
- */
-void __swap_cache_replace_folio(struct folio *old, struct folio *new)
-{
- swp_entry_t entry = new->swap;
- unsigned long nr_pages = folio_nr_pages(new);
- XA_STATE(xas, &swap_cache, entry.val);
- int i;
-
- VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
- VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
- VM_WARN_ON_ONCE(!entry.val);
-
- for (i = 0; i < nr_pages; i++) {
- void *old_entry = xas_store(&xas, new);
- WARN_ON_ONCE(!old_entry || xa_is_value(old_entry) || old_entry != old);
- xas_next(&xas);
- }
-}
-
-/**
- * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
- * @entry: The starting index entry.
- * @nr_ents: How many slots need to be cleared.
- *
- * Context: Caller must ensure the range is valid and all in one single cluster,
- * not occupied by any folio.
- */
-void swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
-{
- XA_STATE(xas, &swap_cache, entry.val);
- int i;
-
- xas_lock(&xas);
- for (i = 0; i < nr_ents; i++) {
- xas_store(&xas, NULL);
- xas_next(&xas);
- }
- xas_unlock(&xas);
-}
-
/*
* If we are the only user, then try to free up the swap cache.
*
@@ -497,9 +250,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
goto fail_unlock;
- /* May fail (-ENOMEM) if XArray node allocation failed. */
- if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
- goto fail_unlock;
+ swap_cache_add_folio(new_folio, entry, &shadow);
memcg1_swapin(entry, 1);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 558ff7f413786..c9ec1a1458b4e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -712,7 +712,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
BUG_ON(mapping != folio_mapping(folio));
if (folio_test_swapcache(folio)) {
- swap_cache_lock_irq();
+ swap_cache_lock_irq(folio->swap);
} else {
spin_lock(&mapping->host->i_lock);
xa_lock_irq(&mapping->i_pages);
@@ -759,7 +759,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
shadow = workingset_eviction(folio, target_memcg);
__swap_cache_del_folio(folio, swap, shadow);
memcg1_swapout(folio, swap);
- swap_cache_unlock_irq();
+ swap_cache_unlock_irq(swap);
put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
@@ -798,7 +798,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
cannot_free:
if (folio_test_swapcache(folio)) {
- swap_cache_unlock_irq();
+ swap_cache_unlock_irq(folio->swap);
} else {
xa_unlock_irq(&mapping->i_pages);
spin_unlock(&mapping->host->i_lock);
diff --git a/mm/vswap.c b/mm/vswap.c
index 9aa95558f320a..d44199dc059a3 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -37,9 +37,15 @@
* Swap descriptor - metadata of a swapped out page.
*
* @slot: The handle to the physical swap slot backing this page.
+ * @swap_cache: The folio in swap cache.
+ * @shadow: The shadow entry.
*/
struct swp_desc {
swp_slot_t slot;
+ union {
+ struct folio *swap_cache;
+ void *shadow;
+ };
};
#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER
@@ -170,6 +176,24 @@ static int vswap_debug_fs_init(void)
}
#endif
+/*
+ * Lockless version of vswap_iter - assumes caller holds cluster lock.
+ * Used when iterating within the same cluster with the lock already held.
+ */
+static struct swp_desc *__vswap_iter(struct vswap_cluster *cluster, unsigned long i)
+{
+ unsigned long slot_index;
+
+ lockdep_assert_held(&cluster->lock);
+ VM_WARN_ON(cluster->id != VSWAP_VAL_CLUSTER_IDX(i));
+
+ slot_index = VSWAP_IDX_WITHIN_CLUSTER_VAL(i);
+ if (test_bit(slot_index, cluster->bitmap))
+ return &cluster->descriptors[slot_index];
+
+ return NULL;
+}
+
static struct swp_desc *vswap_iter(struct vswap_cluster **clusterp, unsigned long i)
{
unsigned long cluster_id = VSWAP_VAL_CLUSTER_IDX(i);
@@ -448,7 +472,6 @@ void vswap_free(swp_entry_t entry, struct swap_cluster_info *ci)
if (!entry.val)
return;
- swap_cache_clear_shadow(entry, 1);
zswap_invalidate(entry);
mem_cgroup_uncharge_swap(entry, 1);
@@ -460,6 +483,10 @@ void vswap_free(swp_entry_t entry, struct swap_cluster_info *ci)
return;
}
+ /* Clear shadow if present */
+ if (xa_is_value(desc->shadow))
+ desc->shadow = NULL;
+
if (desc->slot.val)
vswap_rmap_set(ci, desc->slot, 0, 1);
@@ -480,7 +507,7 @@ int folio_alloc_swap(struct folio *folio)
struct vswap_cluster *cluster = NULL;
struct swap_info_struct *si;
struct swap_cluster_info *ci;
- int i, err, nr = folio_nr_pages(folio), order = folio_order(folio);
+ int i, nr = folio_nr_pages(folio), order = folio_order(folio);
struct swp_desc *desc;
swp_entry_t entry;
swp_slot_t slot;
@@ -533,9 +560,7 @@ int folio_alloc_swap(struct folio *folio)
if (mem_cgroup_try_charge_swap(folio, entry))
goto out_free;
- err = swap_cache_add_folio(folio, entry, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL);
- if (err)
- goto out_free;
+ swap_cache_add_folio(folio, entry, NULL);
return 0;
@@ -668,6 +693,321 @@ static int vswap_cpu_dead(unsigned int cpu)
return 0;
}
+/**
+ * swap_cache_lock - lock the swap cache for a swap entry
+ * @entry: the swap entry
+ *
+ * Locks the vswap cluster spinlock for the given swap entry.
+ */
+void swap_cache_lock(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster;
+ unsigned long cluster_id = VSWAP_CLUSTER_IDX(entry);
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster);
+ spin_lock(&cluster->lock);
+ rcu_read_unlock();
+}
+
+/**
+ * swap_cache_unlock - unlock the swap cache for a swap entry
+ * @entry: the swap entry
+ *
+ * Unlocks the vswap cluster spinlock for the given swap entry.
+ */
+void swap_cache_unlock(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster;
+ unsigned long cluster_id = VSWAP_CLUSTER_IDX(entry);
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster);
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+}
+
+/**
+ * swap_cache_lock_irq - lock the swap cache with interrupts disabled
+ * @entry: the swap entry
+ *
+ * Locks the vswap cluster spinlock and disables interrupts for the given swap entry.
+ */
+void swap_cache_lock_irq(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster;
+ unsigned long cluster_id = VSWAP_CLUSTER_IDX(entry);
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster);
+ spin_lock_irq(&cluster->lock);
+ rcu_read_unlock();
+}
+
+/**
+ * swap_cache_unlock_irq - unlock the swap cache with interrupts enabled
+ * @entry: the swap entry
+ *
+ * Unlocks the vswap cluster spinlock and enables interrupts for the given swap entry.
+ */
+void swap_cache_unlock_irq(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster;
+ unsigned long cluster_id = VSWAP_CLUSTER_IDX(entry);
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster);
+ spin_unlock_irq(&cluster->lock);
+ rcu_read_unlock();
+}
+
+/**
+ * swap_cache_get_folio - Looks up a folio in the swap cache.
+ * @entry: swap entry used for the lookup.
+ *
+ * A found folio will be returned unlocked and with its refcount increased.
+ *
+ * Context: Caller must ensure @entry is valid and protect the cluster with
+ * reference count or locks.
+ *
+ * Return: Returns the found folio on success, NULL otherwise. The caller
+ * must lock and check if the folio still matches the swap entry before
+ * use (e.g., folio_matches_swap_entry).
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ struct folio *folio;
+
+ for (;;) {
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return NULL;
+ }
+
+ /* Check if this is a shadow value (xa_is_value equivalent) */
+ if (xa_is_value(desc->shadow)) {
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return NULL;
+ }
+
+ folio = desc->swap_cache;
+ if (!folio) {
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return NULL;
+ }
+
+ if (likely(folio_try_get(folio))) {
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return folio;
+ }
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ }
+
+ return NULL;
+}
+
+/**
+ * swap_cache_get_shadow - Looks up a shadow in the swap cache.
+ * @entry: swap entry used for the lookup.
+ *
+ * Context: Caller must ensure @entry is valid and protect the cluster with
+ * reference count or locks.
+ *
+ * Return: Returns either NULL or an XA_VALUE (shadow).
+ */
+void *swap_cache_get_shadow(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ void *shadow;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return NULL;
+ }
+
+ shadow = desc->shadow;
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ if (xa_is_value(shadow))
+ return shadow;
+ return NULL;
+}
+
+/**
+ * swap_cache_add_folio - Add a folio into the swap cache.
+ * @folio: The folio to be added.
+ * @entry: The swap entry corresponding to the folio.
+ * @shadowp: If a shadow is found, return the shadow.
+ *
+ * Context: Caller must ensure @entry is valid and protect the cluster with
+ * reference count or locks.
+ *
+ * The caller also needs to update the corresponding swap_map slots with
+ * SWAP_HAS_CACHE bit to avoid race or conflict.
+ */
+void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+{
+ struct vswap_cluster *cluster;
+ unsigned long nr_pages = folio_nr_pages(folio);
+ unsigned long cluster_id = VSWAP_CLUSTER_IDX(entry);
+ unsigned long i;
+ struct swp_desc *desc;
+ void *old;
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+
+ folio_ref_add(folio, nr_pages);
+ folio_set_swapcache(folio);
+ folio->swap = entry;
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster);
+ spin_lock_irq(&cluster->lock);
+
+ for (i = 0; i < nr_pages; i++) {
+ desc = __vswap_iter(cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+ old = desc->shadow;
+
+ /* Warn if slot is already occupied by a folio */
+ VM_WARN_ON_FOLIO(old && !xa_is_value(old), folio);
+
+ /* Save shadow if found and not yet saved */
+ if (shadowp && xa_is_value(old) && !*shadowp)
+ *shadowp = old;
+
+ desc->swap_cache = folio;
+ }
+
+ spin_unlock_irq(&cluster->lock);
+ rcu_read_unlock();
+
+ node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+}
+
+/**
+ * __swap_cache_del_folio - Removes a folio from the swap cache.
+ * @folio: The folio.
+ * @entry: The first swap entry that the folio corresponds to.
+ * @shadow: shadow value to be filled in the swap cache.
+ *
+ * Removes a folio from the swap cache and fills a shadow in place.
+ * This won't put the folio's refcount. The caller has to do that.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache
+ * using the index of @entry, and lock the swap cache.
+ */
+void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow)
+{
+ long nr_pages = folio_nr_pages(folio);
+ struct vswap_cluster *cluster;
+ struct swp_desc *desc;
+ unsigned long cluster_id = VSWAP_CLUSTER_IDX(entry);
+ int i;
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster);
+
+ for (i = 0; i < nr_pages; i++) {
+ desc = __vswap_iter(cluster, entry.val + i);
+ VM_WARN_ON_FOLIO(!desc || desc->swap_cache != folio, folio);
+ desc->shadow = shadow;
+ }
+ rcu_read_unlock();
+
+ folio->swap.val = 0;
+ folio_clear_swapcache(folio);
+ node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
+}
+
+/**
+ * swap_cache_del_folio - Removes a folio from the swap cache.
+ * @folio: The folio.
+ *
+ * Same as __swap_cache_del_folio, but handles lock and refcount. The
+ * caller must ensure the folio is either clean or has a swap count
+ * equal to zero, or it may cause data loss.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ */
+void swap_cache_del_folio(struct folio *folio)
+{
+ swp_entry_t entry = folio->swap;
+
+ swap_cache_lock_irq(entry);
+ __swap_cache_del_folio(folio, entry, NULL);
+ swap_cache_unlock_irq(entry);
+
+ put_swap_folio(folio, entry);
+ folio_ref_sub(folio, folio_nr_pages(folio));
+}
+
+/**
+ * __swap_cache_replace_folio - Replace a folio in the swap cache.
+ * @old: The old folio to be replaced.
+ * @new: The new folio.
+ *
+ * Replace an existing folio in the swap cache with a new folio. The
+ * caller is responsible for setting up the new folio's flag and swap
+ * entries. Replacement will take the new folio's swap entry value as
+ * the starting offset to override all slots covered by the new folio.
+ *
+ * Context: Caller must ensure both folios are locked, and lock the
+ * swap cache.
+ */
+void __swap_cache_replace_folio(struct folio *old, struct folio *new)
+{
+ swp_entry_t entry = new->swap;
+ unsigned long nr_pages = folio_nr_pages(new);
+ struct vswap_cluster *cluster;
+ struct swp_desc *desc;
+ unsigned long cluster_id = VSWAP_CLUSTER_IDX(entry);
+ void *old_entry;
+ int i;
+
+ VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
+ VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
+ VM_WARN_ON_ONCE(!entry.val);
+
+ rcu_read_lock();
+ cluster = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster);
+
+ for (i = 0; i < nr_pages; i++) {
+ desc = __vswap_iter(cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+ old_entry = desc->swap_cache;
+ VM_WARN_ON(!old_entry || xa_is_value(old_entry) || old_entry != old);
+ desc->swap_cache = new;
+ }
+ rcu_read_unlock();
+}
int vswap_init(void)
{
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 11/20] zswap: move zswap entry management to the virtual swap descriptor
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (9 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 10/20] swap: move swap cache to virtual swap descriptor Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 12/20] swap: implement the swap_cgroup API using virtual swap Nhat Pham
` (10 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Remove the zswap tree and manage zswap entries directly
through the virtual swap descriptor. This range-partitions the zswap
entry metadata by virtual swap cluster, which eliminates zswap tree
lock contention.
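A minimal stand-alone illustration of the data structure change (the
struct and function names below are made up for the sketch; the real
helpers are zswap_entry_store()/load()/erase() added in this patch):
the zswap entry pointer now lives in the per-slot descriptor, so a
store can no longer fail with -ENOMEM the way xa_store() could, which
is why the corresponding error handling in zswap_store_page() goes
away.

    #include <stdio.h>
    #include <stddef.h>

    struct zswap_entry { int dummy; };

    /* the per-slot swap descriptor now carries the zswap entry directly */
    struct swp_desc {
            struct zswap_entry *zswap_entry;   /* was: a slot in the global zswap_tree */
    };

    /* mirrors zswap_entry_store(): returns whatever entry was there before */
    static struct zswap_entry *desc_store(struct swp_desc *d, struct zswap_entry *e)
    {
            struct zswap_entry *old = d->zswap_entry;

            d->zswap_entry = e;
            return old;
    }

    int main(void)
    {
            struct swp_desc desc = { NULL };
            struct zswap_entry a, b;

            printf("first store replaces  %p\n", (void *)desc_store(&desc, &a));  /* (nil) */
            printf("second store replaces %p\n", (void *)desc_store(&desc, &b));  /* &a */
            return 0;
    }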
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/zswap.h | 6 +++
mm/vswap.c | 100 ++++++++++++++++++++++++++++++++++++++++++
mm/zswap.c | 40 -----------------
3 files changed, 106 insertions(+), 40 deletions(-)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 1a04caf283dc8..7eb3ce7e124fc 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -6,6 +6,7 @@
#include <linux/mm_types.h>
struct lruvec;
+struct zswap_entry;
extern atomic_long_t zswap_stored_pages;
@@ -33,6 +34,11 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
bool zswap_is_enabled(void);
bool zswap_never_enabled(void);
+void *zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry);
+void *zswap_entry_load(swp_entry_t swpentry);
+void *zswap_entry_erase(swp_entry_t swpentry);
+bool zswap_empty(swp_entry_t swpentry);
+
#else
struct zswap_lruvec_state {};
diff --git a/mm/vswap.c b/mm/vswap.c
index d44199dc059a3..9bb733f00fd21 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -10,6 +10,7 @@
#include <linux/swapops.h>
#include <linux/swap_cgroup.h>
#include <linux/cpuhotplug.h>
+#include <linux/zswap.h>
#include "swap.h"
#include "swap_table.h"
@@ -37,11 +38,13 @@
* Swap descriptor - metadata of a swapped out page.
*
* @slot: The handle to the physical swap slot backing this page.
+ * @zswap_entry: The zswap entry associated with this swap slot.
* @swap_cache: The folio in swap cache.
* @shadow: The shadow entry.
*/
struct swp_desc {
swp_slot_t slot;
+ struct zswap_entry *zswap_entry;
union {
struct folio *swap_cache;
void *shadow;
@@ -238,6 +241,7 @@ static void __vswap_alloc_from_cluster(struct vswap_cluster *cluster, int start)
for (i = 0; i < nr; i++) {
desc = &cluster->descriptors[start + i];
desc->slot.val = 0;
+ desc->zswap_entry = NULL;
}
cluster->count += nr;
}
@@ -1009,6 +1013,102 @@ void __swap_cache_replace_folio(struct folio *old, struct folio *new)
rcu_read_unlock();
}
+#ifdef CONFIG_ZSWAP
+/**
+ * zswap_entry_store - store a zswap entry for a swap entry
+ * @swpentry: the swap entry
+ * @entry: the zswap entry to store
+ *
+ * Stores a zswap entry in the swap descriptor for the given swap entry.
+ * The cluster is locked during the store operation.
+ *
+ * Return: the old zswap entry if one existed, NULL otherwise
+ */
+void *zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ void *old;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, swpentry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return NULL;
+ }
+
+ old = desc->zswap_entry;
+ desc->zswap_entry = entry;
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ return old;
+}
+
+/**
+ * zswap_entry_load - load a zswap entry for a swap entry
+ * @swpentry: the swap entry
+ *
+ * Loads the zswap entry from the swap descriptor for the given swap entry.
+ *
+ * Return: the zswap entry if one exists, NULL otherwise
+ */
+void *zswap_entry_load(swp_entry_t swpentry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ void *zswap_entry;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, swpentry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return NULL;
+ }
+
+ zswap_entry = desc->zswap_entry;
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ return zswap_entry;
+}
+
+/**
+ * zswap_entry_erase - erase a zswap entry for a swap entry
+ * @swpentry: the swap entry
+ *
+ * Erases the zswap entry from the swap descriptor for the given swap entry.
+ * The cluster is locked during the erase operation.
+ *
+ * Return: the zswap entry that was erased, NULL if none existed
+ */
+void *zswap_entry_erase(swp_entry_t swpentry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ void *old;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, swpentry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return NULL;
+ }
+
+ old = desc->zswap_entry;
+ desc->zswap_entry = NULL;
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ return old;
+}
+
+bool zswap_empty(swp_entry_t swpentry)
+{
+ return xa_empty(&vswap_cluster_map);
+}
+#endif /* CONFIG_ZSWAP */
+
int vswap_init(void)
{
int i;
diff --git a/mm/zswap.c b/mm/zswap.c
index f7313261673ff..72441131f094e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -223,37 +223,6 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/
-static DEFINE_XARRAY(zswap_tree);
-
-#define zswap_tree_index(entry) (entry.val)
-
-static inline void *zswap_entry_store(swp_entry_t swpentry,
- struct zswap_entry *entry)
-{
- pgoff_t offset = zswap_tree_index(swpentry);
-
- return xa_store(&zswap_tree, offset, entry, GFP_KERNEL);
-}
-
-static inline void *zswap_entry_load(swp_entry_t swpentry)
-{
- pgoff_t offset = zswap_tree_index(swpentry);
-
- return xa_load(&zswap_tree, offset);
-}
-
-static inline void *zswap_entry_erase(swp_entry_t swpentry)
-{
- pgoff_t offset = zswap_tree_index(swpentry);
-
- return xa_erase(&zswap_tree, offset);
-}
-
-static inline bool zswap_empty(swp_entry_t swpentry)
-{
- return xa_empty(&zswap_tree);
-}
-
#define zswap_pool_debug(msg, p) \
pr_debug("%s pool %s\n", msg, (p)->tfm_name)
@@ -1445,13 +1414,6 @@ static bool zswap_store_page(struct page *page,
goto compress_failed;
old = zswap_entry_store(page_swpentry, entry);
- if (xa_is_err(old)) {
- int err = xa_err(old);
-
- WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
- zswap_reject_alloc_fail++;
- goto store_failed;
- }
/*
* We may have had an existing entry that became stale when
@@ -1498,8 +1460,6 @@ static bool zswap_store_page(struct page *page,
return true;
-store_failed:
- zs_free(pool->zs_pool, entry->handle);
compress_failed:
zswap_entry_cache_free(entry);
return false;
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 12/20] swap: implement the swap_cgroup API using virtual swap
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (10 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 11/20] zswap: move zswap entry management to the " Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 13/20] swap: manage swap entry lifecycle at the virtual swap layer Nhat Pham
` (9 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Once we decouple a swap entry from its backing store via the virtual
swap layer, we can no longer statically allocate an array to store the
swap entries' cgroup information. Move it to the swap descriptor.
Note that the memory overhead for swap cgroup information is now
incurred on demand, i.e. only when the virtual swap cluster is
allocated. This helps reduce the memory overhead for a huge but
sparsely used swap space.
For instance, a 2 TB swapfile contains 536870912 swap slots, each
incurring 2 bytes of overhead for swap cgroup tracking, for a total of
1 GB. If we only utilize 10% of the swapfile, we save roughly 900 MB.
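A quick back-of-the-envelope check of the numbers above (userspace
sketch; assumes 4 KiB pages, 2 bytes of cgroup id per slot, and treats
2 TB as 2 TiB):

    #include <stdio.h>

    int main(void)
    {
            unsigned long long swap_bytes = 2ULL << 40;             /* 2 TiB swapfile */
            unsigned long long slots = swap_bytes / 4096;           /* 4 KiB per slot */
            unsigned long long static_map = slots * 2;              /* 2 bytes per slot */
            unsigned long long on_demand = static_map / 10;         /* ~10% of slots in use */

            printf("slots:          %llu\n", slots);                /* 536870912 */
            printf("static map:     %llu MiB\n", static_map >> 20); /* 1024 MiB, i.e. ~1 GB */
            printf("on-demand cost: %llu MiB\n", on_demand >> 20);  /* ~102 MiB */
            printf("saved:          %llu MiB\n",
                   (static_map - on_demand) >> 20);                 /* ~921 MiB, i.e. ~900 MB */
            return 0;
    }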
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap_cgroup.h | 13 ---
mm/Makefile | 3 -
mm/swap_cgroup.c | 174 ------------------------------------
mm/swapfile.c | 7 --
mm/vswap.c | 95 ++++++++++++++++++++
5 files changed, 95 insertions(+), 197 deletions(-)
delete mode 100644 mm/swap_cgroup.c
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 91cdf12190a03..a2abb4d6fa085 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -9,8 +9,6 @@
extern void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_t ent);
extern unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents);
extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
-extern int swap_cgroup_swapon(int type, unsigned long max_pages);
-extern void swap_cgroup_swapoff(int type);
#else
@@ -31,17 +29,6 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
return 0;
}
-static inline int
-swap_cgroup_swapon(int type, unsigned long max_pages)
-{
- return 0;
-}
-
-static inline void swap_cgroup_swapoff(int type)
-{
- return;
-}
-
#endif
#endif /* __LINUX_SWAP_CGROUP_H */
diff --git a/mm/Makefile b/mm/Makefile
index 67fa4586e7e18..a7538784191bf 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -103,9 +103,6 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_LIVEUPDATE) += memfd_luo.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
-ifdef CONFIG_SWAP
-obj-$(CONFIG_MEMCG) += swap_cgroup.o
-endif
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_GUP_TEST) += gup_test.o
obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
deleted file mode 100644
index 77ce1d66c318d..0000000000000
--- a/mm/swap_cgroup.c
+++ /dev/null
@@ -1,174 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/swap_cgroup.h>
-#include <linux/vmalloc.h>
-#include <linux/mm.h>
-
-#include <linux/swapops.h> /* depends on mm.h include */
-
-static DEFINE_MUTEX(swap_cgroup_mutex);
-
-/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
-#define ID_PER_SC (sizeof(struct swap_cgroup) / sizeof(unsigned short))
-#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
-#define ID_MASK (BIT(ID_SHIFT) - 1)
-struct swap_cgroup {
- atomic_t ids;
-};
-
-struct swap_cgroup_ctrl {
- struct swap_cgroup *map;
-};
-
-static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
-
-static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
- pgoff_t offset)
-{
- unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
- unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
-
- BUILD_BUG_ON(!is_power_of_2(ID_PER_SC));
- BUILD_BUG_ON(sizeof(struct swap_cgroup) != sizeof(atomic_t));
-
- return (old_ids >> shift) & ID_MASK;
-}
-
-static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
- pgoff_t offset,
- unsigned short new_id)
-{
- unsigned short old_id;
- struct swap_cgroup *sc = &map[offset / ID_PER_SC];
- unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
- unsigned int new_ids, old_ids = atomic_read(&sc->ids);
-
- do {
- old_id = (old_ids >> shift) & ID_MASK;
- new_ids = (old_ids & ~(ID_MASK << shift));
- new_ids |= ((unsigned int)new_id) << shift;
- } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
-
- return old_id;
-}
-
-/**
- * swap_cgroup_record - record mem_cgroup for a set of swap entries.
- * These entries must belong to one single folio, and that folio
- * must be being charged for swap space (swap out), and these
- * entries must not have been charged
- *
- * @folio: the folio that the swap entry belongs to
- * @id: mem_cgroup ID to be recorded
- * @ent: the first swap entry to be recorded
- */
-void swap_cgroup_record(struct folio *folio, unsigned short id,
- swp_entry_t ent)
-{
- unsigned int nr_ents = folio_nr_pages(folio);
- swp_slot_t slot = swp_entry_to_swp_slot(ent);
- struct swap_cgroup *map;
- pgoff_t offset, end;
- unsigned short old;
-
- offset = swp_slot_offset(slot);
- end = offset + nr_ents;
- map = swap_cgroup_ctrl[swp_slot_type(slot)].map;
-
- do {
- old = __swap_cgroup_id_xchg(map, offset, id);
- VM_BUG_ON(old);
- } while (++offset != end);
-}
-
-/**
- * swap_cgroup_clear - clear mem_cgroup for a set of swap entries.
- * These entries must be being uncharged from swap. They either
- * belongs to one single folio in the swap cache (swap in for
- * cgroup v1), or no longer have any users (slot freeing).
- *
- * @ent: the first swap entry to be recorded into
- * @nr_ents: number of swap entries to be recorded
- *
- * Returns the existing old value.
- */
-unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(ent);
- pgoff_t offset = swp_slot_offset(slot);
- pgoff_t end = offset + nr_ents;
- struct swap_cgroup *map;
- unsigned short old, iter = 0;
-
- map = swap_cgroup_ctrl[swp_slot_type(slot)].map;
-
- do {
- old = __swap_cgroup_id_xchg(map, offset, 0);
- if (!iter)
- iter = old;
- VM_BUG_ON(iter != old);
- } while (++offset != end);
-
- return old;
-}
-
-/**
- * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
- * @ent: swap entry to be looked up.
- *
- * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
- */
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
-{
- struct swap_cgroup_ctrl *ctrl;
- swp_slot_t slot = swp_entry_to_swp_slot(ent);
-
- if (mem_cgroup_disabled())
- return 0;
-
- ctrl = &swap_cgroup_ctrl[swp_slot_type(slot)];
- return __swap_cgroup_id_lookup(ctrl->map, swp_slot_offset(slot));
-}
-
-int swap_cgroup_swapon(int type, unsigned long max_pages)
-{
- struct swap_cgroup *map;
- struct swap_cgroup_ctrl *ctrl;
-
- if (mem_cgroup_disabled())
- return 0;
-
- BUILD_BUG_ON(sizeof(unsigned short) * ID_PER_SC !=
- sizeof(struct swap_cgroup));
- map = vzalloc(DIV_ROUND_UP(max_pages, ID_PER_SC) *
- sizeof(struct swap_cgroup));
- if (!map)
- goto nomem;
-
- ctrl = &swap_cgroup_ctrl[type];
- mutex_lock(&swap_cgroup_mutex);
- ctrl->map = map;
- mutex_unlock(&swap_cgroup_mutex);
-
- return 0;
-nomem:
- pr_info("couldn't allocate enough memory for swap_cgroup\n");
- pr_info("swap_cgroup can be disabled by swapaccount=0 boot option\n");
- return -ENOMEM;
-}
-
-void swap_cgroup_swapoff(int type)
-{
- struct swap_cgroup *map;
- struct swap_cgroup_ctrl *ctrl;
-
- if (mem_cgroup_disabled())
- return;
-
- mutex_lock(&swap_cgroup_mutex);
- ctrl = &swap_cgroup_ctrl[type];
- map = ctrl->map;
- ctrl->map = NULL;
- mutex_unlock(&swap_cgroup_mutex);
-
- vfree(map);
-}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 68ec5d9f05848..345877786e432 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2931,8 +2931,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
vfree(swap_map);
kvfree(zeromap);
free_cluster_info(cluster_info, maxpages);
- /* Destroy swap account information */
- swap_cgroup_swapoff(p->type);
inode = mapping->host;
@@ -3497,10 +3495,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
- error = swap_cgroup_swapon(si->type, maxpages);
- if (error)
- goto bad_swap_unlock_inode;
-
error = setup_swap_map(si, swap_header, swap_map, maxpages);
if (error)
goto bad_swap_unlock_inode;
@@ -3605,7 +3599,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
si->global_cluster = NULL;
inode = NULL;
destroy_swap_extents(si);
- swap_cgroup_swapoff(si->type);
spin_lock(&swap_lock);
si->swap_file = NULL;
si->flags = 0;
diff --git a/mm/vswap.c b/mm/vswap.c
index 9bb733f00fd21..64747493ca9f7 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -41,6 +41,7 @@
* @zswap_entry: The zswap entry associated with this swap slot.
* @swap_cache: The folio in swap cache.
* @shadow: The shadow entry.
+ * @memcgid: The memcg id of the owning memcg, if any.
*/
struct swp_desc {
swp_slot_t slot;
@@ -49,6 +50,9 @@ struct swp_desc {
struct folio *swap_cache;
void *shadow;
};
+#ifdef CONFIG_MEMCG
+ unsigned short memcgid;
+#endif
};
#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER
@@ -242,6 +246,9 @@ static void __vswap_alloc_from_cluster(struct vswap_cluster *cluster, int start)
desc = &cluster->descriptors[start + i];
desc->slot.val = 0;
desc->zswap_entry = NULL;
+#ifdef CONFIG_MEMCG
+ desc->memcgid = 0;
+#endif
}
cluster->count += nr;
}
@@ -1109,6 +1116,94 @@ bool zswap_empty(swp_entry_t swpentry)
}
#endif /* CONFIG_ZSWAP */
+#ifdef CONFIG_MEMCG
+static unsigned short vswap_cgroup_record(swp_entry_t entry,
+ unsigned short memcgid, unsigned int nr_ents)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ unsigned short oldid, iter = 0;
+ int i;
+
+ rcu_read_lock();
+ for (i = 0; i < nr_ents; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+ oldid = desc->memcgid;
+ desc->memcgid = memcgid;
+ if (!iter)
+ iter = oldid;
+ VM_WARN_ON(iter != oldid);
+ }
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ return oldid;
+}
+
+/**
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries.
+ * These entries must belong to one single folio, and that folio
+ * must be being charged for swap space (swap out), and these
+ * entries must not have been charged
+ *
+ * @folio: the folio that the swap entry belongs to
+ * @memcgid: mem_cgroup ID to be recorded
+ * @entry: the first swap entry to be recorded
+ */
+void swap_cgroup_record(struct folio *folio, unsigned short memcgid,
+ swp_entry_t entry)
+{
+ unsigned short oldid =
+ vswap_cgroup_record(entry, memcgid, folio_nr_pages(folio));
+
+ VM_WARN_ON(oldid);
+}
+
+/**
+ * swap_cgroup_clear - clear mem_cgroup for a set of swap entries.
+ * These entries must be being uncharged from swap. They either
+ * belongs to one single folio in the swap cache (swap in for
+ * cgroup v1), or no longer have any users (slot freeing).
+ *
+ * @entry: the first swap entry to be recorded into
+ * @nr_ents: number of swap entries to be recorded
+ *
+ * Returns the existing old value.
+ */
+unsigned short swap_cgroup_clear(swp_entry_t entry, unsigned int nr_ents)
+{
+ return vswap_cgroup_record(entry, 0, nr_ents);
+}
+
+/**
+ * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
+ * @entry: swap entry to be looked up.
+ *
+ * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
+ */
+unsigned short lookup_swap_cgroup_id(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ unsigned short ret;
+
+ /*
+ * Note that the virtual swap slot can be freed under us, for instance in
+ * the invocation of mem_cgroup_swapin_charge_folio. We need to wrap the
+ * entire lookup in RCU read-side critical section, and double check the
+ * existence of the swap descriptor.
+ */
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ ret = desc ? desc->memcgid : 0;
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return ret;
+}
+#endif /* CONFIG_MEMCG */
+
int vswap_init(void)
{
int i;
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 13/20] swap: manage swap entry lifecycle at the virtual swap layer
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (11 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 12/20] swap: implement the swap_cgroup API using virtual swap Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store Nhat Pham
` (8 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
This patch moves the swap entry lifecycle management to the virtual swap
layer by adding to the swap descriptor two fields:
1. in_swapcache, i.e. whether the swap entry is in the swap cache (or
about to be added to it).
2. The swap count of the swap entry, which counts the number of page
table entries that reference the swap entry.
and re-implementing all of the swap entry lifecycle API
(swap_duplicate(), swap_free_nr(), swapcache_prepare(), etc.) in the
virtual swap layer.
For now, we do not implement swap count continuation - the swap_count
field in the swap descriptor is large enough to hold the maximum
possible swap count. This vastly simplifies the logic.
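A toy userspace model of the per-descriptor state described above (the
field names, widths, and the exact free condition are illustrative
assumptions; the real accounting lives in mm/vswap.c):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct desc {
            unsigned int swap_count;  /* page table entries referencing the entry */
            bool in_swapcache;        /* entry is in (or about to enter) the swap cache */
    };

    /* reclaimable once no PTE and no swap cache reference remains */
    static bool can_free(const struct desc *d)
    {
            return d->swap_count == 0 && !d->in_swapcache;
    }

    int main(void)
    {
            /* page swapped out from one mapping, folio still in the swap cache */
            struct desc d = { .swap_count = 1, .in_swapcache = true };

            d.swap_count++;           /* fork: swap_duplicate() */
            d.in_swapcache = false;   /* folio reclaimed from the swap cache */
            d.swap_count--;           /* one mapping swaps the page back in */
            assert(!can_free(&d));    /* the other PTE still holds the entry */
            d.swap_count--;           /* last reference dropped: swap_free_nr() */
            assert(can_free(&d));
            printf("descriptor is now reclaimable\n");
            return 0;
    }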
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 29 +-
include/linux/zswap.h | 5 +-
mm/memory.c | 8 +-
mm/shmem.c | 4 +-
mm/swap.h | 58 ++--
mm/swap_state.c | 4 +-
mm/swapfile.c | 786 ++----------------------------------------
mm/vswap.c | 452 ++++++++++++++++++++++--
mm/zswap.c | 14 +-
9 files changed, 502 insertions(+), 858 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0410a00fd353c..aae2e502d9975 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -223,17 +223,9 @@ enum {
#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
-/* Bit flag in swap_map */
-#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
-#define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */
-
-/* Special value in first swap_map */
-#define SWAP_MAP_MAX 0x3e /* Max count */
-#define SWAP_MAP_BAD 0x3f /* Note page is bad */
-#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */
-
-/* Special value in each swap_map continuation */
-#define SWAP_CONT_MAX 0x7f /* Max count */
+/* Swapfile's swap map state*/
+#define SWAP_MAP_ALLOCATED 0x01 /* Page is allocated */
+#define SWAP_MAP_BAD 0x02 /* Page is bad */
/*
* The first page in the swap file is the swap header, which is always marked
@@ -423,7 +415,7 @@ extern void __meminit kswapd_stop(int nid);
#ifdef CONFIG_SWAP
-/* Lifecycle swap API (mm/swapfile.c) */
+/* Lifecycle swap API (mm/swapfile.c and mm/vswap.c) */
int folio_alloc_swap(struct folio *folio);
bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
@@ -433,7 +425,7 @@ int swapcache_prepare(swp_entry_t entry, int nr);
void swap_free_nr(swp_entry_t entry, int nr_pages);
void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int __swap_count(swp_entry_t entry);
-bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
+bool swap_entry_swapped(swp_entry_t entry);
int swp_swapcount(swp_entry_t entry);
bool is_swap_cached(swp_entry_t entry);
@@ -473,7 +465,6 @@ static inline long get_nr_swap_pages(void)
void si_swapinfo(struct sysinfo *);
int swap_slot_alloc(swp_slot_t *slot, unsigned int order);
swp_slot_t swap_slot_alloc_of_type(int);
-int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
unsigned int count_swap_pages(int, int);
@@ -517,11 +508,6 @@ static inline void free_swap_cache(struct folio *folio)
{
}
-static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
-{
- return 0;
-}
-
static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
{
}
@@ -549,7 +535,7 @@ static inline int __swap_count(swp_entry_t entry)
return 0;
}
-static inline bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
+static inline bool swap_entry_swapped(swp_entry_t entry)
{
return false;
}
@@ -672,11 +658,12 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
int vswap_init(void);
void vswap_exit(void);
-void vswap_free(swp_entry_t entry, struct swap_cluster_info *ci);
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry);
swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot);
bool tryget_swap_entry(swp_entry_t entry, struct swap_info_struct **si);
void put_swap_entry(swp_entry_t entry, struct swap_info_struct *si);
+bool folio_swapped(struct folio *folio);
+bool vswap_only_has_cache(swp_entry_t entry, int nr);
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 7eb3ce7e124fc..07b2936c38f29 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -28,7 +28,6 @@ struct zswap_lruvec_state {
unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
int zswap_load(struct folio *folio);
-void zswap_invalidate(swp_entry_t swp);
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
@@ -38,6 +37,7 @@ void *zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry);
void *zswap_entry_load(swp_entry_t swpentry);
void *zswap_entry_erase(swp_entry_t swpentry);
bool zswap_empty(swp_entry_t swpentry);
+void zswap_entry_free(struct zswap_entry *entry);
#else
@@ -53,7 +53,6 @@ static inline int zswap_load(struct folio *folio)
return -ENOENT;
}
-static inline void zswap_invalidate(swp_entry_t swp) {}
static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
static inline void zswap_folio_swapin(struct folio *folio) {}
@@ -68,6 +67,8 @@ static inline bool zswap_never_enabled(void)
return true;
}
+static inline void zswap_entry_free(struct zswap_entry *entry) {}
+
#endif
#endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memory.c b/mm/memory.c
index 90031f833f52e..641e3f65edc00 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1333,10 +1333,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (ret == -EIO) {
VM_WARN_ON_ONCE(!entry.val);
- if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
- ret = -ENOMEM;
- goto out;
- }
entry.val = 0;
} else if (ret == -EBUSY || unlikely(ret == -EHWPOISON)) {
goto out;
@@ -5044,7 +5040,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
out:
/* Clear the swap cache pin for direct swapin after PTL unlock */
if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
+ swapcache_clear(entry, nr_pages);
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
@@ -5063,7 +5059,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_put(swapcache);
}
if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
+ swapcache_clear(entry, nr_pages);
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 66cf8af6779ca..780571c830e5b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2442,7 +2442,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (skip_swapcache) {
folio->swap.val = 0;
- swapcache_clear(si, swap, nr_pages);
+ swapcache_clear(swap, nr_pages);
} else {
swap_cache_del_folio(folio);
}
@@ -2463,7 +2463,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_unlock(folio);
failed_nolock:
if (skip_swapcache)
- swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+ swapcache_clear(folio->swap, folio_nr_pages(folio));
if (folio)
folio_put(folio);
put_swap_entry(swap, si);
diff --git a/mm/swap.h b/mm/swap.h
index 57ed24a2d6356..ae97cf9712c5c 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -211,6 +211,8 @@ void swap_cache_lock_irq(swp_entry_t entry);
void swap_cache_unlock_irq(swp_entry_t entry);
void swap_cache_lock(swp_entry_t entry);
void swap_cache_unlock(swp_entry_t entry);
+void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
+ unsigned long vswap, int nr);
static inline struct address_space *swap_address_space(swp_entry_t entry)
{
@@ -245,6 +247,31 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
return folio_entry.val == round_down(entry.val, nr_pages);
}
+/**
+ * folio_matches_swap_slot - Check if a folio matches both the virtual
+ * swap entry and its backing physical swap slot.
+ * @folio: The folio.
+ * @entry: The virtual swap entry to check against.
+ * @slot: The physical swap slot to check against.
+ *
+ * Context: The caller should have the folio locked to ensure it's stable
+ * and nothing will move it in or out of the swap cache.
+ * Return: true if both checks pass, false otherwise.
+ */
+static inline bool folio_matches_swap_slot(const struct folio *folio,
+ swp_entry_t entry,
+ swp_slot_t slot)
+{
+ if (!folio_matches_swap_entry(folio, entry))
+ return false;
+
+ /*
+ * Confirm the virtual swap entry is still backed by the same
+ * physical swap slot.
+ */
+ return slot.val == swp_entry_to_swp_slot(entry).val;
+}
+
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stablize the device by any of the following ways:
@@ -265,7 +292,7 @@ void __swap_cache_del_folio(struct folio *folio, swp_entry_t entry, void *shadow
void __swap_cache_replace_folio(struct folio *old, struct folio *new);
void show_swap_cache_info(void);
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
+void swapcache_clear(swp_entry_t entry, int nr);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
@@ -312,25 +339,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
return find_next_bit(sis->zeromap, end, start) - start;
}
-static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *si = __swap_slot_to_info(slot);
- pgoff_t offset = swp_slot_offset(slot);
- int i;
-
- /*
- * While allocating a large folio and doing mTHP swapin, we need to
- * ensure all entries are not cached, otherwise, the mTHP folio will
- * be in conflict with the folio in swap cache.
- */
- for (i = 0; i < max_nr; i++) {
- if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
- return i;
- }
-
- return i;
-}
+int non_swapcache_batch(swp_entry_t entry, int max_nr);
#else /* CONFIG_SWAP */
struct swap_iocb;
@@ -382,6 +391,13 @@ static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry
return false;
}
+static inline bool folio_matches_swap_slot(const struct folio *folio,
+ swp_entry_t entry,
+ swp_slot_t slot)
+{
+ return false;
+}
+
static inline void show_swap_cache_info(void)
{
}
@@ -409,7 +425,7 @@ static inline int swap_writeout(struct folio *folio,
return 0;
}
-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
+static inline void swapcache_clear(swp_entry_t entry, int nr)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 00fa3e76a5c19..1827527e88d33 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -174,8 +174,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists)
{
- struct swap_info_struct *si =
- __swap_slot_to_info(swp_entry_to_swp_slot(entry));
struct folio *folio;
struct folio *new_folio = NULL;
struct folio *result = NULL;
@@ -196,7 +194,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/*
* Just skip read ahead for unused swap slot.
*/
- if (!swap_entry_swapped(si, entry))
+ if (!swap_entry_swapped(entry))
goto put_and_return;
/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 345877786e432..6c5e46bf40701 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -50,9 +50,6 @@
#include "internal.h"
#include "swap.h"
-static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
- unsigned char);
-static void free_swap_count_continuations(struct swap_info_struct *);
static void swap_slots_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
swp_slot_t slot, unsigned int nr_pages);
@@ -146,7 +143,7 @@ static struct swap_info_struct *swap_slot_to_info(swp_slot_t slot)
static inline unsigned char swap_count(unsigned char ent)
{
- return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
+ return ent;
}
/*
@@ -182,52 +179,14 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
static bool swap_only_has_cache(struct swap_info_struct *si,
unsigned long offset, int nr_pages)
{
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
-
- do {
- VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
- if (*map != SWAP_HAS_CACHE)
- return false;
- } while (++map < map_end);
+ swp_entry_t entry = swp_slot_to_swp_entry(swp_slot(si->type, offset));
- return true;
+ return vswap_only_has_cache(entry, nr_pages);
}
-/**
- * is_swap_cached - check if the swap entry is cached
- * @entry: swap entry to check
- *
- * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
- *
- * Returns true if the swap entry is cached, false otherwise.
- */
-bool is_swap_cached(swp_entry_t entry)
+static bool swap_cache_only(struct swap_info_struct *si, unsigned long offset)
{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *si = swap_slot_to_info(slot);
- unsigned long offset = swp_slot_offset(slot);
-
- return READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE;
-}
-
-static bool swap_is_last_map(struct swap_info_struct *si,
- unsigned long offset, int nr_pages, bool *has_cache)
-{
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
- unsigned char count = *map;
-
- if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
- return false;
-
- while (++map < map_end) {
- if (*map != count)
- return false;
- }
-
- *has_cache = !!(count & SWAP_HAS_CACHE);
- return true;
+ return swap_only_has_cache(si, offset, 1);
}
/*
@@ -238,15 +197,15 @@ static bool swap_is_last_map(struct swap_info_struct *si,
static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
- const swp_entry_t entry =
- swp_slot_to_swp_entry(swp_slot(si->type, offset));
- swp_slot_t slot;
+ const swp_slot_t slot = swp_slot(si->type, offset);
+ swp_entry_t entry;
struct swap_cluster_info *ci;
struct folio *folio;
int ret, nr_pages;
bool need_reclaim;
again:
+ entry = swp_slot_to_swp_entry(slot);
folio = swap_cache_get_folio(entry);
if (!folio)
return 0;
@@ -266,14 +225,15 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
/*
* Offset could point to the middle of a large folio, or folio
* may no longer point to the expected offset before it's locked.
+ * Additionally, the virtual swap entry may no longer be backed
+ * by the same physical swap slot.
*/
- if (!folio_matches_swap_entry(folio, entry)) {
+ if (!folio_matches_swap_slot(folio, entry, slot)) {
folio_unlock(folio);
folio_put(folio);
goto again;
}
- slot = swp_entry_to_swp_slot(folio->swap);
- offset = swp_slot_offset(slot);
+ offset = swp_slot_offset(swp_entry_to_swp_slot(folio->swap));
need_reclaim = ((flags & TTRS_ANYWAY) ||
((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
@@ -283,8 +243,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
/*
* It's safe to delete the folio from swap cache only if the folio's
- * swap_map is HAS_CACHE only, which means the slots have no page table
- * reference or pending writeback, and can't be allocated to others.
+ * swap slots have no page table reference or pending writeback.
*/
ci = swap_cluster_lock(si, offset);
need_reclaim = swap_only_has_cache(si, offset, nr_pages);
@@ -811,7 +770,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
case 0:
offset++;
break;
- case SWAP_HAS_CACHE:
+ case SWAP_MAP_ALLOCATED:
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
if (nr_reclaim > 0)
offset += nr_reclaim;
@@ -842,22 +801,23 @@ static bool cluster_scan_range(struct swap_info_struct *si,
{
unsigned long offset, end = start + nr_pages;
unsigned char *map = si->swap_map;
+ unsigned char count;
if (cluster_is_empty(ci))
return true;
for (offset = start; offset < end; offset++) {
- switch (READ_ONCE(map[offset])) {
- case 0:
+ count = READ_ONCE(map[offset]);
+ if (!count)
continue;
- case SWAP_HAS_CACHE:
+
+ if (swap_cache_only(si, offset)) {
if (!vm_swap_full())
return false;
*need_reclaim = true;
continue;
- default:
- return false;
}
+ return false;
}
return true;
@@ -974,7 +934,6 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
long to_scan = 1;
unsigned long offset, end;
struct swap_cluster_info *ci;
- unsigned char *map = si->swap_map;
int nr_reclaim;
if (force)
@@ -986,7 +945,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+ if (swap_cache_only(si, offset)) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
@@ -1320,7 +1279,8 @@ static bool swap_alloc_fast(swp_slot_t *slot, int order)
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
+ found = alloc_swap_scan_cluster(si, ci, offset, order,
+ SWAP_MAP_ALLOCATED);
if (found)
*slot = swp_slot(si->type, found);
} else {
@@ -1344,7 +1304,7 @@ static void swap_alloc_slow(swp_slot_t *slot, int order)
plist_requeue(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
- offset = cluster_alloc_swap_slot(si, order, SWAP_HAS_CACHE);
+ offset = cluster_alloc_swap_slot(si, order, SWAP_MAP_ALLOCATED);
swap_slot_put_swap_info(si);
if (offset) {
*slot = swp_slot(si->type, offset);
@@ -1471,48 +1431,6 @@ static struct swap_info_struct *_swap_info_get(swp_slot_t slot)
return NULL;
}
-static unsigned char swap_slot_put_locked(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_slot_t slot,
- unsigned char usage)
-{
- unsigned long offset = swp_slot_offset(slot);
- unsigned char count;
- unsigned char has_cache;
-
- count = si->swap_map[offset];
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE) {
- VM_BUG_ON(!has_cache);
- has_cache = 0;
- } else if (count == SWAP_MAP_SHMEM) {
- /*
- * Or we could insist on shmem.c using a special
- * swap_shmem_free() and free_shmem_swap_and_cache()...
- */
- count = 0;
- } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
- if (count == COUNT_CONTINUED) {
- if (swap_count_continued(si, offset, count))
- count = SWAP_MAP_MAX | COUNT_CONTINUED;
- else
- count = SWAP_MAP_MAX;
- } else
- count--;
- }
-
- usage = count | has_cache;
- if (usage)
- WRITE_ONCE(si->swap_map[offset], usage);
- else
- swap_slots_free(si, ci, slot, 1);
-
- return usage;
-}
-
/*
* When we get a swap entry, if there aren't some other ways to
* prevent swapoff, such as the folio in swap cache is locked, RCU
@@ -1580,94 +1498,23 @@ struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot)
return NULL;
}
-static void swap_slots_put_cache(struct swap_info_struct *si,
- swp_slot_t slot, int nr)
-{
- unsigned long offset = swp_slot_offset(slot);
- struct swap_cluster_info *ci;
-
- ci = swap_cluster_lock(si, offset);
- if (swap_only_has_cache(si, offset, nr)) {
- swap_slots_free(si, ci, slot, nr);
- } else {
- for (int i = 0; i < nr; i++, slot.val++)
- swap_slot_put_locked(si, ci, slot, SWAP_HAS_CACHE);
- }
- swap_cluster_unlock(ci);
-}
-
static bool swap_slots_put_map(struct swap_info_struct *si,
swp_slot_t slot, int nr)
{
unsigned long offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
- bool has_cache = false;
- unsigned char count;
- int i;
-
- if (nr <= 1)
- goto fallback;
- count = swap_count(data_race(si->swap_map[offset]));
- if (count != 1 && count != SWAP_MAP_SHMEM)
- goto fallback;
ci = swap_cluster_lock(si, offset);
- if (!swap_is_last_map(si, offset, nr, &has_cache)) {
- goto locked_fallback;
- }
- if (!has_cache)
- swap_slots_free(si, ci, slot, nr);
- else
- for (i = 0; i < nr; i++)
- WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
+ vswap_rmap_set(ci, slot, 0, nr);
+ swap_slots_free(si, ci, slot, nr);
swap_cluster_unlock(ci);
- return has_cache;
-
-fallback:
- ci = swap_cluster_lock(si, offset);
-locked_fallback:
- for (i = 0; i < nr; i++, slot.val++) {
- count = swap_slot_put_locked(si, ci, slot, 1);
- if (count == SWAP_HAS_CACHE)
- has_cache = true;
- }
- swap_cluster_unlock(ci);
- return has_cache;
-}
-
-/*
- * Only functions with "_nr" suffix are able to free entries spanning
- * cross multi clusters, so ensure the range is within a single cluster
- * when freeing entries with functions without "_nr" suffix.
- */
-static bool swap_slots_put_map_nr(struct swap_info_struct *si,
- swp_slot_t slot, int nr)
-{
- int cluster_nr, cluster_rest;
- unsigned long offset = swp_slot_offset(slot);
- bool has_cache = false;
-
- cluster_rest = SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER;
- while (nr) {
- cluster_nr = min(nr, cluster_rest);
- has_cache |= swap_slots_put_map(si, slot, cluster_nr);
- cluster_rest = SWAPFILE_CLUSTER;
- nr -= cluster_nr;
- slot.val += cluster_nr;
- }
-
- return has_cache;
+ return true;
}
-/*
- * Check if it's the last ref of swap entry in the freeing path.
- * Qualified value includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM.
- */
static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
{
- return (count == SWAP_HAS_CACHE) || (count == 1) ||
- (count == SWAP_MAP_SHMEM);
+ return count == SWAP_MAP_ALLOCATED;
}
/*
@@ -1681,14 +1528,6 @@ static void swap_slots_free(struct swap_info_struct *si,
unsigned long offset = swp_slot_offset(slot);
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
- swp_entry_t entry = swp_slot_to_swp_entry(slot);
- int i;
-
- /* release all the associated (virtual) swap slots */
- for (i = 0; i < nr_pages; i++) {
- vswap_free(entry, ci);
- entry.val++;
- }
/* It should never free entries across different clusters */
VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
@@ -1731,149 +1570,6 @@ void swap_slot_free_nr(swp_slot_t slot, int nr_pages)
}
}
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free_nr(swp_entry_t entry, int nr_pages)
-{
- swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages);
-}
-
-/*
- * Called after dropping swapcache to decrease refcnt to swap entries.
- */
-void put_swap_folio(struct folio *folio, swp_entry_t entry)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *si;
- int size = 1 << swap_slot_order(folio_order(folio));
-
- si = _swap_info_get(slot);
- if (!si)
- return;
-
- swap_slots_put_cache(si, slot, size);
-}
-
-int __swap_count(swp_entry_t entry)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *si = __swap_slot_to_info(slot);
- pgoff_t offset = swp_slot_offset(slot);
-
- return swap_count(si->swap_map[offset]);
-}
-
-/*
- * How many references to @entry are currently swapped out?
- * This does not give an exact answer when swap count is continued,
- * but does include the high COUNT_CONTINUED flag to allow for that.
- */
-bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- pgoff_t offset = swp_slot_offset(slot);
- struct swap_cluster_info *ci;
- int count;
-
- ci = swap_cluster_lock(si, offset);
- count = swap_count(si->swap_map[offset]);
- swap_cluster_unlock(ci);
- return !!count;
-}
-
-/*
- * How many references to @entry are currently swapped out?
- * This considers COUNT_CONTINUED so it returns exact answer.
- */
-int swp_swapcount(swp_entry_t entry)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- int count, tmp_count, n;
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- struct page *page;
- pgoff_t offset;
- unsigned char *map;
-
- si = _swap_info_get(slot);
- if (!si)
- return 0;
-
- offset = swp_slot_offset(slot);
-
- ci = swap_cluster_lock(si, offset);
-
- count = swap_count(si->swap_map[offset]);
- if (!(count & COUNT_CONTINUED))
- goto out;
-
- count &= ~COUNT_CONTINUED;
- n = SWAP_MAP_MAX + 1;
-
- page = vmalloc_to_page(si->swap_map + offset);
- offset &= ~PAGE_MASK;
- VM_BUG_ON(page_private(page) != SWP_CONTINUED);
-
- do {
- page = list_next_entry(page, lru);
- map = kmap_local_page(page);
- tmp_count = map[offset];
- kunmap_local(map);
-
- count += (tmp_count & ~COUNT_CONTINUED) * n;
- n *= (SWAP_CONT_MAX + 1);
- } while (tmp_count & COUNT_CONTINUED);
-out:
- swap_cluster_unlock(ci);
- return count;
-}
-
-static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
- swp_entry_t entry, int order)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_cluster_info *ci;
- unsigned char *map = si->swap_map;
- unsigned int nr_pages = 1 << order;
- unsigned long roffset = swp_slot_offset(slot);
- unsigned long offset = round_down(roffset, nr_pages);
- int i;
- bool ret = false;
-
- ci = swap_cluster_lock(si, offset);
- if (nr_pages == 1) {
- if (swap_count(map[roffset]))
- ret = true;
- goto unlock_out;
- }
- for (i = 0; i < nr_pages; i++) {
- if (swap_count(map[offset + i])) {
- ret = true;
- break;
- }
- }
-unlock_out:
- swap_cluster_unlock(ci);
- return ret;
-}
-
-static bool folio_swapped(struct folio *folio)
-{
- swp_entry_t entry = folio->swap;
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *si = _swap_info_get(slot);
-
- if (!si)
- return false;
-
- if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
- return swap_entry_swapped(si, entry);
-
- return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
-}
-
static bool folio_swapcache_freeable(struct folio *folio)
{
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -1925,72 +1621,6 @@ bool folio_free_swap(struct folio *folio)
return true;
}
-/**
- * free_swap_and_cache_nr() - Release reference on range of swap entries and
- * reclaim their cache if no more references remain.
- * @entry: First entry of range.
- * @nr: Number of entries in range.
- *
- * For each swap entry in the contiguous range, release a reference. If any swap
- * entries become free, try to reclaim their underlying folios, if present. The
- * offset range is defined by [entry.offset, entry.offset + nr).
- */
-void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- const unsigned long start_offset = swp_slot_offset(slot);
- const unsigned long end_offset = start_offset + nr;
- struct swap_info_struct *si;
- bool any_only_cache = false;
- unsigned long offset;
-
- si = swap_slot_tryget_swap_info(slot);
- if (!si)
- return;
-
- if (WARN_ON(end_offset > si->max))
- goto out;
-
- /*
- * First free all entries in the range.
- */
- any_only_cache = swap_slots_put_map_nr(si, slot, nr);
-
- /*
- * Short-circuit the below loop if none of the entries had their
- * reference drop to zero.
- */
- if (!any_only_cache)
- goto out;
-
- /*
- * Now go back over the range trying to reclaim the swap cache.
- */
- for (offset = start_offset; offset < end_offset; offset += nr) {
- nr = 1;
- if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
- /*
- * Folios are always naturally aligned in swap so
- * advance forward to the next boundary. Zero means no
- * folio was found for the swap entry, so advance by 1
- * in this case. Negative value means folio was found
- * but could not be reclaimed. Here we can still advance
- * to the next boundary.
- */
- nr = __try_to_reclaim_swap(si, offset,
- TTRS_UNMAPPED | TTRS_FULL);
- if (nr == 0)
- nr = 1;
- else if (nr < 0)
- nr = -nr;
- nr = ALIGN(offset + 1, nr) - offset;
- }
- }
-
-out:
- swap_slot_put_swap_info(si);
-}
-
#ifdef CONFIG_HIBERNATION
swp_slot_t swap_slot_alloc_of_type(int type)
@@ -2901,8 +2531,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
flush_percpu_swap_cluster(p);
destroy_swap_extents(p);
- if (p->flags & SWP_CONTINUED)
- free_swap_count_continuations(p);
if (!(p->flags & SWP_SOLIDSTATE))
atomic_dec(&nr_rotate_swap);
@@ -3638,364 +3266,6 @@ void si_swapinfo(struct sysinfo *val)
spin_unlock(&swap_lock);
}
-/*
- * Verify that nr swap entries are valid and increment their swap map counts.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset;
- unsigned char count;
- unsigned char has_cache;
- int err, i;
-
- si = swap_slot_to_info(slot);
- if (WARN_ON_ONCE(!si)) {
- pr_err("%s%08lx\n", Bad_file, entry.val);
- return -EINVAL;
- }
-
- offset = swp_slot_offset(slot);
- VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- VM_WARN_ON(usage == 1 && nr > 1);
- ci = swap_cluster_lock(si, offset);
-
- err = 0;
- for (i = 0; i < nr; i++) {
- count = si->swap_map[offset + i];
-
- /*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
- */
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
- }
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (!count && !has_cache) {
- err = -ENOENT;
- } else if (usage == SWAP_HAS_CACHE) {
- if (has_cache)
- err = -EEXIST;
- } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
- err = -EINVAL;
- }
-
- if (err)
- goto unlock_out;
- }
-
- for (i = 0; i < nr; i++) {
- count = si->swap_map[offset + i];
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE)
- has_cache = SWAP_HAS_CACHE;
- else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
- count += usage;
- else if (swap_count_continued(si, offset + i, count))
- count = COUNT_CONTINUED;
- else {
- /*
- * Don't need to rollback changes, because if
- * usage == 1, there must be nr == 1.
- */
- err = -ENOMEM;
- goto unlock_out;
- }
-
- WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
- }
-
-unlock_out:
- swap_cluster_unlock(ci);
- return err;
-}
-
-/*
- * Help swapoff by noting that swap entry belongs to shmem/tmpfs
- * (in which case its reference count is never incremented).
- */
-void swap_shmem_alloc(swp_entry_t entry, int nr)
-{
- __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
-}
-
-/*
- * Increase reference count of swap entry by 1.
- * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
- * but could not be atomically allocated. Returns 0, just as if it succeeded,
- * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
- * might occur if a page table entry has got corrupted.
- */
-int swap_duplicate(swp_entry_t entry)
-{
- int err = 0;
-
- while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
- err = add_swap_count_continuation(entry, GFP_ATOMIC);
- return err;
-}
-
-/*
- * @entry: first swap entry from which we allocate nr swap cache.
- *
- * Called when allocating swap cache for existing swap entries,
- * This can return error codes. Returns 0 at success.
- * -EEXIST means there is a swap cache.
- * Note: return code is different from swap_duplicate().
- */
-int swapcache_prepare(swp_entry_t entry, int nr)
-{
- return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
-}
-
-/*
- * Caller should ensure entries belong to the same folio so
- * the entries won't span cross cluster boundary.
- */
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
- swap_slots_put_cache(si, swp_entry_to_swp_slot(entry), nr);
-}
-
-/*
- * add_swap_count_continuation - called when a swap count is duplicated
- * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
- * page of the original vmalloc'ed swap_map, to hold the continuation count
- * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
- * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
- *
- * These continuation pages are seldom referenced: the common paths all work
- * on the original swap_map, only referring to a continuation page when the
- * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
- *
- * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
- * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
- * can be called after dropping locks.
- */
-int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
-{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- struct page *head;
- struct page *page;
- struct page *list_page;
- pgoff_t offset;
- unsigned char count;
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- int ret = 0;
-
- /*
- * When debugging, it's easier to use __GFP_ZERO here; but it's better
- * for latency not to zero a page while GFP_ATOMIC and holding locks.
- */
- page = alloc_page(gfp_mask | __GFP_HIGHMEM);
-
- si = swap_slot_tryget_swap_info(slot);
- if (!si) {
- /*
- * An acceptable race has occurred since the failing
- * __swap_duplicate(): the swap device may be swapoff
- */
- goto outer;
- }
-
- offset = swp_slot_offset(slot);
-
- ci = swap_cluster_lock(si, offset);
-
- count = swap_count(si->swap_map[offset]);
-
- if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
- /*
- * The higher the swap count, the more likely it is that tasks
- * will race to add swap count continuation: we need to avoid
- * over-provisioning.
- */
- goto out;
- }
-
- if (!page) {
- ret = -ENOMEM;
- goto out;
- }
-
- head = vmalloc_to_page(si->swap_map + offset);
- offset &= ~PAGE_MASK;
-
- spin_lock(&si->cont_lock);
- /*
- * Page allocation does not initialize the page's lru field,
- * but it does always reset its private field.
- */
- if (!page_private(head)) {
- BUG_ON(count & COUNT_CONTINUED);
- INIT_LIST_HEAD(&head->lru);
- set_page_private(head, SWP_CONTINUED);
- si->flags |= SWP_CONTINUED;
- }
-
- list_for_each_entry(list_page, &head->lru, lru) {
- unsigned char *map;
-
- /*
- * If the previous map said no continuation, but we've found
- * a continuation page, free our allocation and use this one.
- */
- if (!(count & COUNT_CONTINUED))
- goto out_unlock_cont;
-
- map = kmap_local_page(list_page) + offset;
- count = *map;
- kunmap_local(map);
-
- /*
- * If this continuation count now has some space in it,
- * free our allocation and use this one.
- */
- if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
- goto out_unlock_cont;
- }
-
- list_add_tail(&page->lru, &head->lru);
- page = NULL; /* now it's attached, don't free it */
-out_unlock_cont:
- spin_unlock(&si->cont_lock);
-out:
- swap_cluster_unlock(ci);
- swap_slot_put_swap_info(si);
-outer:
- if (page)
- __free_page(page);
- return ret;
-}
-
-/*
- * swap_count_continued - when the original swap_map count is incremented
- * from SWAP_MAP_MAX, check if there is already a continuation page to carry
- * into, carry if so, or else fail until a new continuation page is allocated;
- * when the original swap_map count is decremented from 0 with continuation,
- * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or caller of swap_entry_put_locked()
- * holds cluster lock.
- */
-static bool swap_count_continued(struct swap_info_struct *si,
- pgoff_t offset, unsigned char count)
-{
- struct page *head;
- struct page *page;
- unsigned char *map;
- bool ret;
-
- head = vmalloc_to_page(si->swap_map + offset);
- if (page_private(head) != SWP_CONTINUED) {
- BUG_ON(count & COUNT_CONTINUED);
- return false; /* need to add count continuation */
- }
-
- spin_lock(&si->cont_lock);
- offset &= ~PAGE_MASK;
- page = list_next_entry(head, lru);
- map = kmap_local_page(page) + offset;
-
- if (count == SWAP_MAP_MAX) /* initial increment from swap_map */
- goto init_map; /* jump over SWAP_CONT_MAX checks */
-
- if (count == (SWAP_MAP_MAX | COUNT_CONTINUED)) { /* incrementing */
- /*
- * Think of how you add 1 to 999
- */
- while (*map == (SWAP_CONT_MAX | COUNT_CONTINUED)) {
- kunmap_local(map);
- page = list_next_entry(page, lru);
- BUG_ON(page == head);
- map = kmap_local_page(page) + offset;
- }
- if (*map == SWAP_CONT_MAX) {
- kunmap_local(map);
- page = list_next_entry(page, lru);
- if (page == head) {
- ret = false; /* add count continuation */
- goto out;
- }
- map = kmap_local_page(page) + offset;
-init_map: *map = 0; /* we didn't zero the page */
- }
- *map += 1;
- kunmap_local(map);
- while ((page = list_prev_entry(page, lru)) != head) {
- map = kmap_local_page(page) + offset;
- *map = COUNT_CONTINUED;
- kunmap_local(map);
- }
- ret = true; /* incremented */
-
- } else { /* decrementing */
- /*
- * Think of how you subtract 1 from 1000
- */
- BUG_ON(count != COUNT_CONTINUED);
- while (*map == COUNT_CONTINUED) {
- kunmap_local(map);
- page = list_next_entry(page, lru);
- BUG_ON(page == head);
- map = kmap_local_page(page) + offset;
- }
- BUG_ON(*map == 0);
- *map -= 1;
- if (*map == 0)
- count = 0;
- kunmap_local(map);
- while ((page = list_prev_entry(page, lru)) != head) {
- map = kmap_local_page(page) + offset;
- *map = SWAP_CONT_MAX | count;
- count = COUNT_CONTINUED;
- kunmap_local(map);
- }
- ret = count == COUNT_CONTINUED;
- }
-out:
- spin_unlock(&si->cont_lock);
- return ret;
-}
-
-/*
- * free_swap_count_continuations - swapoff free all the continuation pages
- * appended to the swap_map, after swap_map is quiesced, before vfree'ing it.
- */
-static void free_swap_count_continuations(struct swap_info_struct *si)
-{
- pgoff_t offset;
-
- for (offset = 0; offset < si->max; offset += PAGE_SIZE) {
- struct page *head;
- head = vmalloc_to_page(si->swap_map + offset);
- if (page_private(head)) {
- struct page *page, *next;
-
- list_for_each_entry_safe(page, next, &head->lru, lru) {
- list_del(&page->lru);
- __free_page(page);
- }
- }
- }
-}
-
#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
static bool __has_usable_swap(void)
{
diff --git a/mm/vswap.c b/mm/vswap.c
index 64747493ca9f7..318933071edc6 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -24,6 +24,8 @@
* For now, there is a one-to-one correspondence between a virtual swap slot
* and its associated physical swap slot.
*
+ * I. Allocation
+ *
* Virtual swap slots are organized into PMD-sized clusters, analogous to
* physical swap allocator. However, unlike the physical swap allocator,
* the clusters are dynamically allocated and freed on-demand. There is no
@@ -32,6 +34,26 @@
*
* This allows us to avoid the overhead of pre-allocating a large number of
* virtual swap clusters.
+ *
+ * II. Swap Entry Lifecycle
+ *
+ * The swap entry's lifecycle is managed at the virtual swap layer. Conceptually,
+ * each virtual swap slot has a reference count, which includes:
+ *
+ * 1. The number of page table entries that refer to the virtual swap slot, i.e
+ * its swap count.
+ *
+ * 2. Whether the virtual swap slot has been added to the swap cache - if so,
+ * its reference count is incremented by 1.
+ *
+ * Each virtual swap slot starts out with a reference count of 1 (since it is
+ * about to be added to the swap cache). Its reference count is incremented or
+ * decremented every time it is mapped to or unmapped from a PTE, as well as
+ * when it is added to or removed from the swap cache. Finally, when its
+ * reference count reaches 0, the virtual swap slot is freed.
+ *
+ * Note that we do not have a reference count field per se - it is derived from
+ * the swap_count and the in_swapcache fields.
*/
/**
@@ -42,6 +64,8 @@
* @swap_cache: The folio in swap cache.
* @shadow: The shadow entry.
* @memcgid: The memcg id of the owning memcg, if any.
+ * @swap_count: The number of page table entries that refer to the swap entry.
+ * @in_swapcache: Whether the swap entry is (about to be) pinned in swap cache.
*/
struct swp_desc {
swp_slot_t slot;
@@ -50,9 +74,14 @@ struct swp_desc {
struct folio *swap_cache;
void *shadow;
};
+
+ unsigned int swap_count;
+
#ifdef CONFIG_MEMCG
unsigned short memcgid;
#endif
+
+ bool in_swapcache;
};
#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER
@@ -249,6 +278,8 @@ static void __vswap_alloc_from_cluster(struct vswap_cluster *cluster, int start)
#ifdef CONFIG_MEMCG
desc->memcgid = 0;
#endif
+ desc->swap_count = 0;
+ desc->in_swapcache = true;
}
cluster->count += nr;
}
@@ -452,7 +483,7 @@ static inline void release_vswap_slot(struct vswap_cluster *cluster,
* Update the physical-to-virtual swap slot mapping.
* Caller must ensure the physical swap slot's cluster is locked.
*/
-static void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
+void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
unsigned long vswap, int nr)
{
atomic_long_t *table;
@@ -466,45 +497,50 @@ static void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
__swap_table_set(ci, ci_off + i, vswap ? vswap + i : 0);
}
-/**
- * vswap_free - free a virtual swap slot.
- * @entry: the virtual swap slot to free
- * @ci: the physical swap slot's cluster (optional, can be NULL)
+/*
+ * Entered with the cluster locked, but might unlock the cluster.
+ * This is because several operations, such as releasing physical swap slots
+ * (i.e swap_slot_free_nr()) require the cluster to be unlocked to avoid
+ * deadlocks.
*
- * If @ci is NULL, this function is called to clean up a virtual swap entry
- * when no linkage has been established between physical and virtual swap slots.
- * If @ci is provided, the caller must ensure it is locked.
+ * This is safe, because:
+ *
+ * 1. The swap entry to be freed has refcnt (swap count and swapcache pin)
+ * down to 0, so no one can change its internal state
+ *
+ * 2. The swap entry to be freed still holds a refcnt to the cluster, keeping
+ * the cluster itself valid.
+ *
+ * We will exit the function with the cluster re-locked.
*/
-void vswap_free(swp_entry_t entry, struct swap_cluster_info *ci)
+static void vswap_free(struct vswap_cluster *cluster, struct swp_desc *desc,
+ swp_entry_t entry)
{
- struct vswap_cluster *cluster = NULL;
- struct swp_desc *desc;
+ struct zswap_entry *zswap_entry;
+ swp_slot_t slot;
- if (!entry.val)
- return;
+ /* Clear shadow if present */
+ if (xa_is_value(desc->shadow))
+ desc->shadow = NULL;
- zswap_invalidate(entry);
- mem_cgroup_uncharge_swap(entry, 1);
+ slot = desc->slot;
+ desc->slot.val = 0;
- /* do not immediately erase the virtual slot to prevent its reuse */
- rcu_read_lock();
- desc = vswap_iter(&cluster, entry.val);
- if (!desc) {
- rcu_read_unlock();
- return;
+ zswap_entry = desc->zswap_entry;
+ if (zswap_entry) {
+ desc->zswap_entry = NULL;
+ zswap_entry_free(zswap_entry);
}
+ spin_unlock(&cluster->lock);
- /* Clear shadow if present */
- if (xa_is_value(desc->shadow))
- desc->shadow = NULL;
+ mem_cgroup_uncharge_swap(entry, 1);
- if (desc->slot.val)
- vswap_rmap_set(ci, desc->slot, 0, 1);
+ if (slot.val)
+ swap_slot_free_nr(slot, 1);
+ spin_lock(&cluster->lock);
/* erase forward mapping and release the virtual slot for reallocation */
release_vswap_slot(cluster, entry.val);
- spin_unlock(&cluster->lock);
- rcu_read_unlock();
}
/**
@@ -538,8 +574,12 @@ int folio_alloc_swap(struct folio *folio)
* fallback from zswap store failure).
*/
if (swap_slot_alloc(&slot, order)) {
- for (i = 0; i < nr; i++)
- vswap_free((swp_entry_t){entry.val + i}, NULL);
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+ vswap_free(cluster, desc, (swp_entry_t){ entry.val + i });
+ }
+ spin_unlock(&cluster->lock);
entry.val = 0;
return -ENOMEM;
}
@@ -603,9 +643,11 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
rcu_read_unlock();
return (swp_slot_t){0};
}
+
slot = desc->slot;
spin_unlock(&cluster->lock);
rcu_read_unlock();
+
return slot;
}
@@ -635,6 +677,352 @@ swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
return ret;
}
+/*
+ * Decrease the swap count of nr contiguous swap entries by 1 (when the swap
+ * entries are removed from a range of PTEs), and check if any of the swap
+ * entries are in swap cache only after their swap count is decreased.
+ *
+ * The check is racy, but it is OK because free_swap_and_cache_nr() only uses
+ * the result as a hint.
+ */
+static bool vswap_free_nr_any_cache_only(swp_entry_t entry, int nr)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ bool ret = false;
+ int i;
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val);
+ VM_WARN_ON(!desc);
+ ret |= (desc->swap_count == 1 && desc->in_swapcache);
+ desc->swap_count--;
+ if (!desc->swap_count && !desc->in_swapcache)
+ vswap_free(cluster, desc, entry);
+ entry.val++;
+ }
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return ret;
+}
+
+/**
+ * swap_free_nr - decrease the swap count of nr contiguous swap entries by 1
+ * (when the swap entries are removed from a range of PTEs).
+ * @entry: the first entry in the range.
+ * @nr: the number of entries in the range.
+ */
+void swap_free_nr(swp_entry_t entry, int nr)
+{
+ vswap_free_nr_any_cache_only(entry, nr);
+}
+
+static int swap_duplicate_nr(swp_entry_t entry, int nr)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ int i = 0;
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ if (!desc || (!desc->swap_count && !desc->in_swapcache))
+ goto done;
+ desc->swap_count++;
+ }
+done:
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ if (i && i < nr)
+ swap_free_nr(entry, i);
+
+ return i == nr ? 0 : -ENOENT;
+}
+
+/**
+ * swap_duplicate - increase the swap count of the swap entry by 1 (i.e when
+ * the swap entry is stored at a new PTE).
+ * @entry: the swap entry.
+ *
+ * Return: -ENOENT if we try to duplicate a non-existent swap entry.
+ */
+int swap_duplicate(swp_entry_t entry)
+{
+ return swap_duplicate_nr(entry, 1);
+}
+
+
+bool folio_swapped(struct folio *folio)
+{
+ struct vswap_cluster *cluster = NULL;
+ swp_entry_t entry = folio->swap;
+ int i, nr = folio_nr_pages(folio);
+ struct swp_desc *desc;
+ bool swapped = false;
+
+ if (!entry.val)
+ return false;
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ if (desc && desc->swap_count) {
+ swapped = true;
+ break;
+ }
+ }
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return swapped;
+}
+
+/**
+ * swp_swapcount - return the swap count of the swap entry.
+ * @entry: the swap entry.
+ *
+ * Note that all the swap count functions are identical in the new design,
+ * since we no longer need swap count continuation.
+ *
+ * Return: the swap count of the swap entry.
+ */
+int swp_swapcount(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ unsigned int ret;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ ret = desc ? desc->swap_count : 0;
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+int __swap_count(swp_entry_t entry)
+{
+ return swp_swapcount(entry);
+}
+
+bool swap_entry_swapped(swp_entry_t entry)
+{
+ return !!swp_swapcount(entry);
+}
+
+void swap_shmem_alloc(swp_entry_t entry, int nr)
+{
+ swap_duplicate_nr(entry, nr);
+}
+
+void swapcache_clear(swp_entry_t entry, int nr)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ int i;
+
+ if (!nr)
+ return;
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val);
+ desc->in_swapcache = false;
+ if (!desc->swap_count)
+ vswap_free(cluster, desc, entry);
+ entry.val++;
+ }
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+}
+
+int swapcache_prepare(swp_entry_t entry, int nr)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ int i, ret = 0;
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+
+ if (!desc) {
+ ret = -ENOENT;
+ goto done;
+ }
+
+ if (!desc->swap_count && !desc->in_swapcache) {
+ ret = -ENOENT;
+ goto done;
+ }
+
+ if (desc->in_swapcache) {
+ ret = -EEXIST;
+ goto done;
+ }
+
+ desc->in_swapcache = true;
+ }
+done:
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ if (i && i < nr)
+ swapcache_clear(entry, i);
+ if (i < nr && !ret)
+ ret = -ENOENT;
+ return ret;
+}
+
+/**
+ * is_swap_cached - check if the swap entry is cached
+ * @entry: swap entry to check
+ *
+ * Returns true if the swap entry is cached, false otherwise.
+ */
+bool is_swap_cached(swp_entry_t entry)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ bool cached;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ cached = desc ? desc->in_swapcache : false;
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ return cached;
+}
+
+/**
+ * vswap_only_has_cache - check if all the slots in the range are still valid,
+ * and are in swap cache only (i.e not stored in any
+ * PTEs).
+ * @entry: the first slot in the range.
+ * @nr: the number of slots in the range.
+ *
+ * Return: true if all the slots in the range are still valid, and are in swap
+ * cache only, or false otherwise.
+ */
+bool vswap_only_has_cache(swp_entry_t entry, int nr)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ int i = 0;
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ if (!desc || desc->swap_count || !desc->in_swapcache)
+ goto done;
+ }
+done:
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return i == nr;
+}
+
+/**
+ * non_swapcache_batch - count the longest range starting from a particular
+ * swap slot that are stil valid, but not in swap cache.
+ * @entry: the first slot to check.
+ * @max_nr: the maximum number of slots to check.
+ *
+ * Return: the number of slots in the longest range that are still valid, but
+ * not in swap cache.
+ */
+int non_swapcache_batch(swp_entry_t entry, int max_nr)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ int i;
+
+ if (!entry.val)
+ return 0;
+
+ rcu_read_lock();
+ for (i = 0; i < max_nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ if (!desc || desc->in_swapcache || !desc->swap_count)
+ goto done;
+ }
+done:
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return i;
+}
+
+/**
+ * free_swap_and_cache_nr() - Release a swap count on range of swap entries and
+ * reclaim their cache if no more references remain.
+ * @entry: First entry of range.
+ * @nr: Number of entries in range.
+ *
+ * For each swap entry in the contiguous range, release a swap count. If any
+ * swap entries have their swap count decremented to zero, try to reclaim their
+ * associated swap cache pages.
+ */
+void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+{
+ int i = 0, incr = 1;
+ struct folio *folio;
+
+ if (vswap_free_nr_any_cache_only(entry, nr)) {
+ while (i < nr) {
+ incr = 1;
+ if (vswap_only_has_cache(entry, 1)) {
+ folio = swap_cache_get_folio(entry);
+ if (!folio)
+ goto next;
+
+ if (!folio_trylock(folio)) {
+ folio_put(folio);
+ goto next;
+ }
+
+ if (!folio_matches_swap_entry(folio, entry)) {
+ folio_unlock(folio);
+ folio_put(folio);
+ goto next;
+ }
+
+ /*
+ * Folios are always naturally aligned in swap so
+ * advance forward to the next boundary.
+ */
+ incr = ALIGN(entry.val + 1, folio_nr_pages(folio)) - entry.val;
+ folio_free_swap(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+next:
+ i += incr;
+ entry.val += incr;
+ }
+ }
+}
+
+/*
+ * Called after dropping swapcache to decrease refcnt to swap entries.
+ */
+void put_swap_folio(struct folio *folio, swp_entry_t entry)
+{
+ int nr = folio_nr_pages(folio);
+
+ VM_WARN_ON(!folio_test_locked(folio));
+ swapcache_clear(entry, nr);
+}
+
bool tryget_swap_entry(swp_entry_t entry, struct swap_info_struct **si)
{
struct vswap_cluster *cluster;
@@ -869,8 +1257,8 @@ void *swap_cache_get_shadow(swp_entry_t entry)
* Context: Caller must ensure @entry is valid and protect the cluster with
* reference count or locks.
*
- * The caller also needs to update the corresponding swap_map slots with
- * SWAP_HAS_CACHE bit to avoid race or conflict.
+ * The caller also needs to obtain the swap entries' swap cache pins to avoid
+ * race or conflict.
*/
void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
{
diff --git a/mm/zswap.c b/mm/zswap.c
index 72441131f094e..e46349f9c90bb 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -703,7 +703,7 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
* Carries out the common pattern of freeing an entry's zsmalloc allocation,
* freeing the entry itself, and decrementing the number of stored pages.
*/
-static void zswap_entry_free(struct zswap_entry *entry)
+void zswap_entry_free(struct zswap_entry *entry)
{
zswap_lru_del(&zswap_list_lru, entry);
zs_free(entry->pool->zs_pool, entry->handle);
@@ -1627,18 +1627,6 @@ int zswap_load(struct folio *folio)
return 0;
}
-void zswap_invalidate(swp_entry_t swp)
-{
- struct zswap_entry *entry;
-
- if (zswap_empty(swp))
- return;
-
- entry = zswap_entry_erase(swp);
- if (entry)
- zswap_entry_free(entry);
-}
-
/*********************************
* debugfs functions
**********************************/
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (12 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 13/20] swap: manage swap entry lifecycle at the virtual swap layer Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-10 6:31 ` Dan Carpenter
2026-02-08 21:58 ` [PATCH v3 15/20] zswap: do not start zswap shrinker if there is no physical swap slots Nhat Pham
` (7 subsequent siblings)
21 siblings, 1 reply; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
This patch presents the first real use case of the new virtual swap
design. It leverages the virtualization of the swap space to decouple a
swap entry and its backing storage. A swap entry can now be backed by
one of the following options (a conceptual sketch follows this list):
1. A physical swap slot (i.e on a physical swapfile/swap partition).
2. A "zero swap page", i.e the swapped out page is a zero page.
3. A compressed object in the zswap pool.
4. An in-memory page. This can happen when a page is loaded
(exclusively) from the zswap pool, or if the page is rejected by
zswap and zswap writeback is disabled.
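To make the list above concrete, the backing states can be pictured as the
conceptual enum below. This is illustrative only and not part of the patch:
the series does not define such an enum, and the backing type is instead
derived from the fields of the virtual swap descriptor.
enum vswap_backing_state {		/* hypothetical, for illustration only */
	VSWAP_BACKING_PHYS_SLOT,	/* slot on a swapfile or swap partition */
	VSWAP_BACKING_ZERO_PAGE,	/* swapped-out content was zero-filled */
	VSWAP_BACKING_ZSWAP,		/* compressed object in the zswap pool */
	VSWAP_BACKING_FOLIO,		/* still-resident in-memory page */
};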
This allows us to use zswap and the zero swap page optimization without
having to reserve a slot on a swapfile, or indeed without a swapfile at
all. This translates to tens to hundreds of GBs of disk savings on hosts
and workloads with high memory usage, and it removes a spurious limit on
the usage of these optimizations.
One implication of this change is that we need to be much more careful
with THP swapin and batched swap free operations. The central
requirement is that the range of entries we are working with must have
no mixed backing states (an illustrative check is sketched after this
list):
1. For now, zswap-backed entries are not supported for these batched
operations.
2. All the entries must be backed by the same type of backing store.
3. If the swap entries in the batch are backed by an in-memory folio, it
must be the same folio (i.e they correspond to the subpages of that
folio).
4. If the swap entries in the batch are backed by slots on swapfiles,
they must all be on the same swapfile, and these physical swap slots
must also be contiguous.
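As a rough illustration of this constraint (not part of the patch), a
batched path could gate itself on helpers declared in this series;
batch_backing_is_uniform() below is a hypothetical wrapper, and the
precise semantics of the helpers are defined by their implementations
later in the series:
/* Assumes the declarations added to <linux/swap.h> by this series. */
static bool batch_backing_is_uniform(swp_entry_t entry, int nr)
{
	/* All entries backed by contiguous slots on a single swapfile. */
	if (vswap_swapfile_backed(entry, nr))
		return true;
	/* All entries backed by (subpages of) the same in-memory folio. */
	if (vswap_folio_backed(entry, nr))
		return true;
	/* Zswap-backed or mixed ranges: fall back to per-entry handling. */
	return false;
}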
For now, we still charge virtual swap slots towards the memcg's swap
usage. In a following patch, we will change this behavior and only
charge physical (i.e on swapfile) swap slots towards the memcg's swap
usage.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 14 +
include/linux/zswap.h | 3 +-
mm/internal.h | 14 +-
mm/memcontrol.c | 65 +++--
mm/memory.c | 84 ++++--
mm/page_io.c | 74 ++---
mm/shmem.c | 6 +-
mm/swap.h | 32 +--
mm/swap_state.c | 29 +-
mm/swapfile.c | 8 -
mm/vmscan.c | 19 +-
mm/vswap.c | 638 ++++++++++++++++++++++++++++++++++--------
mm/zswap.c | 45 ++-
13 files changed, 729 insertions(+), 302 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index aae2e502d9975..54df972608047 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -658,12 +658,26 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
int vswap_init(void);
void vswap_exit(void);
+bool vswap_alloc_swap_slot(struct folio *folio);
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry);
swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot);
bool tryget_swap_entry(swp_entry_t entry, struct swap_info_struct **si);
void put_swap_entry(swp_entry_t entry, struct swap_info_struct *si);
bool folio_swapped(struct folio *folio);
bool vswap_only_has_cache(swp_entry_t entry, int nr);
+int non_swapcache_batch(swp_entry_t entry, int nr);
+bool vswap_swapfile_backed(swp_entry_t entry, int nr);
+bool vswap_folio_backed(swp_entry_t entry, int nr);
+void vswap_store_folio(swp_entry_t entry, struct folio *folio);
+void swap_zeromap_folio_set(struct folio *folio);
+void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry);
+bool vswap_can_swapin_thp(swp_entry_t entry, int nr);
+static inline struct swap_info_struct *vswap_get_device(swp_entry_t entry)
+{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+
+ return slot.val ? swap_slot_tryget_swap_info(slot) : NULL;
+}
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 07b2936c38f29..f33b4433a5ee8 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -33,9 +33,8 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
bool zswap_is_enabled(void);
bool zswap_never_enabled(void);
-void *zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry);
+void zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry);
void *zswap_entry_load(swp_entry_t swpentry);
-void *zswap_entry_erase(swp_entry_t swpentry);
bool zswap_empty(swp_entry_t swpentry);
void zswap_entry_free(struct zswap_entry *entry);
diff --git a/mm/internal.h b/mm/internal.h
index 7ced0def684ca..cfe97501e4885 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -327,19 +327,7 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
return (swp_entry_t) { entry.val + n };
}
-/* similar to swap_nth, but check the backing physical slots as well. */
-static inline swp_entry_t swap_move(swp_entry_t entry, long delta)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot;
- swp_entry_t next_entry = swap_nth(entry, delta);
-
- next_slot = swp_entry_to_swp_slot(next_entry);
- if (swp_slot_type(slot) != swp_slot_type(next_slot) ||
- swp_slot_offset(slot) + delta != swp_slot_offset(next_slot))
- next_entry.val = 0;
-
- return next_entry;
-}
+swp_entry_t swap_move(swp_entry_t entry, long delta);
/**
* pte_move_swp_offset - Move the swap entry offset field of a swap pte
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86f43b7e5f710..2ba5811e7edba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5247,10 +5247,18 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
rcu_read_unlock();
}
+static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg);
+
long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
- long nr_swap_pages = get_nr_swap_pages();
+ long nr_swap_pages, nr_zswap_pages = 0;
+
+ if (zswap_is_enabled() && (mem_cgroup_disabled() || do_memsw_account() ||
+ mem_cgroup_may_zswap(memcg))) {
+ nr_zswap_pages = PAGE_COUNTER_MAX;
+ }
+ nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages());
if (mem_cgroup_disabled() || do_memsw_account())
return nr_swap_pages;
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
@@ -5419,6 +5427,29 @@ static struct cftype swap_files[] = {
};
#ifdef CONFIG_ZSWAP
+static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg)
+{
+ struct mem_cgroup *memcg;
+
+ for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
+ memcg = parent_mem_cgroup(memcg)) {
+ unsigned long max = READ_ONCE(memcg->zswap_max);
+ unsigned long pages;
+
+ if (max == PAGE_COUNTER_MAX)
+ continue;
+ if (max == 0)
+ return false;
+
+ /* Force flush to get accurate stats for charging */
+ __mem_cgroup_flush_stats(memcg, true);
+ pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
+ if (pages >= max)
+ return false;
+ }
+ return true;
+}
+
/**
* obj_cgroup_may_zswap - check if this cgroup can zswap
* @objcg: the object cgroup
@@ -5433,34 +5464,15 @@ static struct cftype swap_files[] = {
*/
bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
{
- struct mem_cgroup *memcg, *original_memcg;
+ struct mem_cgroup *memcg;
bool ret = true;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return true;
- original_memcg = get_mem_cgroup_from_objcg(objcg);
- for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
- memcg = parent_mem_cgroup(memcg)) {
- unsigned long max = READ_ONCE(memcg->zswap_max);
- unsigned long pages;
-
- if (max == PAGE_COUNTER_MAX)
- continue;
- if (max == 0) {
- ret = false;
- break;
- }
-
- /* Force flush to get accurate stats for charging */
- __mem_cgroup_flush_stats(memcg, true);
- pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
- if (pages < max)
- continue;
- ret = false;
- break;
- }
- mem_cgroup_put(original_memcg);
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ ret = mem_cgroup_may_zswap(memcg);
+ mem_cgroup_put(memcg);
return ret;
}
@@ -5604,6 +5616,11 @@ static struct cftype zswap_files[] = {
},
{ } /* terminate */
};
+#else
+static inline bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg)
+{
+ return false;
+}
#endif /* CONFIG_ZSWAP */
static int __init mem_cgroup_swap_init(void)
diff --git a/mm/memory.c b/mm/memory.c
index 641e3f65edc00..a16bf84ebaaf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4362,6 +4362,15 @@ static inline bool should_try_to_free_swap(struct folio *folio,
if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
folio_test_mlocked(folio))
return true;
+
+ /*
+ * Mixed and/or non-swapfile backends cannot be re-used for future swapouts
+ * anyway. Try to free swap space unless the folio is backed by contiguous
+ * physical swap slots.
+ */
+ if (!vswap_swapfile_backed(folio->swap, folio_nr_pages(folio)))
+ return true;
+
/*
* If we want to map a page that's in the swapcache writable, we
* have to detect via the refcount if we're really the exclusive
@@ -4623,12 +4632,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct folio *swapcache, *folio = NULL;
DECLARE_WAITQUEUE(wait, current);
struct page *page;
- struct swap_info_struct *si = NULL;
+ struct swap_info_struct *si = NULL, *stable_si;
rmap_t rmap_flags = RMAP_NONE;
bool need_clear_cache = false;
bool swapoff_locked = false;
bool exclusive = false;
- softleaf_t entry;
+ softleaf_t orig_entry, entry;
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
@@ -4641,6 +4650,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
entry = softleaf_from_pte(vmf->orig_pte);
+ /*
+ * entry might change if we get a large folio - remember the original entry
+ * for unlocking swapoff etc.
+ */
+ orig_entry = entry;
if (unlikely(!softleaf_is_swap(entry))) {
if (softleaf_is_migration(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
@@ -4705,7 +4719,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swapcache = folio;
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
+ if (si && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
folio = alloc_swap_folio(vmf);
@@ -4736,6 +4750,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
need_clear_cache = true;
+ /*
+ * Recheck to make sure the entire range is still
+ * THP-swapin-able. Note that before we call
+ * swapcache_prepare(), entries in the range can
+ * still have their backing status changed.
+ */
+ if (!vswap_can_swapin_thp(entry, nr_pages)) {
+ schedule_timeout_uninterruptible(1);
+ goto out_page;
+ }
+
memcg1_swapin(entry, nr_pages);
shadow = swap_cache_get_shadow(entry);
@@ -4916,27 +4941,40 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* swapcache -> certainly exclusive.
*/
exclusive = true;
- } else if (exclusive && folio_test_writeback(folio) &&
- data_race(si->flags & SWP_STABLE_WRITES)) {
+ } else if (exclusive && folio_test_writeback(folio)) {
/*
- * This is tricky: not all swap backends support
- * concurrent page modifications while under writeback.
- *
- * So if we stumble over such a page in the swapcache
- * we must not set the page exclusive, otherwise we can
- * map it writable without further checks and modify it
- * while still under writeback.
+ * We need to look up the swap device again here, because
+ * the si we got from tryget_swap_entry() might have changed
+ * before we pin the backend.
*
- * For these problematic swap backends, simply drop the
- * exclusive marker: this is perfectly fine as we start
- * writeback only if we fully unmapped the page and
- * there are no unexpected references on the page after
- * unmapping succeeded. After fully unmapped, no
- * further GUP references (FOLL_GET and FOLL_PIN) can
- * appear, so dropping the exclusive marker and mapping
- * it only R/O is fine.
+ * With the folio locked and loaded into the swap cache, we can
+ * now guarantee a stable backing state.
*/
- exclusive = false;
+ stable_si = vswap_get_device(entry);
+ if (stable_si && data_race(stable_si->flags & SWP_STABLE_WRITES)) {
+ /*
+ * This is tricky: not all swap backends support
+ * concurrent page modifications while under writeback.
+ *
+ * So if we stumble over such a page in the swapcache
+ * we must not set the page exclusive, otherwise we can
+ * map it writable without further checks and modify it
+ * while still under writeback.
+ *
+ * For these problematic swap backends, simply drop the
+ * exclusive marker: this is perfectly fine as we start
+ * writeback only if we fully unmapped the page and
+ * there are no unexpected references on the page after
+ * unmapping succeeded. After fully unmapped, no
+ * further GUP references (FOLL_GET and FOLL_PIN) can
+ * appear, so dropping the exclusive marker and mapping
+ * it only R/O is fine.
+ */
+ exclusive = false;
+ }
+
+ if (stable_si)
+ swap_slot_put_swap_info(stable_si);
}
}
@@ -5045,7 +5083,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
wake_up(&swapcache_wq);
}
if (swapoff_locked)
- put_swap_entry(entry, si);
+ put_swap_entry(orig_entry, si);
return ret;
out_nomap:
if (vmf->pte)
@@ -5064,7 +5102,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
wake_up(&swapcache_wq);
}
if (swapoff_locked)
- put_swap_entry(entry, si);
+ put_swap_entry(orig_entry, si);
return ret;
}
diff --git a/mm/page_io.c b/mm/page_io.c
index 5de3705572955..675ec6445609b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -201,44 +201,6 @@ static bool is_folio_zero_filled(struct folio *folio)
return true;
}
-static void swap_zeromap_folio_set(struct folio *folio)
-{
- struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
- struct swap_info_struct *sis =
- __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
- int nr_pages = folio_nr_pages(folio);
- swp_entry_t entry;
- swp_slot_t slot;
- unsigned int i;
-
- for (i = 0; i < folio_nr_pages(folio); i++) {
- entry = page_swap_entry(folio_page(folio, i));
- slot = swp_entry_to_swp_slot(entry);
- set_bit(swp_slot_offset(slot), sis->zeromap);
- }
-
- count_vm_events(SWPOUT_ZERO, nr_pages);
- if (objcg) {
- count_objcg_events(objcg, SWPOUT_ZERO, nr_pages);
- obj_cgroup_put(objcg);
- }
-}
-
-static void swap_zeromap_folio_clear(struct folio *folio)
-{
- struct swap_info_struct *sis =
- __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
- swp_entry_t entry;
- swp_slot_t slot;
- unsigned int i;
-
- for (i = 0; i < folio_nr_pages(folio); i++) {
- entry = page_swap_entry(folio_page(folio, i));
- slot = swp_entry_to_swp_slot(entry);
- clear_bit(swp_slot_offset(slot), sis->zeromap);
- }
-}
-
/*
* We may have stale swap cache pages in memory: notice
* them here and get rid of the unnecessary final write.
@@ -260,23 +222,22 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
goto out_unlock;
}
- /*
- * Use a bitmap (zeromap) to avoid doing IO for zero-filled pages.
- * The bits in zeromap are protected by the locked swapcache folio
- * and atomic updates are used to protect against read-modify-write
- * corruption due to other zero swap entries seeing concurrent updates.
- */
if (is_folio_zero_filled(folio)) {
swap_zeromap_folio_set(folio);
goto out_unlock;
}
/*
- * Clear bits this folio occupies in the zeromap to prevent zero data
- * being read in from any previous zero writes that occupied the same
- * swap entries.
+ * Release swap backends to make sure we do not have mixed backends.
+ *
+ * The only exception is if the folio is already backed by a
+ * contiguous range of physical swap slots (e.g. from a previous
+ * swapout attempt when zswap is disabled).
+ *
+ * Keep that backend to avoid reallocation of physical swap slots.
*/
- swap_zeromap_folio_clear(folio);
+ if (!vswap_swapfile_backed(folio->swap, folio_nr_pages(folio)))
+ vswap_store_folio(folio->swap, folio);
if (zswap_store(folio)) {
count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
@@ -287,6 +248,12 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
return AOP_WRITEPAGE_ACTIVATE;
}
+ /* fall back to physical swap device */
+ if (!vswap_alloc_swap_slot(folio)) {
+ folio_mark_dirty(folio);
+ return AOP_WRITEPAGE_ACTIVATE;
+ }
+
__swap_writepage(folio, swap_plug);
return 0;
out_unlock:
@@ -618,14 +585,11 @@ static void swap_read_folio_bdev_async(struct folio *folio,
void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis =
- __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
- bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
- bool workingset = folio_test_workingset(folio);
+ struct swap_info_struct *sis;
+ bool synchronous, workingset = folio_test_workingset(folio);
unsigned long pflags;
bool in_thrashing;
- VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
@@ -651,6 +615,10 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
/* We have to read from slower devices. Increase zswap protection. */
zswap_folio_swapin(folio);
+ sis = __swap_slot_to_info(swp_entry_to_swp_slot(folio->swap));
+ synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
+
if (data_race(sis->flags & SWP_FS_OPS)) {
swap_read_folio_fs(folio, plug);
} else if (synchronous) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 780571c830e5b..3a346cca114ab 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1459,7 +1459,7 @@ static unsigned int shmem_find_swap_entries(struct address_space *mapping,
* swapin error entries can be found in the mapping. But they're
* deliberately ignored here as we've done everything we can do.
*/
- if (swp_slot_type(slot) != type)
+ if (!slot.val || swp_slot_type(slot) != type)
continue;
indices[folio_batch_count(fbatch)] = xas.xa_index;
@@ -1604,7 +1604,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
if ((info->flags & SHMEM_F_LOCKED) || sbinfo->noswap)
goto redirty;
- if (!total_swap_pages)
+ if (!zswap_is_enabled() && !total_swap_pages)
goto redirty;
/*
@@ -2341,7 +2341,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap);
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+ if (si && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
/* Direct swapin skipping swap cache & readahead */
folio = shmem_swap_alloc_folio(inode, vma, index,
index_entry, order, gfp);
diff --git a/mm/swap.h b/mm/swap.h
index ae97cf9712c5c..d41e6a0e70753 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -310,35 +310,15 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap);
+ /* The folio might not be backed by any physical swap slot
+ * (e.g. if it is only zswap-backed).
+ */
+ if (!swp_slot.val)
+ return 0;
return __swap_slot_to_info(swp_slot)->flags;
}
-/*
- * Return the count of contiguous swap entries that share the same
- * zeromap status as the starting entry. If is_zeromap is not NULL,
- * it will return the zeromap status of the starting entry.
- */
-static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
- bool *is_zeromap)
-{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- struct swap_info_struct *sis = __swap_slot_to_info(slot);
- unsigned long start = swp_slot_offset(slot);
- unsigned long end = start + max_nr;
- bool first_bit;
-
- first_bit = test_bit(start, sis->zeromap);
- if (is_zeromap)
- *is_zeromap = first_bit;
-
- if (max_nr <= 1)
- return max_nr;
- if (first_bit)
- return find_next_zero_bit(sis->zeromap, end, start) - start;
- else
- return find_next_bit(sis->zeromap, end, start) - start;
-}
-
+int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap);
int non_swapcache_batch(swp_entry_t entry, int max_nr);
#else /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1827527e88d33..ad80bf098b63f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -179,6 +179,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct folio *result = NULL;
void *shadow = NULL;
+ /* we might get an unused entry from cluster readahead - just skip */
+ if (!entry.val)
+ return NULL;
+
*new_page_allocated = false;
for (;;) {
int err;
@@ -213,8 +217,20 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* Swap entry may have been freed since our caller observed it.
*/
err = swapcache_prepare(entry, 1);
- if (!err)
+ if (!err) {
+ /* This might be invoked by swap_cluster_readahead(), which can
+ * race with shmem_swapin_folio(). The latter might have already
+ * called swap_cache_del_folio(), allowing swapcache_prepare()
+ * to succeed here. This can lead to reading bogus data to populate
+ * the page. To prevent this, skip folio-backed virtual swap slots,
+ * and let the caller retry if necessary.
+ */
+ if (vswap_folio_backed(entry, 1)) {
+ swapcache_clear(entry, 1);
+ goto put_and_return;
+ }
break;
+ }
else if (err != -EEXIST)
goto put_and_return;
@@ -391,11 +407,18 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = slot_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
- struct swap_info_struct *si = __swap_slot_to_info(slot);
+ struct swap_info_struct *si = swap_slot_tryget_swap_info(slot);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
bool page_allocated;
+ /*
+ * The swap entry might not be backed by any physical swap slot. In that
+ * case, just skip readahead and bring in the target entry.
+ */
+ if (!si)
+ goto skip;
+
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
@@ -429,6 +452,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
swap_read_unplug(splug);
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
+ if (si)
+ swap_slot_put_swap_info(si);
/* The page was likely read above, so no need for plugging here */
folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c5e46bf40701..1aa29dd220f9a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1210,14 +1210,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
{
unsigned long end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
- unsigned int i;
-
- /*
- * Use atomic clear_bit operations only on zeromap instead of non-atomic
- * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
- */
- for (i = 0; i < nr_entries; i++)
- clear_bit(offset + i, si->zeromap);
if (si->flags & SWP_BLKDEV)
swap_slot_free_notify =
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c9ec1a1458b4e..6b200a6bb1160 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -365,10 +365,11 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
{
if (memcg == NULL) {
/*
- * For non-memcg reclaim, is there
- * space in any swap device?
+ * For non-memcg reclaim:
+ *
+ * Check if zswap is enabled or if there is space in any swap device.
*/
- if (get_nr_swap_pages() > 0)
+ if (zswap_is_enabled() || get_nr_swap_pages() > 0)
return true;
} else {
/* Is the memcg below its swap limit? */
@@ -2640,12 +2641,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
static bool can_age_anon_pages(struct lruvec *lruvec,
struct scan_control *sc)
{
- /* Aging the anon LRU is valuable if swap is present: */
- if (total_swap_pages > 0)
- return true;
-
- /* Also valuable if anon pages can be demoted: */
- return can_demote(lruvec_pgdat(lruvec)->node_id, sc,
+ /*
+ * Aging the anon LRU is valuable if zswap or physical swap is available,
+ * or if anon pages can be demoted.
+ */
+ return zswap_is_enabled() || total_swap_pages > 0 ||
+ can_demote(lruvec_pgdat(lruvec)->node_id, sc,
lruvec_memcg(lruvec));
}
diff --git a/mm/vswap.c b/mm/vswap.c
index 318933071edc6..fb6179ce3ace7 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -11,6 +11,7 @@
#include <linux/swap_cgroup.h>
#include <linux/cpuhotplug.h>
#include <linux/zswap.h>
+#include "internal.h"
#include "swap.h"
#include "swap_table.h"
@@ -54,22 +55,48 @@
*
* Note that we do not have a reference count field per se - it is derived from
* the swap_count and the in_swapcache fields.
+ *
+ * III. Backing State
+ *
+ * Each virtual swap slot can be backed by:
+ *
+ * 1. A slot on a physical swap device (i.e. a swapfile or a swap partition).
+ * 2. A swapped out zero-filled page.
+ * 3. A compressed object in zswap.
+ * 4. An in-memory folio that is backed by neither a physical swap device
+ * nor zswap (i.e. it is only in the swap cache). This is used for pages
+ * rejected by zswap but not (yet) backed by a physical swap device
+ * (e.g. due to zswap.writeback = 0), or for pages that were previously
+ * stored in zswap but have since been loaded back into memory (and had
+ * their zswap copy invalidated).
*/
+/* The backing state options of a virtual swap slot */
+enum swap_type {
+ VSWAP_SWAPFILE,
+ VSWAP_ZERO,
+ VSWAP_ZSWAP,
+ VSWAP_FOLIO
+};
+
/**
* Swap descriptor - metadata of a swapped out page.
*
* @slot: The handle to the physical swap slot backing this page.
* @zswap_entry: The zswap entry associated with this swap slot.
- * @swap_cache: The folio in swap cache.
+ * @swap_cache: The folio in swap cache. If the swap entry backing type is
+ * VSWAP_FOLIO, the backend is also stored here.
* @shadow: The shadow entry.
- * @memcgid: The memcg id of the owning memcg, if any.
* @swap_count: The number of page table entries that refer to the swap entry.
+ * @memcgid: The memcg id of the owning memcg, if any.
* @in_swapcache: Whether the swap entry is (about to be) pinned in swap cache.
+ * @type: The backing store type of the swap entry.
*/
struct swp_desc {
- swp_slot_t slot;
- struct zswap_entry *zswap_entry;
+ union {
+ swp_slot_t slot;
+ struct zswap_entry *zswap_entry;
+ };
union {
struct folio *swap_cache;
void *shadow;
@@ -78,10 +105,10 @@ struct swp_desc {
unsigned int swap_count;
#ifdef CONFIG_MEMCG
- unsigned short memcgid;
+ unsigned short memcgid:16;
#endif
-
- bool in_swapcache;
+ bool in_swapcache:1;
+ enum swap_type type:2;
};
#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER
@@ -266,15 +293,16 @@ static bool cluster_is_alloc_candidate(struct vswap_cluster *cluster)
return cluster->count + (1 << (cluster->order)) <= VSWAP_CLUSTER_SIZE;
}
-static void __vswap_alloc_from_cluster(struct vswap_cluster *cluster, int start)
+static void __vswap_alloc_from_cluster(struct vswap_cluster *cluster,
+ int start, struct folio *folio)
{
int i, nr = 1 << cluster->order;
struct swp_desc *desc;
for (i = 0; i < nr; i++) {
desc = &cluster->descriptors[start + i];
- desc->slot.val = 0;
- desc->zswap_entry = NULL;
+ desc->type = VSWAP_FOLIO;
+ desc->swap_cache = folio;
#ifdef CONFIG_MEMCG
desc->memcgid = 0;
#endif
@@ -284,7 +312,8 @@ static void __vswap_alloc_from_cluster(struct vswap_cluster *cluster, int start)
cluster->count += nr;
}
-static unsigned long vswap_alloc_from_cluster(struct vswap_cluster *cluster)
+static unsigned long vswap_alloc_from_cluster(struct vswap_cluster *cluster,
+ struct folio *folio)
{
int nr = 1 << cluster->order;
unsigned long i = cluster->id ? 0 : nr;
@@ -303,16 +332,16 @@ static unsigned long vswap_alloc_from_cluster(struct vswap_cluster *cluster)
bitmap_set(cluster->bitmap, i, nr);
refcount_add(nr, &cluster->refcnt);
- __vswap_alloc_from_cluster(cluster, i);
+ __vswap_alloc_from_cluster(cluster, i, folio);
return i + (cluster->id << VSWAP_CLUSTER_SHIFT);
}
/* Allocate a contiguous range of virtual swap slots */
-static swp_entry_t vswap_alloc(int order)
+static swp_entry_t vswap_alloc(struct folio *folio)
{
struct xa_limit limit = vswap_cluster_map_limit;
struct vswap_cluster *local, *cluster;
- int nr = 1 << order;
+ int order = folio_order(folio), nr = 1 << order;
bool need_caching = true;
u32 cluster_id;
swp_entry_t entry;
@@ -325,7 +354,7 @@ static swp_entry_t vswap_alloc(int order)
cluster = this_cpu_read(percpu_vswap_cluster.clusters[order]);
if (cluster) {
spin_lock(&cluster->lock);
- entry.val = vswap_alloc_from_cluster(cluster);
+ entry.val = vswap_alloc_from_cluster(cluster, folio);
need_caching = !entry.val;
if (!entry.val || !cluster_is_alloc_candidate(cluster)) {
@@ -352,7 +381,7 @@ static swp_entry_t vswap_alloc(int order)
if (!spin_trylock(&cluster->lock))
continue;
- entry.val = vswap_alloc_from_cluster(cluster);
+ entry.val = vswap_alloc_from_cluster(cluster, folio);
list_del_init(&cluster->list);
cluster->full = !entry.val || !cluster_is_alloc_candidate(cluster);
need_caching = !cluster->full;
@@ -384,7 +413,7 @@ static swp_entry_t vswap_alloc(int order)
if (!cluster_id)
entry.val += nr;
__vswap_alloc_from_cluster(cluster,
- (entry.val & VSWAP_CLUSTER_MASK));
+ (entry.val & VSWAP_CLUSTER_MASK), folio);
/* Mark the allocated range in the bitmap */
bitmap_set(cluster->bitmap, (entry.val & VSWAP_CLUSTER_MASK), nr);
need_caching = cluster_is_alloc_candidate(cluster);
@@ -497,6 +526,84 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
__swap_table_set(ci, ci_off + i, vswap ? vswap + i : 0);
}
+/*
+ * Callers need to handle races with other operations themselves.
+ *
+ * Specifically, this function is safe to be called in contexts where the swap
+ * entry has been added to the swap cache and the associated folio is locked.
+ * We cannot race with other accessors, and the swap entry is guaranteed to be
+ * valid the whole time (since swap cache implies one refcount).
+ *
+ * We cannot assume that the backends will be of the same type,
+ * contiguous, etc. We might have a large folio coalesced from subpages with
+ * mixed backend, which is only rectified when it is reclaimed.
+ */
+static void release_backing(swp_entry_t entry, int nr)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ unsigned long flush_nr, phys_swap_start = 0, phys_swap_end = 0;
+ unsigned int phys_swap_type = 0;
+ bool need_flushing_phys_swap = false;
+ swp_slot_t flush_slot;
+ int i;
+
+ VM_WARN_ON(!entry.val);
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+
+ /*
+ * We batch contiguous physical swap slots for more efficient
+ * freeing.
+ */
+ if (phys_swap_start != phys_swap_end &&
+ (desc->type != VSWAP_SWAPFILE ||
+ swp_slot_type(desc->slot) != phys_swap_type ||
+ swp_slot_offset(desc->slot) != phys_swap_end)) {
+ need_flushing_phys_swap = true;
+ flush_slot = swp_slot(phys_swap_type, phys_swap_start);
+ flush_nr = phys_swap_end - phys_swap_start;
+ phys_swap_start = phys_swap_end = 0;
+ }
+
+ if (desc->type == VSWAP_ZSWAP && desc->zswap_entry) {
+ zswap_entry_free(desc->zswap_entry);
+ } else if (desc->type == VSWAP_SWAPFILE) {
+ if (!phys_swap_start) {
+ /* start a new contiguous range of phys swap */
+ phys_swap_start = swp_slot_offset(desc->slot);
+ phys_swap_end = phys_swap_start + 1;
+ phys_swap_type = swp_slot_type(desc->slot);
+ } else {
+ /* extend the current contiguous range of phys swap */
+ phys_swap_end++;
+ }
+ }
+
+ desc->slot.val = 0;
+
+ if (need_flushing_phys_swap) {
+ spin_unlock(&cluster->lock);
+ cluster = NULL;
+ swap_slot_free_nr(flush_slot, flush_nr);
+ need_flushing_phys_swap = false;
+ }
+ }
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ /* Flush any remaining physical swap range */
+ if (phys_swap_start) {
+ flush_slot = swp_slot(phys_swap_type, phys_swap_start);
+ flush_nr = phys_swap_end - phys_swap_start;
+ swap_slot_free_nr(flush_slot, flush_nr);
+ }
+ }
+
/*
* Entered with the cluster locked, but might unlock the cluster.
* This is because several operations, such as releasing physical swap slots
@@ -516,35 +623,21 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
static void vswap_free(struct vswap_cluster *cluster, struct swp_desc *desc,
swp_entry_t entry)
{
- struct zswap_entry *zswap_entry;
- swp_slot_t slot;
-
/* Clear shadow if present */
if (xa_is_value(desc->shadow))
desc->shadow = NULL;
-
- slot = desc->slot;
- desc->slot.val = 0;
-
- zswap_entry = desc->zswap_entry;
- if (zswap_entry) {
- desc->zswap_entry = NULL;
- zswap_entry_free(zswap_entry);
- }
spin_unlock(&cluster->lock);
+ release_backing(entry, 1);
mem_cgroup_uncharge_swap(entry, 1);
- if (slot.val)
- swap_slot_free_nr(slot, 1);
-
- spin_lock(&cluster->lock);
/* erase forward mapping and release the virtual slot for reallocation */
+ spin_lock(&cluster->lock);
release_vswap_slot(cluster, entry.val);
}
/**
- * folio_alloc_swap - allocate swap space for a folio.
+ * folio_alloc_swap - allocate virtual swap space for a folio.
* @folio: the folio.
*
* Return: 0, if the allocation succeeded, -ENOMEM, if the allocation failed.
@@ -552,38 +645,77 @@ static void vswap_free(struct vswap_cluster *cluster, struct swp_desc *desc,
int folio_alloc_swap(struct folio *folio)
{
struct vswap_cluster *cluster = NULL;
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- int i, nr = folio_nr_pages(folio), order = folio_order(folio);
+ int i, nr = folio_nr_pages(folio);
struct swp_desc *desc;
swp_entry_t entry;
- swp_slot_t slot;
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
- entry = vswap_alloc(folio_order(folio));
+ entry = vswap_alloc(folio);
if (!entry.val)
return -ENOMEM;
/*
- * XXX: for now, we always allocate a physical swap slot for each virtual
- * swap slot, and their lifetime are coupled. This will change once we
- * decouple virtual swap slots from their backing states, and only allocate
- * physical swap slots for them on demand (i.e on zswap writeback, or
- * fallback from zswap store failure).
+ * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
+ * swap slots allocation. This will be changed soon - we will only charge on
+ * physical swap slots allocation.
*/
- if (swap_slot_alloc(&slot, order)) {
+ if (mem_cgroup_try_charge_swap(folio, entry)) {
+ rcu_read_lock();
for (i = 0; i < nr; i++) {
desc = vswap_iter(&cluster, entry.val + i);
VM_WARN_ON(!desc);
vswap_free(cluster, desc, (swp_entry_t){ entry.val + i });
}
spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ atomic_add(nr, &vswap_alloc_reject);
entry.val = 0;
return -ENOMEM;
}
+ swap_cache_add_folio(folio, entry, NULL);
+
+ return 0;
+}
+
+/**
+ * vswap_alloc_swap_slot - allocate physical swap space for a folio that is
+ * already associated with virtual swap slots.
+ * @folio: folio we want to allocate physical swap space for.
+ *
+ * Note that this does NOT release existing swap backends of the folio.
+ * Callers need to handle this themselves.
+ *
+ * Return: true if the folio is now backed by physical swap slots, false
+ * otherwise.
+ */
+bool vswap_alloc_swap_slot(struct folio *folio)
+{
+ int i, nr = folio_nr_pages(folio);
+ struct vswap_cluster *cluster = NULL;
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ swp_slot_t slot = { .val = 0 };
+ swp_entry_t entry = folio->swap;
+ struct swp_desc *desc;
+ bool fallback = false;
+
+ /*
+ * We might have already allocated a backing physical swap slot in past
+ * attempts (for instance, when zswap is disabled). If the entire range is
+ * already swapfile-backed, we can skip the allocation.
+ */
+ if (vswap_swapfile_backed(entry, nr))
+ return true;
+
+ if (swap_slot_alloc(&slot, folio_order(folio)))
+ return false;
+
+ if (!slot.val)
+ return false;
+
+ /* establish the virtual <-> physical swap slot linkages. */
si = __swap_slot_to_info(slot);
ci = swap_cluster_lock(si, swp_slot_offset(slot));
@@ -595,29 +727,29 @@ int folio_alloc_swap(struct folio *folio)
desc = vswap_iter(&cluster, entry.val + i);
VM_WARN_ON(!desc);
+ if (desc->type == VSWAP_FOLIO) {
+ /* case 1: fallback from zswap store failure */
+ fallback = true;
+ if (!folio)
+ folio = desc->swap_cache;
+ else
+ VM_WARN_ON(folio != desc->swap_cache);
+ } else {
+ /*
+ * Case 2: zswap writeback.
+ *
+ * No need to free zswap entry here - it will be freed once zswap
+ * writeback succeeds.
+ */
+ VM_WARN_ON(desc->type != VSWAP_ZSWAP);
+ VM_WARN_ON(fallback);
+ }
+ desc->type = VSWAP_SWAPFILE;
desc->slot.val = slot.val + i;
}
- if (cluster)
- spin_unlock(&cluster->lock);
+ spin_unlock(&cluster->lock);
rcu_read_unlock();
-
- /*
- * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
- * swap slots allocation. This is acceptable because as noted above, each
- * virtual swap slot corresponds to a physical swap slot. Once we have
- * decoupled virtual and physical swap slots, we will only charge when we
- * actually allocate a physical swap slot.
- */
- if (mem_cgroup_try_charge_swap(folio, entry))
- goto out_free;
-
- swap_cache_add_folio(folio, entry, NULL);
-
- return 0;
-
-out_free:
- put_swap_folio(folio, entry);
- return -ENOMEM;
+ return true;
}
/**
@@ -625,7 +757,9 @@ int folio_alloc_swap(struct folio *folio)
* virtual swap slot.
* @entry: the virtual swap slot.
*
- * Return: the physical swap slot corresponding to the virtual swap slot.
+ * Return: the physical swap slot corresponding to the virtual swap slot, if
+ * one exists, or the zero physical swap slot if the virtual swap slot is not
+ * backed by any physical slot on a swapfile.
*/
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
{
@@ -644,7 +778,10 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
return (swp_slot_t){0};
}
- slot = desc->slot;
+ if (desc->type != VSWAP_SWAPFILE)
+ slot.val = 0;
+ else
+ slot = desc->slot;
spin_unlock(&cluster->lock);
rcu_read_unlock();
@@ -962,6 +1099,293 @@ int non_swapcache_batch(swp_entry_t entry, int max_nr)
return i;
}
+/**
+ * vswap_store_folio - set a folio as the backing of a range of virtual swap
+ * slots.
+ * @entry: the first virtual swap slot in the range.
+ * @folio: the folio.
+ */
+void vswap_store_folio(swp_entry_t entry, struct folio *folio)
+{
+ struct vswap_cluster *cluster = NULL;
+ int i, nr = folio_nr_pages(folio);
+ struct swp_desc *desc;
+
+ VM_BUG_ON(!folio_test_locked(folio));
+ VM_BUG_ON(folio->swap.val != entry.val);
+
+ release_backing(entry, nr);
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+ desc->type = VSWAP_FOLIO;
+ desc->swap_cache = folio;
+ }
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+}
+
+/**
+ * swap_zeromap_folio_set - mark a range of virtual swap slots corresponding to
+ * a folio as zero-filled.
+ * @folio: the folio
+ */
+void swap_zeromap_folio_set(struct folio *folio)
+{
+ struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
+ struct vswap_cluster *cluster = NULL;
+ swp_entry_t entry = folio->swap;
+ int i, nr = folio_nr_pages(folio);
+ struct swp_desc *desc;
+
+ VM_BUG_ON(!folio_test_locked(folio));
+ VM_BUG_ON(!entry.val);
+
+ release_backing(entry, nr);
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ VM_WARN_ON(!desc);
+ desc->type = VSWAP_ZERO;
+ }
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+
+ count_vm_events(SWPOUT_ZERO, nr);
+ if (objcg) {
+ count_objcg_events(objcg, SWPOUT_ZERO, nr);
+ obj_cgroup_put(objcg);
+ }
+}
+
+/*
+ * Iterate through the entire range of virtual swap slots, returning the
+ * longest contiguous range of slots starting from the first slot that satisfies:
+ *
+ * 1. If the first slot is zero-mapped, the entire range should be
+ * zero-mapped.
+ * 2. If the first slot is backed by a swapfile, the entire range should
+ * be backed by a range of contiguous swap slots on the same swapfile.
+ * 3. If the first slot is zswap-backed, the entire range should be
+ * zswap-backed.
+ * 4. If the first slot is backed by a folio, the entire range should
+ * be backed by the same folio.
+ *
+ * Note that this check is racy unless we can ensure that the entire range
+ * has a stable backing state - for instance, if the caller was the one
+ * who set the swap cache pin.
+ */
+static int vswap_check_backing(swp_entry_t entry, enum swap_type *type, int nr)
+{
+ unsigned int swapfile_type;
+ struct vswap_cluster *cluster = NULL;
+ enum swap_type first_type;
+ struct swp_desc *desc;
+ pgoff_t first_offset;
+ struct folio *folio;
+ int i = 0;
+
+ if (!entry.val)
+ return 0;
+
+ rcu_read_lock();
+ for (i = 0; i < nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ if (!desc)
+ goto done;
+
+ if (!i) {
+ first_type = desc->type;
+ if (first_type == VSWAP_SWAPFILE) {
+ swapfile_type = swp_slot_type(desc->slot);
+ first_offset = swp_slot_offset(desc->slot);
+ } else if (first_type == VSWAP_FOLIO) {
+ folio = desc->swap_cache;
+ }
+ } else if (desc->type != first_type) {
+ goto done;
+ } else if (first_type == VSWAP_SWAPFILE &&
+ (swp_slot_type(desc->slot) != swapfile_type ||
+ swp_slot_offset(desc->slot) != first_offset + i)) {
+ goto done;
+ } else if (first_type == VSWAP_FOLIO && desc->swap_cache != folio) {
+ goto done;
+ }
+ }
+done:
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ if (type)
+ *type = first_type;
+ return i;
+}
+
+/**
+ * vswap_swapfile_backed - check if the virtual swap slots are backed by physical
+ * swap slots.
+ * @entry: the first entry in the range.
+ * @nr: the number of entries in the range.
+ */
+bool vswap_swapfile_backed(swp_entry_t entry, int nr)
+{
+ enum swap_type type;
+
+ return vswap_check_backing(entry, &type, nr) == nr
+ && type == VSWAP_SWAPFILE;
+}
+
+/**
+ * vswap_folio_backed - check if the virtual swap slots are backed by in-memory
+ * pages.
+ * @entry: the first virtual swap slot in the range.
+ * @nr: the number of slots in the range.
+ */
+bool vswap_folio_backed(swp_entry_t entry, int nr)
+{
+ enum swap_type type;
+
+ return vswap_check_backing(entry, &type, nr) == nr && type == VSWAP_FOLIO;
+}
+
+/**
+ * vswap_can_swapin_thp - check if the swap entries can be swapped in as a THP.
+ * @entry: the first virtual swap slot in the range.
+ * @nr: the number of slots in the range.
+ *
+ * For now, we can only swap in a THP if the entire range is zero-filled, or if
+ * the entire range is backed by a contiguous range of physical swap slots on a
+ * swapfile.
+ */
+bool vswap_can_swapin_thp(swp_entry_t entry, int nr)
+{
+ enum swap_type type;
+
+ return vswap_check_backing(entry, &type, nr) == nr &&
+ (type == VSWAP_ZERO || type == VSWAP_SWAPFILE);
+}
+
+/**
+ * swap_move - increment the swap slot by delta, checking the backing state
+ * and returning 0 if the backing state does not match (i.e. wrong
+ * backing store type, or wrong offset on the backing store).
+ * @entry: the original virtual swap slot.
+ * @delta: the offset to increment the original slot.
+ *
+ * Note that this function is racy unless we can pin the backing state of these
+ * swap slots down with swapcache_prepare().
+ *
+ * Caller should only rely on this function as a best-effort hint otherwise,
+ * and should double-check after ensuring the whole range is pinned down.
+ *
+ * Return: the incremented virtual swap slot if the backing state matches, or
+ * 0 if the backing state does not match.
+ */
+swp_entry_t swap_move(swp_entry_t entry, long delta)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc, *next_desc;
+ swp_entry_t next_entry;
+ struct folio *folio = NULL, *next_folio = NULL;
+ enum swap_type type, next_type;
+ swp_slot_t slot = {0}, next_slot = {0};
+
+ next_entry.val = entry.val + delta;
+
+ rcu_read_lock();
+
+ /* Look up first descriptor and get its type and backing store */
+ desc = vswap_iter(&cluster, entry.val);
+ if (!desc) {
+ rcu_read_unlock();
+ return (swp_entry_t){0};
+ }
+
+ type = desc->type;
+ if (type == VSWAP_ZSWAP) {
+ /* zswap not supported for move */
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ return (swp_entry_t){0};
+ }
+ if (type == VSWAP_FOLIO)
+ folio = desc->swap_cache;
+ else if (type == VSWAP_SWAPFILE)
+ slot = desc->slot;
+
+ /* Look up second descriptor and get its type and backing store */
+ next_desc = vswap_iter(&cluster, next_entry.val);
+ if (!next_desc) {
+ rcu_read_unlock();
+ return (swp_entry_t){0};
+ }
+
+ next_type = next_desc->type;
+ if (next_type == VSWAP_FOLIO)
+ next_folio = next_desc->swap_cache;
+ else if (next_type == VSWAP_SWAPFILE)
+ next_slot = next_desc->slot;
+
+ if (cluster)
+ spin_unlock(&cluster->lock);
+
+ rcu_read_unlock();
+
+ /* Check if types match */
+ if (next_type != type)
+ return (swp_entry_t){0};
+
+ /* Check backing state consistency */
+ if (type == VSWAP_SWAPFILE &&
+ (swp_slot_type(next_slot) != swp_slot_type(slot) ||
+ swp_slot_offset(next_slot) !=
+ swp_slot_offset(slot) + delta))
+ return (swp_entry_t){0};
+
+ if (type == VSWAP_FOLIO && next_folio != folio)
+ return (swp_entry_t){0};
+
+ return next_entry;
+}
+
+/*
+ * Return the count of contiguous swap entries that share the same
+ * VSWAP_ZERO status as the starting entry. If is_zeromap is not NULL,
+ * it will return the VSWAP_ZERO status of the starting entry.
+ */
+int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ int i = 0;
+ bool is_zero = false;
+
+ VM_WARN_ON(!entry.val);
+
+ rcu_read_lock();
+ for (i = 0; i < max_nr; i++) {
+ desc = vswap_iter(&cluster, entry.val + i);
+ if (!desc)
+ goto done;
+
+ if (!i)
+ is_zero = (desc->type == VSWAP_ZERO);
+ else if ((desc->type == VSWAP_ZERO) != is_zero)
+ goto done;
+ }
+done:
+ if (cluster)
+ spin_unlock(&cluster->lock);
+ rcu_read_unlock();
+ if (i && is_zeromap)
+ *is_zeromap = is_zero;
+
+ return i;
+}
+
/**
* free_swap_and_cache_nr() - Release a swap count on range of swap entries and
* reclaim their cache if no more references remain.
@@ -1028,11 +1452,6 @@ bool tryget_swap_entry(swp_entry_t entry, struct swap_info_struct **si)
struct vswap_cluster *cluster;
swp_slot_t slot;
- slot = swp_entry_to_swp_slot(entry);
- *si = swap_slot_tryget_swap_info(slot);
- if (!*si)
- return false;
-
/*
* Ensure the cluster and its associated data structures (swap cache etc.)
* remain valid.
@@ -1041,11 +1460,30 @@ bool tryget_swap_entry(swp_entry_t entry, struct swap_info_struct **si)
cluster = xa_load(&vswap_cluster_map, VSWAP_CLUSTER_IDX(entry));
if (!cluster || !refcount_inc_not_zero(&cluster->refcnt)) {
rcu_read_unlock();
- swap_slot_put_swap_info(*si);
*si = NULL;
return false;
}
rcu_read_unlock();
+
+ slot = swp_entry_to_swp_slot(entry);
+ /*
+ * Note that this function does not provide any guarantee that the virtual
+ * swap slot's backing state will be stable. This has several implications:
+ *
+ * 1. We have to obtain a reference to the swap device itself, because we
+ * need swap device's metadata in certain scenarios, for example when we
+ * need to inspect the swap device flag in do_swap_page().
+ *
+ * 2. The swap device we are looking up here might be outdated by the time we
+ * return to the caller. It is perfectly OK, if the swap_info_struct is only
+ * used in a best-effort manner (i.e optimization). If we need the precise
+ * backing state, we need to re-check after the entry is pinned in swapcache.
+ */
+ if (slot.val)
+ *si = swap_slot_tryget_swap_info(slot);
+ else
+ *si = NULL;
+
return true;
}
@@ -1288,7 +1726,7 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
old = desc->shadow;
/* Warn if slot is already occupied by a folio */
- VM_WARN_ON_FOLIO(old && !xa_is_value(old), folio);
+ VM_WARN_ON_FOLIO(old && !xa_is_value(old) && old != folio, folio);
/* Save shadow if found and not yet saved */
if (shadowp && xa_is_value(old) && !*shadowp)
@@ -1415,29 +1853,22 @@ void __swap_cache_replace_folio(struct folio *old, struct folio *new)
* @entry: the zswap entry to store
*
* Stores a zswap entry in the swap descriptor for the given swap entry.
- * The cluster is locked during the store operation.
- *
- * Return: the old zswap entry if one existed, NULL otherwise
+ * Releases the old backend if one existed.
*/
-void *zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry)
+void zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry)
{
struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
- void *old;
+
+ release_backing(swpentry, 1);
rcu_read_lock();
desc = vswap_iter(&cluster, swpentry.val);
- if (!desc) {
- rcu_read_unlock();
- return NULL;
- }
-
- old = desc->zswap_entry;
+ VM_WARN_ON(!desc);
desc->zswap_entry = entry;
+ desc->type = VSWAP_ZSWAP;
spin_unlock(&cluster->lock);
rcu_read_unlock();
-
- return old;
}
/**
@@ -1452,6 +1883,7 @@ void *zswap_entry_load(swp_entry_t swpentry)
{
struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
+ enum swap_type type;
void *zswap_entry;
rcu_read_lock();
@@ -1461,41 +1893,15 @@ void *zswap_entry_load(swp_entry_t swpentry)
return NULL;
}
+ type = desc->type;
zswap_entry = desc->zswap_entry;
spin_unlock(&cluster->lock);
rcu_read_unlock();
- return zswap_entry;
-}
-
-/**
- * zswap_entry_erase - erase a zswap entry for a swap entry
- * @swpentry: the swap entry
- *
- * Erases the zswap entry from the swap descriptor for the given swap entry.
- * The cluster is locked during the erase operation.
- *
- * Return: the zswap entry that was erased, NULL if none existed
- */
-void *zswap_entry_erase(swp_entry_t swpentry)
-{
- struct vswap_cluster *cluster = NULL;
- struct swp_desc *desc;
- void *old;
-
- rcu_read_lock();
- desc = vswap_iter(&cluster, swpentry.val);
- if (!desc) {
- rcu_read_unlock();
+ if (type != VSWAP_ZSWAP)
return NULL;
- }
- old = desc->zswap_entry;
- desc->zswap_entry = NULL;
- spin_unlock(&cluster->lock);
- rcu_read_unlock();
-
- return old;
+ return zswap_entry;
}
bool zswap_empty(swp_entry_t swpentry)
diff --git a/mm/zswap.c b/mm/zswap.c
index e46349f9c90bb..c5e1d252cb463 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -991,8 +991,9 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
{
struct folio *folio;
struct mempolicy *mpol;
- bool folio_was_allocated;
+ bool folio_was_allocated, phys_swap_alloced = false;
struct swap_info_struct *si;
+ struct zswap_entry *new_entry = NULL;
int ret = 0;
/* try to allocate swap cache folio */
@@ -1027,18 +1028,23 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* old compressed data. Only when this is successful can the entry
* be dereferenced.
*/
- if (entry != zswap_entry_load(swpentry)) {
+ new_entry = zswap_entry_load(swpentry);
+ if (entry != new_entry) {
ret = -ENOMEM;
goto out;
}
+ if (!vswap_alloc_swap_slot(folio)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ phys_swap_alloced = true;
+
if (!zswap_decompress(entry, folio)) {
ret = -EIO;
goto out;
}
- zswap_entry_erase(swpentry);
-
count_vm_event(ZSWPWB);
if (entry->objcg)
count_objcg_events(entry->objcg, ZSWPWB, 1);
@@ -1056,6 +1062,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
out:
if (ret && ret != -EEXIST) {
+ if (phys_swap_alloced)
+ zswap_entry_store(swpentry, new_entry);
swap_cache_del_folio(folio);
folio_unlock(folio);
}
@@ -1401,7 +1409,7 @@ static bool zswap_store_page(struct page *page,
struct zswap_pool *pool)
{
swp_entry_t page_swpentry = page_swap_entry(page);
- struct zswap_entry *entry, *old;
+ struct zswap_entry *entry;
/* allocate entry */
entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
@@ -1413,15 +1421,12 @@ static bool zswap_store_page(struct page *page,
if (!zswap_compress(page, entry, pool))
goto compress_failed;
- old = zswap_entry_store(page_swpentry, entry);
-
/*
* We may have had an existing entry that became stale when
* the folio was redirtied and now the new version is being
- * swapped out. Get rid of the old.
+ * swapped out. zswap_entry_store() will get rid of the old.
*/
- if (old)
- zswap_entry_free(old);
+ zswap_entry_store(page_swpentry, entry);
/*
* The entry is successfully compressed and stored in the tree, there is
@@ -1533,18 +1538,13 @@ bool zswap_store(struct folio *folio)
* the possibly stale entries which were previously stored at the
* offsets corresponding to each page of the folio. Otherwise,
* writeback could overwrite the new data in the swapfile.
+ *
+ * The only exception is if we still have a full contiguous
+ * range of physical swap slots backing the folio. Keep them for
+ * fallback disk swapping.
*/
- if (!ret) {
- unsigned type = swp_type(swp);
- pgoff_t offset = swp_offset(swp);
- struct zswap_entry *entry;
-
- for (index = 0; index < nr_pages; ++index) {
- entry = zswap_entry_erase(swp_entry(type, offset + index));
- if (entry)
- zswap_entry_free(entry);
- }
- }
+ if (!ret && !vswap_swapfile_backed(swp, nr_pages))
+ vswap_store_folio(swp, folio);
return ret;
}
@@ -1619,8 +1619,7 @@ int zswap_load(struct folio *folio)
*/
if (swapcache) {
folio_mark_dirty(folio);
- zswap_entry_erase(swp);
- zswap_entry_free(entry);
+ vswap_store_folio(swp, folio);
}
folio_unlock(folio);
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
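To make the backing-state rule in the patch above concrete, here is a small,
self-contained userspace sketch. The names (backing_type, vdesc,
check_backing) are illustrative and are not the kernel's identifiers, and the
per-device check on swapfile-backed slots is omitted; only the invariant it
models comes from the patch: a range of virtual slots is handled as a unit
only if every slot has the same backing type, and swapfile-backed slots must
additionally be physically contiguous.

/*
 * Simplified, userspace-only model of the backing-state check.
 * Not kernel code - names and layout are made up for illustration.
 */
#include <stdio.h>

enum backing_type { B_SWAPFILE, B_ZERO, B_ZSWAP, B_FOLIO };

struct vdesc {
	enum backing_type type;
	unsigned long phys_offset;	/* only meaningful for B_SWAPFILE */
	void *folio;			/* only meaningful for B_FOLIO */
};

/* How many descriptors, starting at desc[0], form a uniform range? */
static int check_backing(const struct vdesc *desc, int nr)
{
	int i;

	if (nr <= 0)
		return 0;

	for (i = 1; i < nr; i++) {
		if (desc[i].type != desc[0].type)
			break;
		if (desc[0].type == B_SWAPFILE &&
		    desc[i].phys_offset != desc[0].phys_offset + i)
			break;
		if (desc[0].type == B_FOLIO && desc[i].folio != desc[0].folio)
			break;
	}
	return i;
}

int main(void)
{
	struct vdesc range[3] = {
		{ .type = B_SWAPFILE, .phys_offset = 100 },
		{ .type = B_SWAPFILE, .phys_offset = 101 },
		{ .type = B_ZSWAP },	/* mixed backend ends the range */
	};

	printf("uniform prefix: %d of 3\n", check_backing(range, 3));
	return 0;
}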
* [PATCH v3 15/20] zswap: do not start zswap shrinker if there are no physical swap slots
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (13 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 16/20] swap: do not unnecessarily pin readahead swap entries Nhat Pham
` (6 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
When swap is virtualized, we no longer pre-allocate a slot on the
swapfile for each zswap entry. Do not start the zswap shrinker if there
are no physical swap slots available.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/zswap.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/zswap.c b/mm/zswap.c
index c5e1d252cb463..9d1822753d321 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1211,6 +1211,14 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
return 0;
+ /*
+ * When swap is virtualized, we do not have any swap slots on the
+ * swapfile preallocated for zswap objects. If there is no slot
+ * available, we cannot write back and should just bail out here.
+ */
+ if (!get_nr_swap_pages())
+ return 0;
+
/*
* The shrinker resumes swap writeback, which will enter block
* and may enter fs. XXX: Harmonize with vmscan.c __GFP_FS
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 16/20] swap: do not unnecessarily pin readahead swap entries
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (14 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 15/20] zswap: do not start zswap shrinker if there are no physical swap slots Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 17/20] swapfile: remove zeromap bitmap Nhat Pham
` (5 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
When we perform swap readahead, the target entry is already pinned by
the caller. There is no need to pin swap entries in the readahead window
that belong to the same virtual swap cluster as the target swap entry.
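The cluster test itself is just a comparison of cluster indices, assuming
VSWAP_CLUSTER_IDX() is the entry value shifted down by VSWAP_CLUSTER_SHIFT. A
minimal sketch of that idea (the shift value of 9 below is made up for
illustration; the kernel derives VSWAP_CLUSTER_SHIFT from HPAGE_PMD_ORDER,
and example_same_cluster is not a kernel helper):

#include <stdio.h>

/* Illustrative only: 9 is an example value, not the kernel's shift. */
#define EXAMPLE_CLUSTER_SHIFT 9

static int example_same_cluster(unsigned long a, unsigned long b)
{
	return (a >> EXAMPLE_CLUSTER_SHIFT) == (b >> EXAMPLE_CLUSTER_SHIFT);
}

int main(void)
{
	/* 512 and 1000 share a cluster; 512 and 1024 do not */
	printf("%d %d\n", example_same_cluster(512, 1000),
	       example_same_cluster(512, 1024));
	return 0;
}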
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/swap.h | 1 +
mm/swap_state.c | 22 +++++++++-------------
mm/vswap.c | 10 ++++++++++
3 files changed, 20 insertions(+), 13 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index d41e6a0e70753..08a6369a6dfad 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -213,6 +213,7 @@ void swap_cache_lock(swp_entry_t entry);
void swap_cache_unlock(swp_entry_t entry);
void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
unsigned long vswap, int nr);
+bool vswap_same_cluster(swp_entry_t entry1, swp_entry_t entry2);
static inline struct address_space *swap_address_space(swp_entry_t entry)
{
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ad80bf098b63f..e8e0905c7723f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -553,22 +553,18 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pte_unmap(pte);
pte = NULL;
/*
- * Readahead entry may come from a device that we are not
- * holding a reference to, try to grab a reference, or skip.
- *
- * XXX: for now, always try to pin the swap entries in the
- * readahead window to avoid the annoying conversion to physical
- * swap slots. Once we move all swap metadata to virtual swap
- * layer, we can simply compare the clusters of the target
- * swap entry and the current swap entry, and pin the latter
- * swap entry's cluster if it differ from the former's.
+ * The target entry is already pinned - if the readahead entry
+ * belongs to the same cluster, it's already protected.
*/
- swapoff_locked = tryget_swap_entry(entry, &si);
- if (!swapoff_locked)
- continue;
+ if (!vswap_same_cluster(entry, targ_entry)) {
+ swapoff_locked = tryget_swap_entry(entry, &si);
+ if (!swapoff_locked)
+ continue;
+ }
folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
- put_swap_entry(entry, si);
+ if (swapoff_locked)
+ put_swap_entry(entry, si);
if (!folio)
continue;
if (page_allocated) {
diff --git a/mm/vswap.c b/mm/vswap.c
index fb6179ce3ace7..7563107eb8eee 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -1503,6 +1503,16 @@ void put_swap_entry(swp_entry_t entry, struct swap_info_struct *si)
rcu_read_unlock();
}
+/*
+ * Check if two virtual swap entries belong to the same vswap cluster.
+ * Useful for optimizing readahead when entries in the same cluster
+ * share protection from a pinned target entry.
+ */
+bool vswap_same_cluster(swp_entry_t entry1, swp_entry_t entry2)
+{
+ return VSWAP_CLUSTER_IDX(entry1) == VSWAP_CLUSTER_IDX(entry2);
+}
+
static int vswap_cpu_dead(unsigned int cpu)
{
struct percpu_vswap_cluster *percpu_cluster;
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 17/20] swapfile: remove zeromap bitmap
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (15 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 16/20] swap: do not unnecesarily pin readahead swap entries Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 18/20] memcg: swap: only charge physical swap slots Nhat Pham
` (4 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Zero swap entries are now treated as a separate, decoupled backend in
the virtual swap layer. The zeromap bitmap of the physical swapfile is
no longer used - remove it. This does not introduce any behavioral
change, and saves 1 bit per swap page of memory overhead.
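For a rough sense of scale: a 32 GB swapfile holds 32 GiB / 4 KiB = 8M slots,
so the bitmap removed here amounts to 8M bits, i.e. about 1 MiB of memory per
32 GB of configured swap.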
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 1 -
mm/swapfile.c | 30 +++++-------------------------
2 files changed, 5 insertions(+), 26 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 54df972608047..9cd45eab313f8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,7 +260,6 @@ struct swap_info_struct {
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
- unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
struct list_head full_clusters; /* full clusters list */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1aa29dd220f9a..e1cb01b821ff3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2317,8 +2317,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
static void setup_swap_info(struct swap_info_struct *si, int prio,
unsigned char *swap_map,
- struct swap_cluster_info *cluster_info,
- unsigned long *zeromap)
+ struct swap_cluster_info *cluster_info)
{
si->prio = prio;
/*
@@ -2329,7 +2328,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
si->avail_list.prio = -si->prio;
si->swap_map = swap_map;
si->cluster_info = cluster_info;
- si->zeromap = zeromap;
}
static void _enable_swap_info(struct swap_info_struct *si)
@@ -2347,12 +2345,11 @@ static void _enable_swap_info(struct swap_info_struct *si)
static void enable_swap_info(struct swap_info_struct *si, int prio,
unsigned char *swap_map,
- struct swap_cluster_info *cluster_info,
- unsigned long *zeromap)
+ struct swap_cluster_info *cluster_info)
{
spin_lock(&swap_lock);
spin_lock(&si->lock);
- setup_swap_info(si, prio, swap_map, cluster_info, zeromap);
+ setup_swap_info(si, prio, swap_map, cluster_info);
spin_unlock(&si->lock);
spin_unlock(&swap_lock);
/*
@@ -2370,7 +2367,7 @@ static void reinsert_swap_info(struct swap_info_struct *si)
{
spin_lock(&swap_lock);
spin_lock(&si->lock);
- setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap);
+ setup_swap_info(si, si->prio, si->swap_map, si->cluster_info);
_enable_swap_info(si);
spin_unlock(&si->lock);
spin_unlock(&swap_lock);
@@ -2441,7 +2438,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
{
struct swap_info_struct *p = NULL;
unsigned char *swap_map;
- unsigned long *zeromap;
struct swap_cluster_info *cluster_info;
struct file *swap_file, *victim;
struct address_space *mapping;
@@ -2536,8 +2532,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->swap_file = NULL;
swap_map = p->swap_map;
p->swap_map = NULL;
- zeromap = p->zeromap;
- p->zeromap = NULL;
maxpages = p->max;
cluster_info = p->cluster_info;
p->max = 0;
@@ -2549,7 +2543,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
kfree(p->global_cluster);
p->global_cluster = NULL;
vfree(swap_map);
- kvfree(zeromap);
free_cluster_info(cluster_info, maxpages);
inode = mapping->host;
@@ -3013,7 +3006,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
sector_t span;
unsigned long maxpages;
unsigned char *swap_map = NULL;
- unsigned long *zeromap = NULL;
struct swap_cluster_info *cluster_info = NULL;
struct folio *folio = NULL;
struct inode *inode = NULL;
@@ -3119,17 +3111,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap_unlock_inode;
- /*
- * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
- * be above MAX_PAGE_ORDER incase of a large swap file.
- */
- zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long),
- GFP_KERNEL | __GFP_ZERO);
- if (!zeromap) {
- error = -ENOMEM;
- goto bad_swap_unlock_inode;
- }
-
if (si->bdev && bdev_stable_writes(si->bdev))
si->flags |= SWP_STABLE_WRITES;
@@ -3196,7 +3177,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
prio = DEF_SWAP_PRIO;
if (swap_flags & SWAP_FLAG_PREFER)
prio = swap_flags & SWAP_FLAG_PRIO_MASK;
- enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
+ enable_swap_info(si, prio, swap_map, cluster_info);
pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n",
K(si->pages), name->name, si->prio, nr_extents,
@@ -3224,7 +3205,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
si->flags = 0;
spin_unlock(&swap_lock);
vfree(swap_map);
- kvfree(zeromap);
if (cluster_info)
free_cluster_info(cluster_info, maxpages);
if (inced_nr_rotate_swap)
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 18/20] memcg: swap: only charge physical swap slots
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (16 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 17/20] swapfile: remove zeromap bitmap Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-09 2:01 ` kernel test robot
2026-02-09 2:12 ` kernel test robot
2026-02-08 21:58 ` [PATCH v3 19/20] swap: simplify swapoff using virtual swap Nhat Pham
` (3 subsequent siblings)
21 siblings, 2 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Now that zswap and the zero-filled swap page optimization no longer
take up any physical swap space, we should not charge towards the swap
usage and limits of the memcg in these cases. We will only record the
memcg id on virtual swap slot allocation, and defer physical swap
charging (i.e. towards memory.swap.current) until the virtual swap slot
is backed by an actual physical swap slot (on zswap store failure
fallback or zswap writeback).
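In other words, the swap counters only move when a physical slot is actually
attached. A compact userspace model of that rule follows; all names here
(cgroup_model, attach_physical_slot) are made up for illustration and are not
kernel APIs - only the behavior they model reflects this patch.

/*
 * Userspace model of deferred swap charging. Not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct cgroup_model {
	long swap_current;	/* stands in for memory.swap.current */
	long swap_max;		/* stands in for memory.swap.max */
};

static bool attach_physical_slot(struct cgroup_model *cg, long nr_pages)
{
	if (cg->swap_current + nr_pages > cg->swap_max)
		return false;		/* analogous to MEMCG_SWAP_MAX */
	cg->swap_current += nr_pages;	/* charged only on physical backing */
	return true;
}

int main(void)
{
	struct cgroup_model cg = { .swap_current = 0, .swap_max = 8 };

	/* zswap store or zero-filled page: virtual slot only, no charge */
	printf("after zswap store: %ld\n", cg.swap_current);

	/* zswap writeback / store-failure fallback: charge on attach */
	if (attach_physical_slot(&cg, 4))
		printf("after falling back to disk: %ld\n", cg.swap_current);

	return 0;
}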
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 16 +++++++++
mm/memcontrol-v1.c | 6 ++++
mm/memcontrol.c | 83 ++++++++++++++++++++++++++++++++------------
mm/vswap.c | 39 +++++++++------------
4 files changed, 98 insertions(+), 46 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9cd45eab313f8..a30d382fb5ee1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -613,6 +613,22 @@ static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
#endif
#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
+void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry);
+static inline void mem_cgroup_record_swap(struct folio *folio,
+ swp_entry_t entry)
+{
+ if (!mem_cgroup_disabled())
+ __mem_cgroup_record_swap(folio, entry);
+}
+
+void __mem_cgroup_clear_swap(swp_entry_t entry, unsigned int nr_pages);
+static inline void mem_cgroup_clear_swap(swp_entry_t entry,
+ unsigned int nr_pages)
+{
+ if (!mem_cgroup_disabled())
+ __mem_cgroup_clear_swap(entry, nr_pages);
+}
+
int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry);
static inline int mem_cgroup_try_charge_swap(struct folio *folio,
swp_entry_t entry)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff7426..4580a034dcf72 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -680,6 +680,12 @@ void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages)
* memory+swap charge, drop the swap entry duplicate.
*/
mem_cgroup_uncharge_swap(entry, nr_pages);
+
+ /*
+ * Clear the cgroup association now to prevent double memsw
+ * uncharging when the backends are released later.
+ */
+ mem_cgroup_clear_swap(entry, nr_pages);
}
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ba5811e7edba..50be8066bebec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5172,6 +5172,49 @@ int __init mem_cgroup_init(void)
}
#ifdef CONFIG_SWAP
+/**
+ * __mem_cgroup_record_swap - record the folio's cgroup for the swap entries.
+ * @folio: folio being swapped out.
+ * @entry: the first swap entry in the range.
+ */
+void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry)
+{
+ unsigned int nr_pages = folio_nr_pages(folio);
+ struct mem_cgroup *memcg;
+
+ /* Recording will be done by memcg1_swapout(). */
+ if (do_memsw_account())
+ return;
+
+ memcg = folio_memcg(folio);
+
+ VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
+ if (!memcg)
+ return;
+
+ memcg = mem_cgroup_id_get_online(memcg);
+ if (nr_pages > 1)
+ mem_cgroup_id_get_many(memcg, nr_pages - 1);
+ swap_cgroup_record(folio, mem_cgroup_id(memcg), entry);
+}
+
+/**
+ * __mem_cgroup_clear_swap - clear cgroup information of the swap entries.
+ * @entry: the first swap entry in the range.
+ * @nr_pages: the number of swap entries in the range.
+ */
+void __mem_cgroup_clear_swap(swp_entry_t entry, unsigned int nr_pages)
+{
+ unsigned short id = swap_cgroup_clear(entry, nr_pages);
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg)
+ mem_cgroup_id_put_many(memcg, nr_pages);
+ rcu_read_unlock();
+}
+
/**
* __mem_cgroup_try_charge_swap - try charging swap space for a folio
* @folio: folio being added to swap
@@ -5190,34 +5233,24 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
if (do_memsw_account())
return 0;
- memcg = folio_memcg(folio);
-
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
- return 0;
-
- if (!entry.val) {
- memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
- return 0;
- }
-
- memcg = mem_cgroup_id_get_online(memcg);
+ /*
+ * We already record the cgroup on virtual swap allocation.
+ * Note that the virtual swap slot holds a reference to memcg,
+ * so this lookup should be safe.
+ */
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(lookup_swap_cgroup_id(entry));
+ rcu_read_unlock();
if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
memcg_memory_event(memcg, MEMCG_SWAP_MAX);
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
- mem_cgroup_id_put(memcg);
return -ENOMEM;
}
- /* Get references for the tail pages, too */
- if (nr_pages > 1)
- mem_cgroup_id_get_many(memcg, nr_pages - 1);
mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
- swap_cgroup_record(folio, mem_cgroup_id(memcg), entry);
-
return 0;
}
@@ -5231,7 +5264,8 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
struct mem_cgroup *memcg;
unsigned short id;
- id = swap_cgroup_clear(entry, nr_pages);
+ id = lookup_swap_cgroup_id(entry);
+
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg) {
@@ -5242,7 +5276,6 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
page_counter_uncharge(&memcg->swap, nr_pages);
}
mod_memcg_state(memcg, MEMCG_SWAP, -nr_pages);
- mem_cgroup_id_put_many(memcg, nr_pages);
}
rcu_read_unlock();
}
@@ -5251,14 +5284,18 @@ static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg);
long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
- long nr_swap_pages, nr_zswap_pages = 0;
+ long nr_swap_pages;
if (zswap_is_enabled() && (mem_cgroup_disabled() || do_memsw_account() ||
mem_cgroup_may_zswap(memcg))) {
- nr_zswap_pages = PAGE_COUNTER_MAX;
+ /*
+ * No need to check swap cgroup limits, since zswap is not charged
+ * towards swap consumption.
+ */
+ return PAGE_COUNTER_MAX;
}
- nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages());
+ nr_swap_pages = get_nr_swap_pages();
if (mem_cgroup_disabled() || do_memsw_account())
return nr_swap_pages;
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
diff --git a/mm/vswap.c b/mm/vswap.c
index 7563107eb8eee..2a071d5ae173c 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -543,6 +543,7 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
unsigned long flush_nr, phys_swap_start = 0, phys_swap_end = 0;
+ unsigned long phys_swap_released = 0;
unsigned int phys_swap_type = 0;
bool need_flushing_phys_swap = false;
swp_slot_t flush_slot;
@@ -572,6 +573,7 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
if (desc->type == VSWAP_ZSWAP && desc->zswap_entry) {
zswap_entry_free(desc->zswap_entry);
} else if (desc->type == VSWAP_SWAPFILE) {
+ phys_swap_released++;
if (!phys_swap_start) {
/* start a new contiguous range of phys swap */
phys_swap_start = swp_slot_offset(desc->slot);
@@ -602,6 +604,9 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
flush_nr = phys_swap_end - phys_swap_start;
swap_slot_free_nr(flush_slot, flush_nr);
}
+
+ if (phys_swap_released)
+ mem_cgroup_uncharge_swap(entry, phys_swap_released);
}
/*
@@ -629,7 +634,7 @@ static void vswap_free(struct vswap_cluster *cluster, struct swp_desc *desc,
spin_unlock(&cluster->lock);
release_backing(entry, 1);
- mem_cgroup_uncharge_swap(entry, 1);
+ mem_cgroup_clear_swap(entry, 1);
/* erase forward mapping and release the virtual slot for reallocation */
spin_lock(&cluster->lock);
@@ -644,9 +649,6 @@ static void vswap_free(struct vswap_cluster *cluster, struct swp_desc *desc,
*/
int folio_alloc_swap(struct folio *folio)
{
- struct vswap_cluster *cluster = NULL;
- int i, nr = folio_nr_pages(folio);
- struct swp_desc *desc;
swp_entry_t entry;
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -656,25 +658,7 @@ int folio_alloc_swap(struct folio *folio)
if (!entry.val)
return -ENOMEM;
- /*
- * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
- * swap slots allocation. This will be changed soon - we will only charge on
- * physical swap slots allocation.
- */
- if (mem_cgroup_try_charge_swap(folio, entry)) {
- rcu_read_lock();
- for (i = 0; i < nr; i++) {
- desc = vswap_iter(&cluster, entry.val + i);
- VM_WARN_ON(!desc);
- vswap_free(cluster, desc, (swp_entry_t){ entry.val + i });
- }
- spin_unlock(&cluster->lock);
- rcu_read_unlock();
- atomic_add(nr, &vswap_alloc_reject);
- entry.val = 0;
- return -ENOMEM;
- }
-
+ mem_cgroup_record_swap(folio, entry);
swap_cache_add_folio(folio, entry, NULL);
return 0;
@@ -716,6 +700,15 @@ bool vswap_alloc_swap_slot(struct folio *folio)
if (!slot.val)
return false;
+ if (mem_cgroup_try_charge_swap(folio, entry)) {
+ /*
+ * We have not updated the backing type of the virtual swap slot.
+ * Simply free up the physical swap slots here!
+ */
+ swap_slot_free_nr(slot, nr);
+ return false;
+ }
+
/* establish the vrtual <-> physical swap slots linkages. */
si = __swap_slot_to_info(slot);
ci = swap_cluster_lock(si, swp_slot_offset(slot));
--
2.47.3
* [PATCH v3 19/20] swap: simplify swapoff using virtual swap
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (17 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 18/20] memcg: swap: only charge physical swap slots Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 21:58 ` [PATCH v3 20/20] swapfile: replace the swap map with bitmaps Nhat Pham
` (2 subsequent siblings)
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
This patch presents the second application of the virtual swap design -
simplifying and optimizing swapoff.
With virtual swap slots stored in page table entries and used as indices
to various swap-related data structures, we no longer have to perform a
page table walk in swapoff. Simply iterate through all the allocated
swap slots on the swapfile, find their corresponding virtual swap slots,
and fault them in.
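Concretely, the core of the new try_to_unuse() reduces to something like
the sketch below. This is a simplified rendition of the code added further
down in this patch, reusing its helpers and locals (si, type, mpol, splug);
the batched read-ahead round, error handling, retry/backoff and signal
checks are elided for brevity:

	/*
	 * Sketch: walk the allocated physical slots of the swapfile,
	 * resolve each one to its virtual swap slot, pull the data into
	 * the swap cache, and detach the physical slot. No page table
	 * walk is needed anywhere.
	 */
	for (offset = find_next_to_unuse(si, 0); offset;
	     offset = find_next_to_unuse(si, offset)) {
		swp_slot_t slot = swp_slot(type, offset);
		swp_entry_t entry = swp_slot_to_swp_entry(slot); /* phys -> virt */
		struct folio *folio;

		if (!entry.val)
			continue;	/* slot is already being released */

		folio = pagein(entry, &splug, mpol); /* read into swap cache */
		if (!folio)
			continue;	/* error/retry handling elided */

		folio_lock(folio);
		folio_wait_writeback(folio);
		vswap_store_folio(entry, folio); /* back the entry by the folio */
		folio_mark_dirty(folio);
		folio_unlock(folio);
		folio_put(folio);
	}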
This is significantly cleaner, as well as slightly more performant,
especially when there are a lot of unrelated VMAs (since the old swapoff
code would have to traverse through all of them).
In a simple benchmark, in which we swapoff a 32 GB swapfile that is 50%
full, and in which there is a process that maps a 128GB file into
memory:
Baseline:
sys: 11.48s
New Design:
sys: 9.96s
Disregarding the real time reduction (which is mostly due to more IO
asynchrony), the new design reduces the kernel CPU time by about 13%.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/shmem_fs.h | 7 +-
mm/shmem.c | 184 +--------------
mm/swapfile.c | 474 +++++++++------------------------------
3 files changed, 113 insertions(+), 552 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index e2069b3179c41..bac6b6cafe89c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -41,17 +41,13 @@ struct shmem_inode_info {
unsigned long swapped; /* subtotal assigned to swap */
union {
struct offset_ctx dir_offsets; /* stable directory offsets */
- struct {
- struct list_head shrinklist; /* shrinkable hpage inodes */
- struct list_head swaplist; /* chain of maybes on swap */
- };
+ struct list_head shrinklist; /* shrinkable hpage inodes */
};
struct timespec64 i_crtime; /* file creation time */
struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
pgoff_t fallocend; /* highest fallocate endindex */
unsigned int fsflags; /* for FS_IOC_[SG]ETFLAGS */
- atomic_t stop_eviction; /* hold when working on inode */
#ifdef CONFIG_TMPFS_QUOTA
struct dquot __rcu *i_dquot[MAXQUOTAS];
#endif
@@ -127,7 +123,6 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
struct list_head *folio_list);
void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
-int shmem_unuse(unsigned int type);
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
unsigned long shmem_allowable_huge_orders(struct inode *inode,
diff --git a/mm/shmem.c b/mm/shmem.c
index 3a346cca114ab..61790752bdf6d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -290,9 +290,6 @@ bool vma_is_shmem(const struct vm_area_struct *vma)
return vma_is_anon_shmem(vma) || vma->vm_ops == &shmem_vm_ops;
}
-static LIST_HEAD(shmem_swaplist);
-static DEFINE_SPINLOCK(shmem_swaplist_lock);
-
#ifdef CONFIG_TMPFS_QUOTA
static int shmem_enable_quotas(struct super_block *sb,
@@ -1413,16 +1410,6 @@ static void shmem_evict_inode(struct inode *inode)
}
spin_unlock(&sbinfo->shrinklist_lock);
}
- while (!list_empty(&info->swaplist)) {
- /* Wait while shmem_unuse() is scanning this inode... */
- wait_var_event(&info->stop_eviction,
- !atomic_read(&info->stop_eviction));
- spin_lock(&shmem_swaplist_lock);
- /* ...but beware of the race if we peeked too early */
- if (!atomic_read(&info->stop_eviction))
- list_del_init(&info->swaplist);
- spin_unlock(&shmem_swaplist_lock);
- }
}
simple_xattrs_free(&info->xattrs, sbinfo->max_inodes ? &freed : NULL);
@@ -1435,153 +1422,6 @@ static void shmem_evict_inode(struct inode *inode)
#endif
}
-static unsigned int shmem_find_swap_entries(struct address_space *mapping,
- pgoff_t start, struct folio_batch *fbatch,
- pgoff_t *indices, unsigned int type)
-{
- XA_STATE(xas, &mapping->i_pages, start);
- struct folio *folio;
- swp_entry_t entry;
- swp_slot_t slot;
-
- rcu_read_lock();
- xas_for_each(&xas, folio, ULONG_MAX) {
- if (xas_retry(&xas, folio))
- continue;
-
- if (!xa_is_value(folio))
- continue;
-
- entry = radix_to_swp_entry(folio);
- slot = swp_entry_to_swp_slot(entry);
-
- /*
- * swapin error entries can be found in the mapping. But they're
- * deliberately ignored here as we've done everything we can do.
- */
- if (!slot.val || swp_slot_type(slot) != type)
- continue;
-
- indices[folio_batch_count(fbatch)] = xas.xa_index;
- if (!folio_batch_add(fbatch, folio))
- break;
-
- if (need_resched()) {
- xas_pause(&xas);
- cond_resched_rcu();
- }
- }
- rcu_read_unlock();
-
- return folio_batch_count(fbatch);
-}
-
-/*
- * Move the swapped pages for an inode to page cache. Returns the count
- * of pages swapped in, or the error in case of failure.
- */
-static int shmem_unuse_swap_entries(struct inode *inode,
- struct folio_batch *fbatch, pgoff_t *indices)
-{
- int i = 0;
- int ret = 0;
- int error = 0;
- struct address_space *mapping = inode->i_mapping;
-
- for (i = 0; i < folio_batch_count(fbatch); i++) {
- struct folio *folio = fbatch->folios[i];
-
- error = shmem_swapin_folio(inode, indices[i], &folio, SGP_CACHE,
- mapping_gfp_mask(mapping), NULL, NULL);
- if (error == 0) {
- folio_unlock(folio);
- folio_put(folio);
- ret++;
- }
- if (error == -ENOMEM)
- break;
- error = 0;
- }
- return error ? error : ret;
-}
-
-/*
- * If swap found in inode, free it and move page from swapcache to filecache.
- */
-static int shmem_unuse_inode(struct inode *inode, unsigned int type)
-{
- struct address_space *mapping = inode->i_mapping;
- pgoff_t start = 0;
- struct folio_batch fbatch;
- pgoff_t indices[PAGEVEC_SIZE];
- int ret = 0;
-
- do {
- folio_batch_init(&fbatch);
- if (!shmem_find_swap_entries(mapping, start, &fbatch,
- indices, type)) {
- ret = 0;
- break;
- }
-
- ret = shmem_unuse_swap_entries(inode, &fbatch, indices);
- if (ret < 0)
- break;
-
- start = indices[folio_batch_count(&fbatch) - 1];
- } while (true);
-
- return ret;
-}
-
-/*
- * Read all the shared memory data that resides in the swap
- * device 'type' back into memory, so the swap device can be
- * unused.
- */
-int shmem_unuse(unsigned int type)
-{
- struct shmem_inode_info *info, *next;
- int error = 0;
-
- if (list_empty(&shmem_swaplist))
- return 0;
-
- spin_lock(&shmem_swaplist_lock);
-start_over:
- list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
- if (!info->swapped) {
- list_del_init(&info->swaplist);
- continue;
- }
- /*
- * Drop the swaplist mutex while searching the inode for swap;
- * but before doing so, make sure shmem_evict_inode() will not
- * remove placeholder inode from swaplist, nor let it be freed
- * (igrab() would protect from unlink, but not from unmount).
- */
- atomic_inc(&info->stop_eviction);
- spin_unlock(&shmem_swaplist_lock);
-
- error = shmem_unuse_inode(&info->vfs_inode, type);
- cond_resched();
-
- spin_lock(&shmem_swaplist_lock);
- if (atomic_dec_and_test(&info->stop_eviction))
- wake_up_var(&info->stop_eviction);
- if (error)
- break;
- if (list_empty(&info->swaplist))
- goto start_over;
- next = list_next_entry(info, swaplist);
- if (!info->swapped)
- list_del_init(&info->swaplist);
- }
- spin_unlock(&shmem_swaplist_lock);
-
- return error;
-}
-
/**
* shmem_writeout - Write the folio to swap
* @folio: The folio to write
@@ -1668,24 +1508,9 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
}
if (!folio_alloc_swap(folio)) {
- bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
int error;
- /*
- * Add inode to shmem_unuse()'s list of swapped-out inodes,
- * if it's not already there. Do it now before the folio is
- * removed from page cache, when its pagelock no longer
- * protects the inode from eviction. And do it now, after
- * we've incremented swapped, because shmem_unuse() will
- * prune a !swapped inode from the swaplist.
- */
- if (first_swapped) {
- spin_lock(&shmem_swaplist_lock);
- if (list_empty(&info->swaplist))
- list_add(&info->swaplist, &shmem_swaplist);
- spin_unlock(&shmem_swaplist_lock);
- }
-
+ shmem_recalc_inode(inode, 0, nr_pages);
swap_shmem_alloc(folio->swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
@@ -3106,7 +2931,6 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
- atomic_set(&info->stop_eviction, 0);
info->seals = F_SEAL_SEAL;
info->flags = (flags & VM_NORESERVE) ? SHMEM_F_NORESERVE : 0;
info->i_crtime = inode_get_mtime(inode);
@@ -3115,7 +2939,6 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
if (info->fsflags)
shmem_set_inode_flags(inode, info->fsflags, NULL);
INIT_LIST_HEAD(&info->shrinklist);
- INIT_LIST_HEAD(&info->swaplist);
simple_xattrs_init(&info->xattrs);
cache_no_acl(inode);
if (sbinfo->noswap)
@@ -5785,11 +5608,6 @@ void __init shmem_init(void)
BUG_ON(IS_ERR(shm_mnt));
}
-int shmem_unuse(unsigned int type)
-{
- return 0;
-}
-
int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
{
return 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e1cb01b821ff3..9478707ce3ffa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1738,300 +1738,12 @@ unsigned int count_swap_pages(int type, int free)
}
#endif /* CONFIG_HIBERNATION */
-static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
+static bool swap_slot_allocated(struct swap_info_struct *si,
+ unsigned long offset)
{
- return pte_same(pte_swp_clear_flags(pte), swp_pte);
-}
-
-/*
- * No need to decide whether this PTE shares the swap entry with others,
- * just let do_wp_page work it out if a write is requested later - to
- * force COW, vm_page_prot omits write permission from any private vma.
- */
-static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, swp_entry_t entry, struct folio *folio)
-{
- struct page *page;
- struct folio *swapcache;
- spinlock_t *ptl;
- pte_t *pte, new_pte, old_pte;
- bool hwpoisoned = false;
- int ret = 1;
-
- /*
- * If the folio is removed from swap cache by others, continue to
- * unuse other PTEs. try_to_unuse may try again if we missed this one.
- */
- if (!folio_matches_swap_entry(folio, entry))
- return 0;
-
- swapcache = folio;
- folio = ksm_might_need_to_copy(folio, vma, addr);
- if (unlikely(!folio))
- return -ENOMEM;
- else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
- hwpoisoned = true;
- folio = swapcache;
- }
-
- page = folio_file_page(folio, swp_offset(entry));
- if (PageHWPoison(page))
- hwpoisoned = true;
-
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- if (unlikely(!pte || !pte_same_as_swp(ptep_get(pte),
- swp_entry_to_pte(entry)))) {
- ret = 0;
- goto out;
- }
-
- old_pte = ptep_get(pte);
-
- if (unlikely(hwpoisoned || !folio_test_uptodate(folio))) {
- swp_entry_t swp_entry;
-
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
- if (hwpoisoned) {
- swp_entry = make_hwpoison_entry(page);
- } else {
- swp_entry = make_poisoned_swp_entry();
- }
- new_pte = swp_entry_to_pte(swp_entry);
- ret = 0;
- goto setpte;
- }
-
- /*
- * Some architectures may have to restore extra metadata to the page
- * when reading from swap. This metadata may be indexed by swap entry
- * so this must be called before swap_free().
- */
- arch_swap_restore(folio_swap(entry, folio), folio);
-
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- folio_get(folio);
- if (folio == swapcache) {
- rmap_t rmap_flags = RMAP_NONE;
-
- /*
- * See do_swap_page(): writeback would be problematic.
- * However, we do a folio_wait_writeback() just before this
- * call and have the folio locked.
- */
- VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
- if (pte_swp_exclusive(old_pte))
- rmap_flags |= RMAP_EXCLUSIVE;
- /*
- * We currently only expect small !anon folios, which are either
- * fully exclusive or fully shared. If we ever get large folios
- * here, we have to be careful.
- */
- if (!folio_test_anon(folio)) {
- VM_WARN_ON_ONCE(folio_test_large(folio));
- VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
- folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
- } else {
- folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
- }
- } else { /* ksm created a completely new copy */
- folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
- folio_add_lru_vma(folio, vma);
- }
- new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
- if (pte_swp_soft_dirty(old_pte))
- new_pte = pte_mksoft_dirty(new_pte);
- if (pte_swp_uffd_wp(old_pte))
- new_pte = pte_mkuffd_wp(new_pte);
-setpte:
- set_pte_at(vma->vm_mm, addr, pte, new_pte);
- swap_free(entry);
-out:
- if (pte)
- pte_unmap_unlock(pte, ptl);
- if (folio != swapcache) {
- folio_unlock(folio);
- folio_put(folio);
- }
- return ret;
-}
-
-static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end,
- unsigned int type)
-{
- pte_t *pte = NULL;
- struct swap_info_struct *si;
-
- si = swap_info[type];
- do {
- struct folio *folio;
- unsigned long offset;
- unsigned char swp_count;
- softleaf_t entry;
- swp_slot_t slot;
- int ret;
- pte_t ptent;
-
- if (!pte++) {
- pte = pte_offset_map(pmd, addr);
- if (!pte)
- break;
- }
-
- ptent = ptep_get_lockless(pte);
- entry = softleaf_from_pte(ptent);
-
- if (!softleaf_is_swap(entry))
- continue;
-
- slot = swp_entry_to_swp_slot(entry);
- if (swp_slot_type(slot) != type)
- continue;
-
- offset = swp_slot_offset(slot);
- pte_unmap(pte);
- pte = NULL;
-
- folio = swap_cache_get_folio(entry);
- if (!folio) {
- struct vm_fault vmf = {
- .vma = vma,
- .address = addr,
- .real_address = addr,
- .pmd = pmd,
- };
-
- folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- &vmf);
- }
- if (!folio) {
- swp_count = READ_ONCE(si->swap_map[offset]);
- if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
- continue;
- return -ENOMEM;
- }
-
- folio_lock(folio);
- folio_wait_writeback(folio);
- ret = unuse_pte(vma, pmd, addr, entry, folio);
- if (ret < 0) {
- folio_unlock(folio);
- folio_put(folio);
- return ret;
- }
-
- folio_free_swap(folio);
- folio_unlock(folio);
- folio_put(folio);
- } while (addr += PAGE_SIZE, addr != end);
-
- if (pte)
- pte_unmap(pte);
- return 0;
-}
-
-static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
- unsigned long addr, unsigned long end,
- unsigned int type)
-{
- pmd_t *pmd;
- unsigned long next;
- int ret;
-
- pmd = pmd_offset(pud, addr);
- do {
- cond_resched();
- next = pmd_addr_end(addr, end);
- ret = unuse_pte_range(vma, pmd, addr, next, type);
- if (ret)
- return ret;
- } while (pmd++, addr = next, addr != end);
- return 0;
-}
-
-static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
- unsigned long addr, unsigned long end,
- unsigned int type)
-{
- pud_t *pud;
- unsigned long next;
- int ret;
-
- pud = pud_offset(p4d, addr);
- do {
- next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
- continue;
- ret = unuse_pmd_range(vma, pud, addr, next, type);
- if (ret)
- return ret;
- } while (pud++, addr = next, addr != end);
- return 0;
-}
-
-static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
- unsigned long addr, unsigned long end,
- unsigned int type)
-{
- p4d_t *p4d;
- unsigned long next;
- int ret;
-
- p4d = p4d_offset(pgd, addr);
- do {
- next = p4d_addr_end(addr, end);
- if (p4d_none_or_clear_bad(p4d))
- continue;
- ret = unuse_pud_range(vma, p4d, addr, next, type);
- if (ret)
- return ret;
- } while (p4d++, addr = next, addr != end);
- return 0;
-}
-
-static int unuse_vma(struct vm_area_struct *vma, unsigned int type)
-{
- pgd_t *pgd;
- unsigned long addr, end, next;
- int ret;
-
- addr = vma->vm_start;
- end = vma->vm_end;
-
- pgd = pgd_offset(vma->vm_mm, addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- continue;
- ret = unuse_p4d_range(vma, pgd, addr, next, type);
- if (ret)
- return ret;
- } while (pgd++, addr = next, addr != end);
- return 0;
-}
+ unsigned char count = READ_ONCE(si->swap_map[offset]);
-static int unuse_mm(struct mm_struct *mm, unsigned int type)
-{
- struct vm_area_struct *vma;
- int ret = 0;
- VMA_ITERATOR(vmi, mm, 0);
-
- mmap_read_lock(mm);
- if (check_stable_address_space(mm))
- goto unlock;
- for_each_vma(vmi, vma) {
- if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
- ret = unuse_vma(vma, type);
- if (ret)
- break;
- }
-
- cond_resched();
- }
-unlock:
- mmap_read_unlock(mm);
- return ret;
+ return count && swap_count(count) != SWAP_MAP_BAD;
}
/*
@@ -2043,7 +1755,6 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
unsigned int prev)
{
unsigned int i;
- unsigned char count;
/*
* No need for swap_lock here: we're just looking
@@ -2052,8 +1763,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
* allocations from this area (while holding swap_lock).
*/
for (i = prev + 1; i < si->max; i++) {
- count = READ_ONCE(si->swap_map[i]);
- if (count && swap_count(count) != SWAP_MAP_BAD)
+ if (swap_slot_allocated(si, i))
break;
if ((i % LATENCY_LIMIT) == 0)
cond_resched();
@@ -2065,101 +1775,139 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
return i;
}
+#define for_each_allocated_offset(si, offset) \
+ while (swap_usage_in_pages(si) && \
+ !signal_pending(current) && \
+ (offset = find_next_to_unuse(si, offset)) != 0)
+
+static struct folio *pagein(swp_entry_t entry, struct swap_iocb **splug,
+ struct mempolicy *mpol)
+{
+ bool folio_was_allocated;
+ struct folio *folio = __read_swap_cache_async(entry, GFP_KERNEL, mpol,
+ NO_INTERLEAVE_INDEX, &folio_was_allocated, false);
+
+ if (folio_was_allocated)
+ swap_read_folio(folio, splug);
+ return folio;
+}
+
static int try_to_unuse(unsigned int type)
{
- struct mm_struct *prev_mm;
- struct mm_struct *mm;
- struct list_head *p;
- int retval = 0;
struct swap_info_struct *si = swap_info[type];
+ struct swap_iocb *splug = NULL;
+ struct mempolicy *mpol;
+ struct blk_plug plug;
+ unsigned long offset;
struct folio *folio;
swp_entry_t entry;
swp_slot_t slot;
- unsigned int i;
+ int ret = 0;
if (!swap_usage_in_pages(si))
goto success;
-retry:
- retval = shmem_unuse(type);
- if (retval)
- return retval;
-
- prev_mm = &init_mm;
- mmget(prev_mm);
-
- spin_lock(&mmlist_lock);
- p = &init_mm.mmlist;
- while (swap_usage_in_pages(si) &&
- !signal_pending(current) &&
- (p = p->next) != &init_mm.mmlist) {
+ mpol = get_task_policy(current);
+ blk_start_plug(&plug);
- mm = list_entry(p, struct mm_struct, mmlist);
- if (!mmget_not_zero(mm))
+ /* first round - submit the reads */
+ offset = 0;
+ for_each_allocated_offset(si, offset) {
+ slot = swp_slot(type, offset);
+ entry = swp_slot_to_swp_entry(slot);
+ if (!entry.val)
continue;
- spin_unlock(&mmlist_lock);
- mmput(prev_mm);
- prev_mm = mm;
- retval = unuse_mm(mm, type);
- if (retval) {
- mmput(prev_mm);
- return retval;
- }
- /*
- * Make sure that we aren't completely killing
- * interactive performance.
- */
- cond_resched();
- spin_lock(&mmlist_lock);
+ folio = pagein(entry, &splug, mpol);
+ if (folio)
+ folio_put(folio);
}
- spin_unlock(&mmlist_lock);
+ blk_finish_plug(&plug);
+ swap_read_unplug(splug);
+ splug = NULL;
+ lru_add_drain();
+
+ /* second round - updating the virtual swap slots' backing state */
+ offset = 0;
+ for_each_allocated_offset(si, offset) {
+ slot = swp_slot(type, offset);
+retry:
+ entry = swp_slot_to_swp_entry(slot);
+ if (!entry.val) {
+ if (!swap_slot_allocated(si, offset))
+ continue;
- mmput(prev_mm);
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
- i = 0;
- while (swap_usage_in_pages(si) &&
- !signal_pending(current) &&
- (i = find_next_to_unuse(si, i)) != 0) {
+ /* we might be racing with zswap writeback or disk swapout */
+ schedule_timeout_uninterruptible(1);
+ goto retry;
+ }
- slot = swp_slot(type, i);
- entry = swp_slot_to_swp_entry(slot);
- folio = swap_cache_get_folio(entry);
- if (!folio)
- continue;
+ /* try to allocate swap cache folio */
+ folio = pagein(entry, &splug, mpol);
+ if (!folio) {
+ if (!swp_slot_to_swp_entry(swp_slot(type, offset)).val)
+ continue;
+ ret = -ENOMEM;
+ pr_err("swapoff: unable to allocate swap cache folio for %lu\n",
+ entry.val);
+ goto out;
+ }
+
+ folio_lock(folio);
/*
- * It is conceivable that a racing task removed this folio from
- * swap cache just before we acquired the page lock. The folio
- * might even be back in swap cache on another swap area. But
- * that is okay, folio_free_swap() only removes stale folios.
+ * We need to check if the folio is still in swap cache, and is still
+ * backed by the physical swap slot we are trying to release.
+ *
+ * We can, for instance, race with zswap writeback, obtaining the
+ * temporary folio it allocated for decompression and writeback, which
+ * would be promptly deleted from swap cache. By the time we lock that
+ * folio, it might already contain stale data.
+ *
+ * Concurrent swap operations might have also come in before we
+ * reobtain the folio's lock, deleting the folio from swap cache,
+ * invalidating the virtual swap slot, then swapping out the folio
+ * again to a different swap backend.
+ *
+ * In all of these cases, we must retry the physical -> virtual lookup.
*/
- folio_lock(folio);
+ if (!folio_matches_swap_slot(folio, entry, slot)) {
+ folio_unlock(folio);
+ folio_put(folio);
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
+ schedule_timeout_uninterruptible(1);
+ goto retry;
+ }
+
folio_wait_writeback(folio);
- folio_free_swap(folio);
+ vswap_store_folio(entry, folio);
+ folio_mark_dirty(folio);
folio_unlock(folio);
folio_put(folio);
}
- /*
- * Lets check again to see if there are still swap entries in the map.
- * If yes, we would need to do retry the unuse logic again.
- * Under global memory pressure, swap entries can be reinserted back
- * into process space after the mmlist loop above passes over them.
- *
- * Limit the number of retries? No: when mmget_not_zero()
- * above fails, that mm is likely to be freeing swap from
- * exit_mmap(), which proceeds at its own independent pace;
- * and even shmem_writeout() could have been preempted after
- * folio_alloc_swap(), temporarily hiding that swap. It's easy
- * and robust (though cpu-intensive) just to keep retrying.
- */
- if (swap_usage_in_pages(si)) {
- if (!signal_pending(current))
- goto retry;
- return -EINTR;
+ /* concurrent swappers might still be releasing physical swap slots... */
+ while (swap_usage_in_pages(si)) {
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
+ schedule_timeout_uninterruptible(1);
}
+out:
+ swap_read_unplug(splug);
+ if (ret)
+ return ret;
+
success:
/*
* Make sure that further cleanups after try_to_unuse() returns happen
--
2.47.3
* [PATCH v3 20/20] swapfile: replace the swap map with bitmaps
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (18 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 19/20] swap: simplify swapoff using virtual swap Nhat Pham
@ 2026-02-08 21:58 ` Nhat Pham
2026-02-08 22:51 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-10 15:45 ` [syzbot ci] " syzbot ci
21 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 21:58 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Now that we have moved the swap count state to the virtual swap layer,
each swap map entry only has 3 possible states: free, allocated, and bad.
Replace the swap map with 2 bitmaps (one for the allocated state and one
for the bad state), saving 6 bits per swap entry.
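As a back-of-the-envelope illustration (assuming the 32 GB swapfile used
as the running example in the cover letter, i.e. 8,388,608 slots): the old
byte-per-slot swap map costs 8 MB, while the two bitmaps together cost
2 bits per slot, i.e. 2 MB total - a 6 MB (75%) reduction, matching the
6 bits saved per entry.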
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 3 +-
mm/swapfile.c | 81 +++++++++++++++++++++++---------------------
2 files changed, 44 insertions(+), 40 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a30d382fb5ee1..a02ce3fb2358b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,8 @@ struct swap_info_struct {
struct plist_node list; /* entry in swap_active_head */
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
- unsigned char *swap_map; /* vmalloc'ed array of usage counts */
+ unsigned long *swap_map; /* bitmap for allocated state */
+ unsigned long *bad_map; /* bitmap for bad state */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
struct list_head full_clusters; /* full clusters list */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9478707ce3ffa..b7661ffa312be 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -760,25 +760,19 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
struct swap_cluster_info *ci,
unsigned long start, unsigned long end)
{
- unsigned char *map = si->swap_map;
unsigned long offset = start;
int nr_reclaim;
spin_unlock(&ci->lock);
do {
- switch (READ_ONCE(map[offset])) {
- case 0:
+ if (!test_bit(offset, si->swap_map)) {
offset++;
- break;
- case SWAP_MAP_ALLOCATED:
+ } else {
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
if (nr_reclaim > 0)
offset += nr_reclaim;
else
goto out;
- break;
- default:
- goto out;
}
} while (offset < end);
out:
@@ -787,11 +781,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
* Recheck the range no matter reclaim succeeded or not, the slot
* could have been be freed while we are not holding the lock.
*/
- for (offset = start; offset < end; offset++)
- if (READ_ONCE(map[offset]))
- return false;
-
- return true;
+ return find_next_bit(si->swap_map, end, start) >= end;
}
static bool cluster_scan_range(struct swap_info_struct *si,
@@ -800,15 +790,16 @@ static bool cluster_scan_range(struct swap_info_struct *si,
bool *need_reclaim)
{
unsigned long offset, end = start + nr_pages;
- unsigned char *map = si->swap_map;
- unsigned char count;
if (cluster_is_empty(ci))
return true;
for (offset = start; offset < end; offset++) {
- count = READ_ONCE(map[offset]);
- if (!count)
+ /* Bad slots cannot be used for allocation */
+ if (test_bit(offset, si->bad_map))
+ return false;
+
+ if (!test_bit(offset, si->swap_map))
continue;
if (swap_cache_only(si, offset)) {
@@ -841,7 +832,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
if (cluster_is_empty(ci))
ci->order = order;
- memset(si->swap_map + start, usage, nr_pages);
+ bitmap_set(si->swap_map, start, nr_pages);
swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
@@ -1404,7 +1395,7 @@ static struct swap_info_struct *_swap_info_get(swp_slot_t slot)
offset = swp_slot_offset(slot);
if (offset >= si->max)
goto bad_offset;
- if (data_race(!si->swap_map[swp_slot_offset(slot)]))
+ if (data_race(!test_bit(offset, si->swap_map)))
goto bad_free;
return si;
@@ -1518,8 +1509,7 @@ static void swap_slots_free(struct swap_info_struct *si,
swp_slot_t slot, unsigned int nr_pages)
{
unsigned long offset = swp_slot_offset(slot);
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
+ unsigned long end = offset + nr_pages;
/* It should never free entries across different clusters */
VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
@@ -1527,10 +1517,8 @@ static void swap_slots_free(struct swap_info_struct *si,
VM_BUG_ON(ci->count < nr_pages);
ci->count -= nr_pages;
- do {
- VM_BUG_ON(!swap_is_last_ref(*map));
- *map = 0;
- } while (++map < map_end);
+ VM_BUG_ON(find_next_zero_bit(si->swap_map, end, offset) < end);
+ bitmap_clear(si->swap_map, offset, nr_pages);
swap_range_free(si, offset, nr_pages);
@@ -1741,9 +1729,7 @@ unsigned int count_swap_pages(int type, int free)
static bool swap_slot_allocated(struct swap_info_struct *si,
unsigned long offset)
{
- unsigned char count = READ_ONCE(si->swap_map[offset]);
-
- return count && swap_count(count) != SWAP_MAP_BAD;
+ return test_bit(offset, si->swap_map);
}
/*
@@ -2064,7 +2050,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
}
static void setup_swap_info(struct swap_info_struct *si, int prio,
- unsigned char *swap_map,
+ unsigned long *swap_map,
struct swap_cluster_info *cluster_info)
{
si->prio = prio;
@@ -2092,7 +2078,7 @@ static void _enable_swap_info(struct swap_info_struct *si)
}
static void enable_swap_info(struct swap_info_struct *si, int prio,
- unsigned char *swap_map,
+ unsigned long *swap_map,
struct swap_cluster_info *cluster_info)
{
spin_lock(&swap_lock);
@@ -2185,7 +2171,8 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
{
struct swap_info_struct *p = NULL;
- unsigned char *swap_map;
+ unsigned long *swap_map;
+ unsigned long *bad_map;
struct swap_cluster_info *cluster_info;
struct file *swap_file, *victim;
struct address_space *mapping;
@@ -2280,6 +2267,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->swap_file = NULL;
swap_map = p->swap_map;
p->swap_map = NULL;
+ bad_map = p->bad_map;
+ p->bad_map = NULL;
maxpages = p->max;
cluster_info = p->cluster_info;
p->max = 0;
@@ -2290,7 +2279,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
mutex_unlock(&swapon_mutex);
kfree(p->global_cluster);
p->global_cluster = NULL;
- vfree(swap_map);
+ kvfree(swap_map);
+ kvfree(bad_map);
free_cluster_info(cluster_info, maxpages);
inode = mapping->host;
@@ -2638,18 +2628,20 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
static int setup_swap_map(struct swap_info_struct *si,
union swap_header *swap_header,
- unsigned char *swap_map,
+ unsigned long *swap_map,
+ unsigned long *bad_map,
unsigned long maxpages)
{
unsigned long i;
- swap_map[0] = SWAP_MAP_BAD; /* omit header page */
+ set_bit(0, bad_map); /* omit header page */
+
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr == 0 || page_nr > swap_header->info.last_page)
return -EINVAL;
if (page_nr < maxpages) {
- swap_map[page_nr] = SWAP_MAP_BAD;
+ set_bit(page_nr, bad_map);
si->pages--;
}
}
@@ -2753,7 +2745,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
int nr_extents;
sector_t span;
unsigned long maxpages;
- unsigned char *swap_map = NULL;
+ unsigned long *swap_map = NULL, *bad_map = NULL;
struct swap_cluster_info *cluster_info = NULL;
struct folio *folio = NULL;
struct inode *inode = NULL;
@@ -2849,16 +2841,24 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
maxpages = si->max;
/* OK, set up the swap map and apply the bad block list */
- swap_map = vzalloc(maxpages);
+ swap_map = kvcalloc(BITS_TO_LONGS(maxpages), sizeof(long), GFP_KERNEL);
if (!swap_map) {
error = -ENOMEM;
goto bad_swap_unlock_inode;
}
- error = setup_swap_map(si, swap_header, swap_map, maxpages);
+ bad_map = kvcalloc(BITS_TO_LONGS(maxpages), sizeof(long), GFP_KERNEL);
+ if (!bad_map) {
+ error = -ENOMEM;
+ goto bad_swap_unlock_inode;
+ }
+
+ error = setup_swap_map(si, swap_header, swap_map, bad_map, maxpages);
if (error)
goto bad_swap_unlock_inode;
+ si->bad_map = bad_map;
+
if (si->bdev && bdev_stable_writes(si->bdev))
si->flags |= SWP_STABLE_WRITES;
@@ -2952,7 +2952,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
si->swap_file = NULL;
si->flags = 0;
spin_unlock(&swap_lock);
- vfree(swap_map);
+ if (swap_map)
+ kvfree(swap_map);
+ if (bad_map)
+ kvfree(bad_map);
if (cluster_info)
free_cluster_info(cluster_info, maxpages);
if (inced_nr_rotate_swap)
--
2.47.3
* [PATCH v3 00/20] Virtual Swap Space
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
@ 2026-02-08 22:26 ` Nhat Pham
2026-02-10 17:59 ` Kairui Song
2026-02-08 22:31 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (2 subsequent siblings)
3 siblings, 1 reply; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 22:26 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx, riel, joshua.hahnjy,
npache, gourry, axelrasmussen, yuanchu, weixugc, rafael, jannh,
pfalcato, zhengqi.arch
My sincerest apologies - it seems like the cover letter (and just the
cover letter) failed to be sent out, for some reason. I'm trying to figure
out what happened - it works when I send the entire patch series to
myself...
Anyway, resending this (in-reply-to patch 1 of the series):
Changelog:
* RFC v2 -> v3:
* Implement a cluster-based allocation algorithm for virtual swap
slots, inspired by Kairui Song and Chris Li's implementation, as
well as Johannes Weiner's suggestions. This eliminates the lock
contention issues on the virtual swap layer.
* Re-use swap table for the reverse mapping.
* Remove CONFIG_VIRTUAL_SWAP.
* Reducing the size of the swap descriptor from 48 bytes to 24
bytes, i.e another 50% reduction in memory overhead from v2.
* Remove swap cache and zswap tree and use the swap descriptor
for this.
* Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
(one for allocated slots, and one for bad slots).
* Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
* Update cover letter to include new benchmark results and discussion
on overhead in various cases.
* RFC v1 -> RFC v2:
* Use a single atomic type (swap_refs) for reference counting
purpose. This brings the size of the swap descriptor from 64 B
down to 48 B (25% reduction). Suggested by Yosry Ahmed.
* Zeromap bitmap is removed in the virtual swap implementation.
This saves one bit per physical swapfile slot.
* Rearrange the patches and the code change to make things more
reviewable. Suggested by Johannes Weiner.
* Update the cover letter a bit.
This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).
This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on-par with
baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.
I. Motivation
Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
just disk space, and swapoff is rare.
However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
mobile and embedded devices), users cannot adopt zswap, and are forced
to use zram. This is confusing for users, and creates extra burdens
for developers, having to develop and maintain similar features for
two separate swap backends (writeback, cgroup charging, THP support,
etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
we have swapfiles on the order of tens to hundreds of GBs, which are
mostly unused and only exist to enable zswap usage and zero-filled
pages swap optimizations.
* Tying zswap (and more generally, other in-memory swap backends) to
the current physical swapfile infrastructure makes zswap implicitly
statically sized. This does not make sense, as unlike disk swap, in
which we consume a limited resource (disk space or swapfile space) to
save another resource (memory), zswap consumes the same resource it is
saving (memory). The more we zswap, the more memory we have available,
not less. We are not rationing a limited resource when we limit
the size of the zswap pool, but rather we are capping the resource
(memory) saving potential of zswap. Under memory pressure, using
more zswap is almost always better than the alternative (disk IOs, or
even worse, OOMs), and dynamically sizing the zswap pool on demand
allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
significant challenges, because the sysadmin has to prescribe how
much swap is needed a priori, for each combination of
(memory size x disk space x workload usage). It is even more
complicated when we take into account the variance of memory
compression, which changes the reclaim dynamics (and as a result,
swap space size requirement). The problem is further exacerbated for
users who rely on swap utilization (and exhaustion) as an OOM signal.
All of these factors make it very difficult to configure the swapfile
for zswap: too small of a swapfile and we risk preventable OOMs and
limit the memory saving potentials of zswap; too big of a swapfile
and we waste disk space and memory due to swap metadata overhead.
This dilemma becomes more drastic in high memory systems, which can
have up to TBs worth of memory.
Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also achieves dynamic sizing of the
swap space to maximize the memory saving potential while reducing
operational and static memory overhead.
Finally, any swap redesign should support efficient backend transfer,
i.e without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
Johannes (from [14]): "Combining compression with disk swap is
extremely powerful, because it dramatically reduces the worst aspects
of both: it reduces the memory footprint of compression by shedding
the coldest data to disk; it reduces the IO latencies and flash wear
of disk swap through the writeback cache. In practice, this reduces
*average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
and expensive in the current design, precisely because we are storing
an encoding of the backend positional information in the page table,
and thus requires a full page table walk to remove these references.
II. High Level Design Overview
To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:
struct swp_desc {
union {
swp_slot_t slot; /* 0 8 */
struct zswap_entry * zswap_entry; /* 0 8 */
}; /* 0 8 */
union {
struct folio * swap_cache; /* 8 8 */
void * shadow; /* 8 8 */
}; /* 8 8 */
unsigned int swap_count; /* 16 4 */
unsigned short memcgid:16; /* 20: 0 2 */
bool in_swapcache:1; /* 22: 0 1 */
/* Bitfield combined with previous fields */
enum swap_type type:2; /* 20:17 4 */
/* size: 24, cachelines: 1, members: 6 */
/* bit_padding: 13 bits */
/* last cacheline: 24 bytes */
};
(output from pahole).
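For illustration, resolving a virtual swap entry to its backing store then
becomes a switch over the descriptor type. This is a minimal sketch only:
VSWAP_SWAPFILE and VSWAP_ZSWAP appear in the actual patches, while the
read_from_*() helpers and the handling of the remaining backends are
hypothetical placeholders, not the literal mm/vswap.c API:

	/* Illustrative sketch of backend resolution via struct swp_desc. */
	static void vswap_resolve_sketch(struct swp_desc *desc, struct folio *folio)
	{
		switch (desc->type) {
		case VSWAP_SWAPFILE:
			/* desc->slot is the physical (type, offset) slot on a swapfile */
			read_from_swapfile(desc->slot, folio);	/* hypothetical helper */
			break;
		case VSWAP_ZSWAP:
			/* decompress from the zswap entry; no disk I/O involved */
			read_from_zswap(desc->zswap_entry, folio);	/* hypothetical helper */
			break;
		default:
			/* zero-filled or in-memory-folio backends, as listed below */
			break;
		}
	}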
This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
simply associate the virtual swap slot with one of the supported
backends: a zswap entry, a zero-filled swap page, a slot on the
swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
have the virtual swap slot point to the page instead of the on-disk
physical swap slot. No need to perform any page table walking.
The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
is a massive source of static memory overhead. With the new design,
it is only allocated for used clusters.
* the swap tables, which hold the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots to
indicate whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
one for allocated slots, and one for bad slots, representing 3 possible
states of a slot on the swapfile: allocated, free, and bad.
* the zswap tree.
So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
new indirection pointer neatly replaces the existing zswap tree.
We really only incur less than one word of overhead for swap count
blow up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design will impose fewer than 3 words
of memory overhead. However, as noted above, this overhead is only for
actively used swap entries, whereas in the current design the overhead is
static (including the swap cgroup array for example).
The primary victim of this overhead will be zram users. However, as
zswap now no longer takes up disk space, zram users can consider
switching to zswap (which, as a bonus, has a lot of useful features
out of the box, such as cgroup tracking, dynamic zswap pool sizing,
LRU-ordering writeback, etc.).
For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB
So even in the worst case scenario for virtual swap, i.e. when we
somehow have an oracle to correctly size the swapfile for the zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)
In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
were previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.
Doing the same math for the disk swap, which is the worst case for
virtual swap in terms of swap backends:
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB
The added overhead is 170MB, which is 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.
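(Reading these tables back: each additional 2,097,152 in-use entries adds
48.25 MB in the zswap case and 64.25 MB in the disk case, i.e. roughly
24.1 and 32.1 bytes per in-use entry respectively, which is consistent
with the 24-byte descriptor plus the remaining per-slot metadata such as
the bitmaps and the reverse mapping. This is only a back-of-the-envelope
reading of the numbers above, not an exact accounting.)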
Please see the attached patches for more implementation details.
III. Usage and Benchmarking
This patch series introduce no new syscalls or userspace API. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.
To measure the performance of the new implementation, I have run the
following benchmarks:
1. Kernel building: 52 workers (one per processor), memory.max = 3G.
Using zswap as the backend:
Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s
Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s
We actually see a slight improvement in systime (by 1.5%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and virtual swap allocator is simpler than that of physical
swap.
Using SSD swap as the backend:
Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s
Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s
The performance is neck and neck.
IV. Future Use Cases
While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:
* Multi-tier swapping (as mentioned in [5]), with transparent
transferring (promotion/demotion) of pages across tiers (see [8] and
[9]). Similar to swapoff, with the old design we would need to
perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
Huang in [6]).
* Mixed backing THP swapin (see [7]): Once you have pinned down the
backing store of THPs, then you can dispatch each range of subpages
to the appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
(see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
physical swap space for pages when they enter the zswap pool, giving
the kernel no flexibility at writeback time. With the virtual swap
implementation, the backends are decoupled, and physical swap space
is allocated on-demand at writeback time, at which point we can make
much smarter decisions: we can batch multiple zswap writeback
operations into a single IO request, allocating contiguous physical
swap slots for that request. We can even perform compressed writeback
(i.e writing these pages without decompressing them) (see [12]).
V. References
[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
Nhat Pham (20):
mm/swap: decouple swap cache from physical swap infrastructure
swap: rearrange the swap header file
mm: swap: add an abstract API for locking out swapoff
zswap: add new helpers for zswap entry operations
mm/swap: add a new function to check if a swap entry is in swap
cached.
mm: swap: add a separate type for physical swap slots
mm: create scaffolds for the new virtual swap implementation
zswap: prepare zswap for swap virtualization
mm: swap: allocate a virtual swap slot for each swapped out page
swap: move swap cache to virtual swap descriptor
zswap: move zswap entry management to the virtual swap descriptor
swap: implement the swap_cgroup API using virtual swap
swap: manage swap entry lifecycle at the virtual swap layer
mm: swap: decouple virtual swap slot from backing store
zswap: do not start zswap shrinker if there is no physical swap slots
swap: do not unnecesarily pin readahead swap entries
swapfile: remove zeromap bitmap
memcg: swap: only charge physical swap slots
swap: simplify swapoff using virtual swap
swapfile: replace the swap map with bitmaps
Documentation/mm/swap-table.rst | 69 --
MAINTAINERS | 2 +
include/linux/cpuhotplug.h | 1 +
include/linux/mm_types.h | 16 +
include/linux/shmem_fs.h | 7 +-
include/linux/swap.h | 135 ++-
include/linux/swap_cgroup.h | 13 -
include/linux/swapops.h | 25 +
include/linux/zswap.h | 17 +-
kernel/power/swap.c | 6 +-
mm/Makefile | 5 +-
mm/huge_memory.c | 11 +-
mm/internal.h | 12 +-
mm/memcontrol-v1.c | 6 +
mm/memcontrol.c | 142 ++-
mm/memory.c | 101 +-
mm/migrate.c | 13 +-
mm/mincore.c | 15 +-
mm/page_io.c | 83 +-
mm/shmem.c | 215 +---
mm/swap.h | 157 +--
mm/swap_cgroup.c | 172 ---
mm/swap_state.c | 306 +----
mm/swap_table.h | 78 +-
mm/swapfile.c | 1518 ++++-------------------
mm/userfaultfd.c | 18 +-
mm/vmscan.c | 28 +-
mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
mm/zswap.c | 142 +--
29 files changed, 2853 insertions(+), 2485 deletions(-)
delete mode 100644 Documentation/mm/swap-table.rst
delete mode 100644 mm/swap_cgroup.c
create mode 100644 mm/vswap.c
base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
--
2.47.3
* [PATCH v3 00/20] Virtual Swap Space
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
2026-02-08 22:26 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
@ 2026-02-08 22:31 ` Nhat Pham
2026-02-09 12:20 ` Chris Li
2026-02-08 22:39 ` Nhat Pham
2026-02-09 2:22 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure kernel test robot
3 siblings, 1 reply; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 22:31 UTC (permalink / raw)
To: akpm
Cc: nphamcs, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, christophe.leroy, pavel, linux-mm,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
My sincerest apologies - it seems like the cover letter (and just the
cover letter) failed to be sent out, for some reason. I'm trying to figure
out what happened - it works when I send the entire patch series to
myself...
Anyway, resending this (in-reply-to patch 1 of the series):
Changelog:
* RFC v2 -> v3:
* Implement a cluster-based allocation algorithm for virtual swap
slots, inspired by Kairui Song and Chris Li's implementation, as
well as Johannes Weiner's suggestions. This eliminates the lock
contention issues on the virtual swap layer.
* Re-use swap table for the reverse mapping.
* Remove CONFIG_VIRTUAL_SWAP.
* Reducing the size of the swap descriptor from 48 bytes to 24
bytes, i.e another 50% reduction in memory overhead from v2.
* Remove swap cache and zswap tree and use the swap descriptor
for this.
* Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
(one for allocated slots, and one for bad slots).
* Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
* Update cover letter to include new benchmark results and discussion
on overhead in various cases.
* RFC v1 -> RFC v2:
* Use a single atomic type (swap_refs) for reference counting
purpose. This brings the size of the swap descriptor from 64 B
down to 48 B (25% reduction). Suggested by Yosry Ahmed.
* Zeromap bitmap is removed in the virtual swap implementation.
This saves one bit per physical swapfile slot.
* Rearrange the patches and the code change to make things more
reviewable. Suggested by Johannes Weiner.
* Update the cover letter a bit.
This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).
This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on-par with
baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.
I. Motivation
Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
just disk space, and swapoff is rare.
However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
mobile and embedded devices), users cannot adopt zswap, and are forced
to use zram. This is confusing for users, and creates extra burdens
for developers, having to develop and maintain similar features for
two separate swap backends (writeback, cgroup charging, THP support,
etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
we have swapfiles on the order of tens to hundreds of GBs, which are
mostly unused and only exist to enable zswap usage and zero-filled
pages swap optimizations.
* Tying zswap (and more generally, other in-memory swap backends) to
the current physical swapfile infrastructure makes zswap implicitly
statically sized. This does not make sense, as unlike disk swap, in
which we consume a limited resource (disk space or swapfile space) to
save another resource (memory), zswap consumes the same resource it is
saving (memory). The more we zswap, the more memory we have available,
not less. We are not rationing a limited resource when we limit
the size of the zswap pool, but rather we are capping the resource
(memory) saving potential of zswap. Under memory pressure, using
more zswap is almost always better than the alternative (disk IOs, or
even worse, OOMs), and dynamically sizing the zswap pool on demand
allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
significant challenges, because the sysadmin has to prescribe how
much swap is needed a priori, for each combination of
(memory size x disk space x workload usage). It is even more
complicated when we take into account the variance of memory
compression, which changes the reclaim dynamics (and as a result,
swap space size requirement). The problem is further exacerbated for
users who rely on swap utilization (and exhaustion) as an OOM signal.
All of these factors make it very difficult to configure the swapfile
for zswap: too small of a swapfile and we risk preventable OOMs and
limit the memory saving potentials of zswap; too big of a swapfile
and we waste disk space and memory due to swap metadata overhead.
This dilemma becomes more drastic in high memory systems, which can
have up to TBs worth of memory.
Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also achieves the dynamicization of
swap space to maximize the memory saving potentials while reducing
operational and static memory overhead.
Finally, any swap redesign should support efficient backend transfer,
i.e. without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
Johannes (from [14]): "Combining compression with disk swap is
extremely powerful, because it dramatically reduces the worst aspects
of both: it reduces the memory footprint of compression by shedding
the coldest data to disk; it reduces the IO latencies and flash wear
of disk swap through the writeback cache. In practice, this reduces
*average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
and expensive in the current design, precisely because we are storing
an encoding of the backend positional information in the page table,
which means a full page table walk is required to remove these references.
II. High Level Design Overview
To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:
struct swp_desc {
union {
swp_slot_t slot; /* 0 8 */
struct zswap_entry * zswap_entry; /* 0 8 */
}; /* 0 8 */
union {
struct folio * swap_cache; /* 8 8 */
void * shadow; /* 8 8 */
}; /* 8 8 */
unsigned int swap_count; /* 16 4 */
unsigned short memcgid:16; /* 20: 0 2 */
bool in_swapcache:1; /* 22: 0 1 */
/* Bitfield combined with previous fields */
enum swap_type type:2; /* 20:17 4 */
/* size: 24, cachelines: 1, members: 6 */
/* bit_padding: 13 bits */
/* last cacheline: 24 bytes */
};
(output from pahole).
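To make the indirection concrete, here is a minimal sketch of how a swap
client could resolve a virtual slot through the descriptor. This is
illustrative only: the enum value names and the helpers zswap_load_entry()
and swap_read_slot() are made-up stand-ins, not necessarily what this
series actually uses.

    /* Illustrative backend types -- value names are not from the series. */
    enum swap_type {
            VSWAP_ZSWAP,    /* compressed copy in the zswap pool */
            VSWAP_ZERO,     /* zero-filled page, no backing data */
            VSWAP_SWAPFILE, /* slot on a physical swap device */
            VSWAP_FOLIO,    /* page kept in memory (e.g. after swapoff) */
    };

    static int vswap_read(struct swp_desc *desc, struct folio *folio)
    {
            switch (desc->type) {
            case VSWAP_ZSWAP:
                    /* Decompress straight from the zswap pool. */
                    return zswap_load_entry(desc->zswap_entry, folio);
            case VSWAP_ZERO:
                    /* No backing data at all: just provide zeroes. */
                    folio_zero_range(folio, 0, folio_size(folio));
                    return 0;
            case VSWAP_SWAPFILE:
                    /* Only this case touches the physical swap device. */
                    return swap_read_slot(desc->slot, folio);
            case VSWAP_FOLIO:
                    /* Content is still (or again) in memory. */
                    folio_copy(folio, desc->swap_cache);
                    return 0;
            }
            return -EINVAL;
    }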
This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
simply associate the virtual swap slot with one of the supported
backends: a zswap entry, a zero-filled swap page, a slot on the
swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
have the virtual swap slot point to the page instead of the on-disk
physical swap slot. No need to perform any page table walking (see the
sketch below).
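As a purely illustrative sketch (reusing the made-up names from the
previous snippet; swapin_descriptor() and swap_slot_free() are also
hypothetical), swapoff then only has to visit the descriptors that point
at the device being disabled:

    static int vswap_swapoff_one(struct swp_desc *desc)
    {
            struct folio *folio;

            /* Read the data back in from the swap device. */
            folio = swapin_descriptor(desc);
            if (IS_ERR(folio))
                    return PTR_ERR(folio);

            /*
             * Repoint the descriptor at the in-memory folio and release
             * the physical slot. The PTEs still contain the same virtual
             * swap entry, so no page table walk is required.
             */
            swap_slot_free(desc->slot);
            desc->swap_cache = folio;
            desc->type = VSWAP_FOLIO;
            return 0;
    }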
The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
is a massive source of static memory overhead. With the new design,
it is only allocated for used clusters.
* the swap tables, which hold the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots to
indicate whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
one for allocated slots, and one for bad slots, representing 3 possible
states of a slot on the swapfile: allocated, free, and bad (see the
sketch after this list).
* the zswap tree.
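As a small sketch of the bitmap point above (simplified; the actual
helpers in mm/swapfile.c after this series may look different), the three
slot states can be derived from the two bitmaps like this:

    enum slot_state { SLOT_FREE, SLOT_ALLOCATED, SLOT_BAD };

    static enum slot_state slot_state(unsigned long *allocated_map,
                                      unsigned long *bad_map, pgoff_t offset)
    {
            if (test_bit(offset, bad_map))
                    return SLOT_BAD;
            if (test_bit(offset, allocated_map))
                    return SLOT_ALLOCATED;
            return SLOT_FREE;
    }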
So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
new indirection pointer neatly replaces the existing zswap tree.
We really only incur less than one word of overhead for swap count
blow up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design will impose fewer than 3 words
of memory overhead. However, as noted above, this overhead is only incurred
for actively used swap entries, whereas in the current design the overhead
is static (including the swap cgroup array, for example).
The primary victim of this overhead will be zram users. However, as
zswap now no longer takes up disk space, zram users can consider
switching to zswap (which, as a bonus, has a lot of useful features
out of the box, such as cgroup tracking, dynamic zswap pool sizing,
LRU-ordering writeback, etc.).
For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB
So even in the worst case scenario for virtual swap, i.e. when we
somehow have an oracle to correctly size the swapfile for the zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)
In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
were previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.
Doing the same math for the disk swap, which is the worst case for
virtual swap in terms of swap backends:
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB
The added overhead is 170 MB, which is 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.
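For the curious, here is a rough reconstruction of where these numbers
come from. This is my own back-of-the-envelope math, assuming 8-byte swap
table entries and roughly 8 bytes of zswap tree overhead per entry; the
precise breakdown in the series may differ slightly:

    Old design, static:    swap_map (1 B/slot) + swap_cgroup (2 B/slot)
                           + zeromap (1 bit/slot)
                           ~= 3.125 B * 8,388,608 slots ~= 25 MB
    Old design, per entry: ~8 B swap table entry, plus ~8 B of zswap tree
                           overhead when zswap is the backend
                           (hence ~89 MB vs ~153 MB at 100% usage)
    Vswap, static (disk):  2 bitmaps * 1 bit/slot ~= 2 MB
    Vswap, per entry:      24 B descriptor (~192 MB at 100% usage), plus
                           ~8 B reverse-mapping swap table entry for the
                           disk case (~64 MB at 100% usage)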
Please see the attached patches for more implementation details.
III. Usage and Benchmarking
This patch series introduces no new syscalls or userspace APIs. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.
To measure the performance of the new implementation, I have run the
following benchmarks:
1. Kernel building: 52 workers (one per processor), memory.max = 3G.
Using zswap as the backend:
Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s
Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s
We actually see a slight improvement in systime (by 1.5%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and the virtual swap allocator is simpler than that of physical
swap.
Using SSD swap as the backend:
Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s
Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s
The performance is neck and neck.
IV. Future Use Cases
While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:
* Multi-tier swapping (as mentioned in [5]), with transparent
transferring (promotion/demotion) of pages across tiers (see [8] and
[9]). Similar to swapoff, with the old design we would need to
perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
Huang in [6]).
* Mixed backing THP swapin (see [7]): Once you have pinned down the
backing store of THPs, you can dispatch each range of subpages
to the appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
(see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
physical swap space for pages when they enter the zswap pool, giving
the kernel no flexibility at writeback time. With the virtual swap
implementation, the backends are decoupled, and physical swap space
is allocated on-demand at writeback time, at which point we can make
much smarter decisions: we can batch multiple zswap writeback
operations into a single IO request, allocating contiguous physical
swap slots for that request. We can even perform compressed writeback
(i.e. writing these pages without decompressing them) (see [12]).
V. References
[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
Nhat Pham (20):
mm/swap: decouple swap cache from physical swap infrastructure
swap: rearrange the swap header file
mm: swap: add an abstract API for locking out swapoff
zswap: add new helpers for zswap entry operations
mm/swap: add a new function to check if a swap entry is in swap
cached.
mm: swap: add a separate type for physical swap slots
mm: create scaffolds for the new virtual swap implementation
zswap: prepare zswap for swap virtualization
mm: swap: allocate a virtual swap slot for each swapped out page
swap: move swap cache to virtual swap descriptor
zswap: move zswap entry management to the virtual swap descriptor
swap: implement the swap_cgroup API using virtual swap
swap: manage swap entry lifecycle at the virtual swap layer
mm: swap: decouple virtual swap slot from backing store
zswap: do not start zswap shrinker if there is no physical swap slots
swap: do not unnecesarily pin readahead swap entries
swapfile: remove zeromap bitmap
memcg: swap: only charge physical swap slots
swap: simplify swapoff using virtual swap
swapfile: replace the swap map with bitmaps
Documentation/mm/swap-table.rst | 69 --
MAINTAINERS | 2 +
include/linux/cpuhotplug.h | 1 +
include/linux/mm_types.h | 16 +
include/linux/shmem_fs.h | 7 +-
include/linux/swap.h | 135 ++-
include/linux/swap_cgroup.h | 13 -
include/linux/swapops.h | 25 +
include/linux/zswap.h | 17 +-
kernel/power/swap.c | 6 +-
mm/Makefile | 5 +-
mm/huge_memory.c | 11 +-
mm/internal.h | 12 +-
mm/memcontrol-v1.c | 6 +
mm/memcontrol.c | 142 ++-
mm/memory.c | 101 +-
mm/migrate.c | 13 +-
mm/mincore.c | 15 +-
mm/page_io.c | 83 +-
mm/shmem.c | 215 +---
mm/swap.h | 157 +--
mm/swap_cgroup.c | 172 ---
mm/swap_state.c | 306 +----
mm/swap_table.h | 78 +-
mm/swapfile.c | 1518 ++++-------------------
mm/userfaultfd.c | 18 +-
mm/vmscan.c | 28 +-
mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
mm/zswap.c | 142 +--
29 files changed, 2853 insertions(+), 2485 deletions(-)
delete mode 100644 Documentation/mm/swap-table.rst
delete mode 100644 mm/swap_cgroup.c
create mode 100644 mm/vswap.c
base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 00/20] Virtual Swap Space
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
2026-02-08 22:26 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-08 22:31 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
@ 2026-02-08 22:39 ` Nhat Pham
2026-02-09 2:22 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure kernel test robot
3 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 22:39 UTC (permalink / raw)
To: nphamcs, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, christophe.leroy, pavel,
linux-mm, kernel-team, linux-kernel, cgroups, linux-pm, peterx,
riel, joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu,
weixugc, rafael, jannh, pfalcato, zhengqi.arch
My sincerest apologies - it seems like the cover letter (and just the
cover letter) fails to be sent out, for some reason. I'm trying to figure
out what happened - it works when I send the entire patch series to
myself...
Anyway, resending this (in-reply-to patch 1 of the series):
Changelog:
* RFC v2 -> v3:
* Implement a cluster-based allocation algorithm for virtual swap
slots, inspired by Kairui Song and Chris Li's implementation, as
well as Johannes Weiner's suggestions. This eliminates the lock
contention issues on the virtual swap layer.
* Re-use swap table for the reverse mapping.
* Remove CONFIG_VIRTUAL_SWAP.
* Reducing the size of the swap descriptor from 48 bytes to 24
bytes, i.e. another 50% reduction in memory overhead from v2.
* Remove swap cache and zswap tree and use the swap descriptor
for this.
* Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
(one for allocated slots, and one for bad slots).
* Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
* Update cover letter to include new benchmark results and discussion
on overhead in various cases.
* RFC v1 -> RFC v2:
* Use a single atomic type (swap_refs) for reference counting
purposes. This brings the size of the swap descriptor from 64 B
down to 48 B (25% reduction). Suggested by Yosry Ahmed.
* Zeromap bitmap is removed in the virtual swap implementation.
This saves one bit per physical swapfile slot.
* Rearrange the patches and the code change to make things more
reviewable. Suggested by Johannes Weiner.
* Update the cover letter a bit.
This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).
This patch series is based on 6.19. There are a couple more
swap-related changes in the mm-stable branch that I would need to
coordinate with, but I would like to send this out as an update, to show
that the lock contention issues that plagued earlier versions have been
resolved and performance on the kernel build benchmark is now on-par with
baseline. Furthermore, memory overhead has been substantially reduced
compared to the last RFC version.
I. Motivation
Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
just disk space, and swapoff is rare.
However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
mobile and embedded devices), users cannot adopt zswap, and are forced
to use zram. This is confusing for users, and creates extra burdens
for developers, having to develop and maintain similar features for
two separate swap backends (writeback, cgroup charging, THP support,
etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
we have swapfiles on the order of tens to hundreds of GBs, which are
mostly unused and only exist to enable zswap usage and zero-filled
pages swap optimizations.
* Tying zswap (and more generally, other in-memory swap backends) to
the current physical swapfile infrastructure makes zswap implicitly
statically sized. This does not make sense, as unlike disk swap, in
which we consume a limited resource (disk space or swapfile space) to
save another resource (memory), zswap consumes the same resource it is
saving (memory). The more we zswap, the more memory we have available,
not less. We are not rationing a limited resource when we limit
the size of the zswap pool, but rather we are capping the resource
(memory) saving potential of zswap. Under memory pressure, using
more zswap is almost always better than the alternative (disk IOs, or
even worse, OOMs), and dynamically sizing the zswap pool on demand
allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
significant challenges, because the sysadmin has to prescribe how
much swap is needed a priori, for each combination of
(memory size x disk space x workload usage). It is even more
complicated when we take into account the variance of memory
compression, which changes the reclaim dynamics (and as a result,
swap space size requirement). The problem is further exacerbated for
users who rely on swap utilization (and exhaustion) as an OOM signal.
All of these factors make it very difficult to configure the swapfile
for zswap: too small of a swapfile and we risk preventable OOMs and
limit the memory saving potentials of zswap; too big of a swapfile
and we waste disk space and memory due to swap metadata overhead.
This dilemma becomes more drastic in high memory systems, which can
have up to TBs worth of memory.
Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also achieves the dynamicization of
swap space to maximize the memory saving potentials while reducing
operational and static memory overhead.
Finally, any swap redesign should support efficient backend transfer,
i.e. without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
Johannes (from [14]): "Combining compression with disk swap is
extremely powerful, because it dramatically reduces the worst aspects
of both: it reduces the memory footprint of compression by shedding
the coldest data to disk; it reduces the IO latencies and flash wear
of disk swap through the writeback cache. In practice, this reduces
*average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
and expensive in the current design, precisely because we are storing
an encoding of the backend positional information in the page table,
which means a full page table walk is required to remove these references.
II. High Level Design Overview
To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:
struct swp_desc {
union {
swp_slot_t slot; /* 0 8 */
struct zswap_entry * zswap_entry; /* 0 8 */
}; /* 0 8 */
union {
struct folio * swap_cache; /* 8 8 */
void * shadow; /* 8 8 */
}; /* 8 8 */
unsigned int swap_count; /* 16 4 */
unsigned short memcgid:16; /* 20: 0 2 */
bool in_swapcache:1; /* 22: 0 1 */
/* Bitfield combined with previous fields */
enum swap_type type:2; /* 20:17 4 */
/* size: 24, cachelines: 1, members: 6 */
/* bit_padding: 13 bits */
/* last cacheline: 24 bytes */
};
(output from pahole).
This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
simply associate the virtual swap slot with one of the supported
backends: a zswap entry, a zero-filled swap page, a slot on the
swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
have the virtual swap slot point to the page instead of the on-disk
physical swap slot. No need to perform any page table walking.
The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
is a massive source of static memory overhead. With the new design,
it is only allocated for used clusters.
* the swap tables, which hold the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots to
indicate whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
one for allocated slots, and one for bad slots, representing 3 possible
states of a slot on the swapfile: allocated, free, and bad.
* the zswap tree.
So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
new indirection pointer neatly replaces the existing zswap tree.
We really only incur less than one word of overhead for swap count
blow up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design will impose fewer than 3 words
of memory overhead. However, as noted above, this overhead is only incurred
for actively used swap entries, whereas in the current design the overhead
is static (including the swap cgroup array, for example).
The primary victim of this overhead will be zram users. However, as
zswap now no longer takes up disk space, zram users can consider
switching to zswap (which, as a bonus, has a lot of useful features
out of the box, such as cgroup tracking, dynamic zswap pool sizing,
LRU-ordering writeback, etc.).
For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB
So even in the worst case scenario for virtual swap, i.e. when we
somehow have an oracle to correctly size the swapfile for the zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)
In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
were previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.
Doing the same math for the disk swap, which is the worst case for
virtual swap in terms of swap backends:
0% usage, or 0 entries: 0.00 MB
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB
25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB
50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB
75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB
100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB
The added overhead is 170 MB, which is 0.5% of the total swapfile size,
again in the worst case when we have a sizing oracle.
Please see the attached patches for more implementation details.
III. Usage and Benchmarking
This patch series introduces no new syscalls or userspace APIs. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.
To measure the performance of the new implementation, I have run the
following benchmarks:
1. Kernel building: 52 workers (one per processor), memory.max = 3G.
Using zswap as the backend:
Baseline:
real: mean: 185.2s, stdev: 0.93s
sys: mean: 683.7s, stdev: 33.77s
Vswap:
real: mean: 184.88s, stdev: 0.57s
sys: mean: 675.14s, stdev: 32.8s
We actually see a slight improvement in systime (by 1.5%) :) This is
likely because we no longer have to perform swap charging for zswap
entries, and the virtual swap allocator is simpler than that of physical
swap.
Using SSD swap as the backend:
Baseline:
real: mean: 200.3s, stdev: 2.33s
sys: mean: 489.88s, stdev: 9.62s
Vswap:
real: mean: 201.47s, stdev: 2.98s
sys: mean: 487.36s, stdev: 5.53s
The performance is neck and neck.
IV. Future Use Cases
While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:
* Multi-tier swapping (as mentioned in [5]), with transparent
transferring (promotion/demotion) of pages across tiers (see [8] and
[9]). Similar to swapoff, with the old design we would need to
perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
Huang in [6]).
* Mixed backing THP swapin (see [7]): Once you have pinned down the
backing store of THPs, you can dispatch each range of subpages
to the appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
(see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
physical swap space for pages when they enter the zswap pool, giving
the kernel no flexibility at writeback time. With the virtual swap
implementation, the backends are decoupled, and physical swap space
is allocated on-demand at writeback time, at which point we can make
much smarter decisions: we can batch multiple zswap writeback
operations into a single IO request, allocating contiguous physical
swap slots for that request. We can even perform compressed writeback
(i.e. writing these pages without decompressing them) (see [12]).
V. References
[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
Nhat Pham (20):
mm/swap: decouple swap cache from physical swap infrastructure
swap: rearrange the swap header file
mm: swap: add an abstract API for locking out swapoff
zswap: add new helpers for zswap entry operations
mm/swap: add a new function to check if a swap entry is in swap
cached.
mm: swap: add a separate type for physical swap slots
mm: create scaffolds for the new virtual swap implementation
zswap: prepare zswap for swap virtualization
mm: swap: allocate a virtual swap slot for each swapped out page
swap: move swap cache to virtual swap descriptor
zswap: move zswap entry management to the virtual swap descriptor
swap: implement the swap_cgroup API using virtual swap
swap: manage swap entry lifecycle at the virtual swap layer
mm: swap: decouple virtual swap slot from backing store
zswap: do not start zswap shrinker if there is no physical swap slots
swap: do not unnecesarily pin readahead swap entries
swapfile: remove zeromap bitmap
memcg: swap: only charge physical swap slots
swap: simplify swapoff using virtual swap
swapfile: replace the swap map with bitmaps
Documentation/mm/swap-table.rst | 69 --
MAINTAINERS | 2 +
include/linux/cpuhotplug.h | 1 +
include/linux/mm_types.h | 16 +
include/linux/shmem_fs.h | 7 +-
include/linux/swap.h | 135 ++-
include/linux/swap_cgroup.h | 13 -
include/linux/swapops.h | 25 +
include/linux/zswap.h | 17 +-
kernel/power/swap.c | 6 +-
mm/Makefile | 5 +-
mm/huge_memory.c | 11 +-
mm/internal.h | 12 +-
mm/memcontrol-v1.c | 6 +
mm/memcontrol.c | 142 ++-
mm/memory.c | 101 +-
mm/migrate.c | 13 +-
mm/mincore.c | 15 +-
mm/page_io.c | 83 +-
mm/shmem.c | 215 +---
mm/swap.h | 157 +--
mm/swap_cgroup.c | 172 ---
mm/swap_state.c | 306 +----
mm/swap_table.h | 78 +-
mm/swapfile.c | 1518 ++++-------------------
mm/userfaultfd.c | 18 +-
mm/vmscan.c | 28 +-
mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
mm/zswap.c | 142 +--
29 files changed, 2853 insertions(+), 2485 deletions(-)
delete mode 100644 Documentation/mm/swap-table.rst
delete mode 100644 mm/swap_cgroup.c
create mode 100644 mm/vswap.c
base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (19 preceding siblings ...)
2026-02-08 21:58 ` [PATCH v3 20/20] swapfile: replace the swap map with bitmaps Nhat Pham
@ 2026-02-08 22:51 ` Nhat Pham
2026-02-12 12:23 ` David Hildenbrand (Arm)
2026-02-10 15:45 ` [syzbot ci] " syzbot ci
21 siblings, 1 reply; 52+ messages in thread
From: Nhat Pham @ 2026-02-08 22:51 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
On Sun, Feb 8, 2026 at 1:58 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Changelog:
> * RFC v2 -> v3:
> * Implement a cluster-based allocation algorithm for virtual swap
> slots, inspired by Kairui Song and Chris Li's implementation, as
> well as Johannes Weiner's suggestions. This eliminates the lock
> contention issues on the virtual swap layer.
> * Re-use swap table for the reverse mapping.
> * Remove CONFIG_VIRTUAL_SWAP.
> * Reducing the size of the swap descriptor from 48 bytes to 24
> bytes, i.e. another 50% reduction in memory overhead from v2.
> * Remove swap cache and zswap tree and use the swap descriptor
> for this.
> * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
> (one for allocated slots, and one for bad slots).
> * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
> * Update cover letter to include new benchmark results and discussion
> on overhead in various cases.
> * RFC v1 -> RFC v2:
> * Use a single atomic type (swap_refs) for reference counting
> purposes. This brings the size of the swap descriptor from 64 B
> down to 48 B (25% reduction). Suggested by Yosry Ahmed.
> * Zeromap bitmap is removed in the virtual swap implementation.
> This saves one bit per physical swapfile slot.
> * Rearrange the patches and the code change to make things more
> reviewable. Suggested by Johannes Weiner.
> * Update the cover letter a bit.
>
> This patch series implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show
> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
>
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments where no disk space can be afforded for swap (such as
> mobile and embedded devices), users cannot adopt zswap, and are forced
> to use zram. This is confusing for users, and creates extra burdens
> for developers, having to develop and maintain similar features for
> two separate swap backends (writeback, cgroup charging, THP support,
> etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
> we have swapfiles on the order of tens to hundreds of GBs, which are
> mostly unused and only exist to enable zswap usage and zero-filled
> pages swap optimizations.
> * Tying zswap (and more generally, other in-memory swap backends) to
> the current physical swapfile infrastructure makes zswap implicitly
> statically sized. This does not make sense, as unlike disk swap, in
> which we consume a limited resource (disk space or swapfile space) to
> save another resource (memory), zswap consumes the same resource it is
> saving (memory). The more we zswap, the more memory we have available,
> not less. We are not rationing a limited resource when we limit
> the size of the zswap pool, but rather we are capping the resource
> (memory) saving potential of zswap. Under memory pressure, using
> more zswap is almost always better than the alternative (disk IOs, or
> even worse, OOMs), and dynamically sizing the zswap pool on demand
> allows the system to flexibly respond to these precarious scenarios.
> * Operationally, statically provisioning the swapfile for zswap poses
> significant challenges, because the sysadmin has to prescribe how
> much swap is needed a priori, for each combination of
> (memory size x disk space x workload usage). It is even more
> complicated when we take into account the variance of memory
> compression, which changes the reclaim dynamics (and as a result,
> swap space size requirement). The problem is further exacerbated for
> users who rely on swap utilization (and exhaustion) as an OOM signal.
>
> All of these factors make it very difficult to configure the swapfile
> for zswap: too small of a swapfile and we risk preventable OOMs and
> limit the memory saving potentials of zswap; too big of a swapfile
> and we waste disk space and memory due to swap metadata overhead.
> This dilemma becomes more drastic in high memory systems, which can
> have up to TBs worth of memory.
>
> Past attempts to decouple disk and compressed swap backends, namely the
> ghost swapfile approach (see [13]), as well as the alternative
> compressed swap backend zram, have mainly focused on eliminating the
> disk space usage of compressed backends. We want a solution that not
> only tackles that same problem, but also achieves the dynamicization of
> swap space to maximize the memory saving potentials while reducing
> operational and static memory overhead.
>
> Finally, any swap redesign should support efficient backend transfer,
> i.e. without having to perform the expensive page table walk to
> update all the PTEs that refer to the swap entry:
> * The main motivation for this requirement is zswap writeback. To quote
> Johannes (from [14]): "Combining compression with disk swap is
> extremely powerful, because it dramatically reduces the worst aspects
> of both: it reduces the memory footprint of compression by shedding
> the coldest data to disk; it reduces the IO latencies and flash wear
> of disk swap through the writeback cache. In practice, this reduces
> *average event rates of the entire reclaim/paging/IO stack*."
> * Another motivation is to simplify swapoff, which is both complicated
> and expensive in the current design, precisely because we are storing
> an encoding of the backend positional information in the page table,
> which means a full page table walk is required to remove these references.
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated,
> per-swap-entry descriptor:
>
> struct swp_desc {
> union {
> swp_slot_t slot; /* 0 8 */
> struct zswap_entry * zswap_entry; /* 0 8 */
> }; /* 0 8 */
> union {
> struct folio * swap_cache; /* 8 8 */
> void * shadow; /* 8 8 */
> }; /* 8 8 */
> unsigned int swap_count; /* 16 4 */
> unsigned short memcgid:16; /* 20: 0 2 */
> bool in_swapcache:1; /* 22: 0 1 */
>
> /* Bitfield combined with previous fields */
>
> enum swap_type type:2; /* 20:17 4 */
>
> /* size: 24, cachelines: 1, members: 6 */
> /* bit_padding: 13 bits */
> /* last cacheline: 24 bytes */
> };
>
> (output from pahole).
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> simply associate the virtual swap slot with one of the supported
> backends: a zswap entry, a zero-filled swap page, a slot on the
> swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in and
> have the virtual swap slot point to the page instead of the on-disk
> physical swap slot. No need to perform any page table walking.
>
> The size of the virtual swap descriptor is 24 bytes. Note that this is
> not all "new" overhead, as the swap descriptor will replace:
> * the swap_cgroup arrays (one per swap type) in the old design, which
> is a massive source of static memory overhead. With the new design,
> it is only allocated for used clusters.
> * the swap tables, which hold the swap cache and workingset shadows.
> * the zeromap bitmap, which is a bitmap of physical swap slots to
> indicate whether the swapped out page is zero-filled or not.
> * a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> one for allocated slots, and one for bad slots, representing 3 possible
> states of a slot on the swapfile: allocated, free, and bad.
> * the zswap tree.
>
> So, in terms of additional memory overhead:
> * For zswap entries, the added memory overhead is rather minimal. The
> new indirection pointer neatly replaces the existing zswap tree.
> We really only incur less than one word of overhead for swap count
> blow up (since we no longer use swap continuation) and the swap type.
> * For physical swap entries, the new design will impose fewer than 3 words
> of memory overhead. However, as noted above, this overhead is only incurred
> for actively used swap entries, whereas in the current design the overhead
> is static (including the swap cgroup array, for example).
>
> The primary victim of this overhead will be zram users. However, as
> zswap now no longer takes up disk space, zram users can consider
> switching to zswap (which, as a bonus, has a lot of useful features
> out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> LRU-ordering writeback, etc.).
>
> For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> 8,388,608 swap entries), and we use zswap.
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 0.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 48.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 96.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 121.00 MB
> * Vswap total overhead: 144.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 153.00 MB
> * Vswap total overhead: 193.00 MB
>
> So even in the worst case scenario for virtual swap, i.e. when we
> somehow have an oracle to correctly size the swapfile for the zswap
> pool to 32 GB, the added overhead is only 40 MB, which is a mere
> 0.12% of the total swapfile :)
>
> In practice, the overhead will be closer to the 50-75% usage case, as
> systems tend to leave swap headroom for pathological events or sudden
> spikes in memory requirements. The added overhead in these cases is
> practically negligible. And in deployments where swapfiles for zswap
> were previously sparsely used, switching over to virtual swap will
> actually reduce memory overhead.
>
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170 MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
>
> Please see the attached patches for more implementation details.
>
>
> III. Usage and Benchmarking
>
> This patch series introduces no new syscalls or userspace APIs. Existing
> userspace setups will work as-is, except we no longer have to create a
> swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> longer tied to physical swap. The zswap pool will be automatically and
> dynamically sized based on memory usage and reclaim dynamics.
>
> To measure the performance of the new implementation, I have run the
> following benchmarks:
>
> 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
>
> Using zswap as the backend:
>
> Baseline:
> real: mean: 185.2s, stdev: 0.93s
> sys: mean: 683.7s, stdev: 33.77s
>
> Vswap:
> real: mean: 184.88s, stdev: 0.57s
> sys: mean: 675.14s, stdev: 32.8s
>
> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and the virtual swap allocator is simpler than that of physical
> swap.
>
> Using SSD swap as the backend:
>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck and neck.
>
>
> IV. Future Use Cases
>
> While the patch series focuses on two applications (decoupling swap
> backends and swapoff optimization/simplification), this new,
> future-proof design also allows us to implement new swap features more
> easily and efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
> transferring (promotion/demotion) of pages across tiers (see [8] and
> [9]). Similar to swapoff, with the old design we would need to
> perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
> backing store of THPs, you can dispatch each range of subpages
> to the appropriate backend swapin handler.
> * Swapping a folio out with discontiguous physical swap slots
> (see [10]).
> * Zswap writeback optimization: The current architecture pre-reserves
> physical swap space for pages when they enter the zswap pool, giving
> the kernel no flexibility at writeback time. With the virtual swap
> implementation, the backends are decoupled, and physical swap space
> is allocated on-demand at writeback time, at which point we can make
> much smarter decisions: we can batch multiple zswap writeback
> operations into a single IO request, allocating contiguous physical
> swap slots for that request. We can even perform compressed writeback
> (i.e. writing these pages without decompressing them) (see [12]).
>
>
> V. References
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
>
> Nhat Pham (20):
> mm/swap: decouple swap cache from physical swap infrastructure
> swap: rearrange the swap header file
> mm: swap: add an abstract API for locking out swapoff
> zswap: add new helpers for zswap entry operations
> mm/swap: add a new function to check if a swap entry is in swap
> cached.
> mm: swap: add a separate type for physical swap slots
> mm: create scaffolds for the new virtual swap implementation
> zswap: prepare zswap for swap virtualization
> mm: swap: allocate a virtual swap slot for each swapped out page
> swap: move swap cache to virtual swap descriptor
> zswap: move zswap entry management to the virtual swap descriptor
> swap: implement the swap_cgroup API using virtual swap
> swap: manage swap entry lifecycle at the virtual swap layer
> mm: swap: decouple virtual swap slot from backing store
> zswap: do not start zswap shrinker if there is no physical swap slots
> swap: do not unnecesarily pin readahead swap entries
> swapfile: remove zeromap bitmap
> memcg: swap: only charge physical swap slots
> swap: simplify swapoff using virtual swap
> swapfile: replace the swap map with bitmaps
>
> Documentation/mm/swap-table.rst | 69 --
> MAINTAINERS | 2 +
> include/linux/cpuhotplug.h | 1 +
> include/linux/mm_types.h | 16 +
> include/linux/shmem_fs.h | 7 +-
> include/linux/swap.h | 135 ++-
> include/linux/swap_cgroup.h | 13 -
> include/linux/swapops.h | 25 +
> include/linux/zswap.h | 17 +-
> kernel/power/swap.c | 6 +-
> mm/Makefile | 5 +-
> mm/huge_memory.c | 11 +-
> mm/internal.h | 12 +-
> mm/memcontrol-v1.c | 6 +
> mm/memcontrol.c | 142 ++-
> mm/memory.c | 101 +-
> mm/migrate.c | 13 +-
> mm/mincore.c | 15 +-
> mm/page_io.c | 83 +-
> mm/shmem.c | 215 +---
> mm/swap.h | 157 +--
> mm/swap_cgroup.c | 172 ---
> mm/swap_state.c | 306 +----
> mm/swap_table.h | 78 +-
> mm/swapfile.c | 1518 ++++-------------------
> mm/userfaultfd.c | 18 +-
> mm/vmscan.c | 28 +-
> mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
> mm/zswap.c | 142 +--
> 29 files changed, 2853 insertions(+), 2485 deletions(-)
> delete mode 100644 Documentation/mm/swap-table.rst
> delete mode 100644 mm/swap_cgroup.c
> create mode 100644 mm/vswap.c
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> --
> 2.47.3
Weirdly, it seems like the cover letter (and only the cover letter) is
not being delivered...
I'm trying to figure out what's going on :( My apologies for the
inconvenience...
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 18/20] memcg: swap: only charge physical swap slots
2026-02-08 21:58 ` [PATCH v3 18/20] memcg: swap: only charge physical swap slots Nhat Pham
@ 2026-02-09 2:01 ` kernel test robot
2026-02-09 2:12 ` kernel test robot
1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-09 2:01 UTC (permalink / raw)
To: Nhat Pham, linux-mm
Cc: llvm, oe-kbuild-all, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy
Hi Nhat,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19]
[cannot apply to akpm-mm/mm-everything tj-cgroup/for-next tip/smp/core next-20260205]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/swap-rearrange-the-swap-header-file/20260209-065842
base: linus/master
patch link: https://lore.kernel.org/r/20260208215839.87595-19-nphamcs%40gmail.com
patch subject: [PATCH v3 18/20] memcg: swap: only charge physical swap slots
config: hexagon-randconfig-001-20260209 (https://download.01.org/0day-ci/archive/20260209/202602090941.opY2jzUD-lkp@intel.com/config)
compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260209/202602090941.opY2jzUD-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602090941.opY2jzUD-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/memcontrol-v1.c:688:3: error: call to undeclared function 'mem_cgroup_clear_swap'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
mem_cgroup_clear_swap(entry, nr_pages);
^
mm/memcontrol-v1.c:688:3: note: did you mean 'mem_cgroup_uncharge_swap'?
include/linux/swap.h:658:20: note: 'mem_cgroup_uncharge_swap' declared here
static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
^
1 error generated.
vim +/mem_cgroup_clear_swap +688 mm/memcontrol-v1.c
651
652 /*
653 * memcg1_swapin - uncharge swap slot
654 * @entry: the first swap entry for which the pages are charged
655 * @nr_pages: number of pages which will be uncharged
656 *
657 * Call this function after successfully adding the charged page to swapcache.
658 *
659 * Note: This function assumes the page for which swap slot is being uncharged
660 * is order 0 page.
661 */
662 void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages)
663 {
664 /*
665 * Cgroup1's unified memory+swap counter has been charged with the
666 * new swapcache page, finish the transfer by uncharging the swap
667 * slot. The swap slot would also get uncharged when it dies, but
668 * it can stick around indefinitely and we'd count the page twice
669 * the entire time.
670 *
671 * Cgroup2 has separate resource counters for memory and swap,
672 * so this is a non-issue here. Memory and swap charge lifetimes
673 * correspond 1:1 to page and swap slot lifetimes: we charge the
674 * page to memory here, and uncharge swap when the slot is freed.
675 */
676 if (do_memsw_account()) {
677 /*
678 * The swap entry might not get freed for a long time,
679 * let's not wait for it. The page already received a
680 * memory+swap charge, drop the swap entry duplicate.
681 */
682 mem_cgroup_uncharge_swap(entry, nr_pages);
683
684 /*
685 * Clear the cgroup association now to prevent double memsw
686 * uncharging when the backends are released later.
687 */
> 688 mem_cgroup_clear_swap(entry, nr_pages);
689 }
690 }
691
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 18/20] memcg: swap: only charge physical swap slots
2026-02-08 21:58 ` [PATCH v3 18/20] memcg: swap: only charge physical swap slots Nhat Pham
2026-02-09 2:01 ` kernel test robot
@ 2026-02-09 2:12 ` kernel test robot
1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-09 2:12 UTC (permalink / raw)
To: Nhat Pham, linux-mm
Cc: llvm, oe-kbuild-all, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy
Hi Nhat,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19]
[cannot apply to akpm-mm/mm-everything tj-cgroup/for-next tip/smp/core next-20260205]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/swap-rearrange-the-swap-header-file/20260209-065842
base: linus/master
patch link: https://lore.kernel.org/r/20260208215839.87595-19-nphamcs%40gmail.com
patch subject: [PATCH v3 18/20] memcg: swap: only charge physical swap slots
config: sparc64-defconfig (https://download.01.org/0day-ci/archive/20260209/202602091006.0jXoavPW-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260209/202602091006.0jXoavPW-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602091006.0jXoavPW-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/vswap.c:637:2: error: call to undeclared function 'mem_cgroup_clear_swap'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
637 | mem_cgroup_clear_swap(entry, 1);
| ^
mm/vswap.c:637:2: note: did you mean 'mem_cgroup_uncharge_swap'?
include/linux/swap.h:658:20: note: 'mem_cgroup_uncharge_swap' declared here
658 | static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
| ^
>> mm/vswap.c:661:2: error: call to undeclared function 'mem_cgroup_record_swap'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
661 | mem_cgroup_record_swap(folio, entry);
| ^
mm/vswap.c:661:2: note: did you mean 'mem_cgroup_uncharge_swap'?
include/linux/swap.h:658:20: note: 'mem_cgroup_uncharge_swap' declared here
658 | static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
| ^
2 errors generated.
vim +/mem_cgroup_clear_swap +637 mm/vswap.c
528
529 /*
530 * Caller needs to handle races with other operations themselves.
531 *
532 * Specifically, this function is safe to be called in contexts where the swap
533 * entry has been added to the swap cache and the associated folio is locked.
534 * We cannot race with other accessors, and the swap entry is guaranteed to be
535 * valid the whole time (since swap cache implies one refcount).
536 *
537 * We cannot assume that the backends will be of the same type,
538 * contiguous, etc. We might have a large folio coalesced from subpages with
539 * mixed backend, which is only rectified when it is reclaimed.
540 */
541 static void release_backing(swp_entry_t entry, int nr)
542 {
543 struct vswap_cluster *cluster = NULL;
544 struct swp_desc *desc;
545 unsigned long flush_nr, phys_swap_start = 0, phys_swap_end = 0;
546 unsigned long phys_swap_released = 0;
547 unsigned int phys_swap_type = 0;
548 bool need_flushing_phys_swap = false;
549 swp_slot_t flush_slot;
550 int i;
551
552 VM_WARN_ON(!entry.val);
553
554 rcu_read_lock();
555 for (i = 0; i < nr; i++) {
556 desc = vswap_iter(&cluster, entry.val + i);
557 VM_WARN_ON(!desc);
558
559 /*
560 * We batch contiguous physical swap slots for more efficient
561 * freeing.
562 */
563 if (phys_swap_start != phys_swap_end &&
564 (desc->type != VSWAP_SWAPFILE ||
565 swp_slot_type(desc->slot) != phys_swap_type ||
566 swp_slot_offset(desc->slot) != phys_swap_end)) {
567 need_flushing_phys_swap = true;
568 flush_slot = swp_slot(phys_swap_type, phys_swap_start);
569 flush_nr = phys_swap_end - phys_swap_start;
570 phys_swap_start = phys_swap_end = 0;
571 }
572
573 if (desc->type == VSWAP_ZSWAP && desc->zswap_entry) {
574 zswap_entry_free(desc->zswap_entry);
575 } else if (desc->type == VSWAP_SWAPFILE) {
576 phys_swap_released++;
577 if (!phys_swap_start) {
578 /* start a new contiguous range of phys swap */
579 phys_swap_start = swp_slot_offset(desc->slot);
580 phys_swap_end = phys_swap_start + 1;
581 phys_swap_type = swp_slot_type(desc->slot);
582 } else {
583 /* extend the current contiguous range of phys swap */
584 phys_swap_end++;
585 }
586 }
587
588 desc->slot.val = 0;
589
590 if (need_flushing_phys_swap) {
591 spin_unlock(&cluster->lock);
592 cluster = NULL;
593 swap_slot_free_nr(flush_slot, flush_nr);
594 need_flushing_phys_swap = false;
595 }
596 }
597 if (cluster)
598 spin_unlock(&cluster->lock);
599 rcu_read_unlock();
600
601 /* Flush any remaining physical swap range */
602 if (phys_swap_start) {
603 flush_slot = swp_slot(phys_swap_type, phys_swap_start);
604 flush_nr = phys_swap_end - phys_swap_start;
605 swap_slot_free_nr(flush_slot, flush_nr);
606 }
607
608 if (phys_swap_released)
609 mem_cgroup_uncharge_swap(entry, phys_swap_released);
610 }
611
612 /*
613 * Entered with the cluster locked, but might unlock the cluster.
614 * This is because several operations, such as releasing physical swap slots
615 * (i.e swap_slot_free_nr()) require the cluster to be unlocked to avoid
616 * deadlocks.
617 *
618 * This is safe, because:
619 *
620 * 1. The swap entry to be freed has refcnt (swap count and swapcache pin)
621 * down to 0, so no one can change its internal state
622 *
623 * 2. The swap entry to be freed still holds a refcnt to the cluster, keeping
624 * the cluster itself valid.
625 *
626 * We will exit the function with the cluster re-locked.
627 */
628 static void vswap_free(struct vswap_cluster *cluster, struct swp_desc *desc,
629 swp_entry_t entry)
630 {
631 /* Clear shadow if present */
632 if (xa_is_value(desc->shadow))
633 desc->shadow = NULL;
634 spin_unlock(&cluster->lock);
635
636 release_backing(entry, 1);
> 637 mem_cgroup_clear_swap(entry, 1);
638
639 /* erase forward mapping and release the virtual slot for reallocation */
640 spin_lock(&cluster->lock);
641 release_vswap_slot(cluster, entry.val);
642 }
643
644 /**
645 * folio_alloc_swap - allocate virtual swap space for a folio.
646 * @folio: the folio.
647 *
648 * Return: 0, if the allocation succeeded, -ENOMEM, if the allocation failed.
649 */
650 int folio_alloc_swap(struct folio *folio)
651 {
652 swp_entry_t entry;
653
654 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
655 VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
656
657 entry = vswap_alloc(folio);
658 if (!entry.val)
659 return -ENOMEM;
660
> 661 mem_cgroup_record_swap(folio, entry);
662 swap_cache_add_folio(folio, entry, NULL);
663
664 return 0;
665 }
666
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
` (2 preceding siblings ...)
2026-02-08 22:39 ` Nhat Pham
@ 2026-02-09 2:22 ` kernel test robot
3 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-09 2:22 UTC (permalink / raw)
To: Nhat Pham, linux-mm
Cc: llvm, oe-kbuild-all, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy
Hi Nhat,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19]
[cannot apply to akpm-mm/mm-everything tj-cgroup/for-next tip/smp/core next-20260205]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/swap-rearrange-the-swap-header-file/20260209-065842
base: linus/master
patch link: https://lore.kernel.org/r/20260208215839.87595-2-nphamcs%40gmail.com
patch subject: [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20260209/202602091044.soVrWeDA-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260209/202602091044.soVrWeDA-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602091044.soVrWeDA-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/vmscan.c:715:3: error: call to undeclared function 'swap_cache_lock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
715 | swap_cache_lock_irq();
| ^
>> mm/vmscan.c:762:3: error: call to undeclared function 'swap_cache_unlock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
762 | swap_cache_unlock_irq();
| ^
mm/vmscan.c:762:3: note: did you mean 'swap_cluster_unlock_irq'?
mm/swap.h:350:20: note: 'swap_cluster_unlock_irq' declared here
350 | static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
| ^
mm/vmscan.c:801:3: error: call to undeclared function 'swap_cache_unlock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
801 | swap_cache_unlock_irq();
| ^
3 errors generated.
--
>> mm/shmem.c:2168:2: error: call to undeclared function 'swap_cache_lock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2168 | swap_cache_lock_irq();
| ^
>> mm/shmem.c:2173:2: error: call to undeclared function 'swap_cache_unlock_irq'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2173 | swap_cache_unlock_irq();
| ^
2 errors generated.
vim +/swap_cache_lock_irq +715 mm/vmscan.c
700
701 /*
702 * Same as remove_mapping, but if the folio is removed from the mapping, it
703 * gets returned with a refcount of 0.
704 */
705 static int __remove_mapping(struct address_space *mapping, struct folio *folio,
706 bool reclaimed, struct mem_cgroup *target_memcg)
707 {
708 int refcount;
709 void *shadow = NULL;
710
711 BUG_ON(!folio_test_locked(folio));
712 BUG_ON(mapping != folio_mapping(folio));
713
714 if (folio_test_swapcache(folio)) {
> 715 swap_cache_lock_irq();
716 } else {
717 spin_lock(&mapping->host->i_lock);
718 xa_lock_irq(&mapping->i_pages);
719 }
720
721 /*
722 * The non racy check for a busy folio.
723 *
724 * Must be careful with the order of the tests. When someone has
725 * a ref to the folio, it may be possible that they dirty it then
726 * drop the reference. So if the dirty flag is tested before the
727 * refcount here, then the following race may occur:
728 *
729 * get_user_pages(&page);
730 * [user mapping goes away]
731 * write_to(page);
732 * !folio_test_dirty(folio) [good]
733 * folio_set_dirty(folio);
734 * folio_put(folio);
735 * !refcount(folio) [good, discard it]
736 *
737 * [oops, our write_to data is lost]
738 *
739 * Reversing the order of the tests ensures such a situation cannot
740 * escape unnoticed. The smp_rmb is needed to ensure the folio->flags
741 * load is not satisfied before that of folio->_refcount.
742 *
743 * Note that if the dirty flag is always set via folio_mark_dirty,
744 * and thus under the i_pages lock, then this ordering is not required.
745 */
746 refcount = 1 + folio_nr_pages(folio);
747 if (!folio_ref_freeze(folio, refcount))
748 goto cannot_free;
749 /* note: atomic_cmpxchg in folio_ref_freeze provides the smp_rmb */
750 if (unlikely(folio_test_dirty(folio))) {
751 folio_ref_unfreeze(folio, refcount);
752 goto cannot_free;
753 }
754
755 if (folio_test_swapcache(folio)) {
756 swp_entry_t swap = folio->swap;
757
758 if (reclaimed && !mapping_exiting(mapping))
759 shadow = workingset_eviction(folio, target_memcg);
760 __swap_cache_del_folio(folio, swap, shadow);
761 memcg1_swapout(folio, swap);
> 762 swap_cache_unlock_irq();
763 put_swap_folio(folio, swap);
764 } else {
765 void (*free_folio)(struct folio *);
766
767 free_folio = mapping->a_ops->free_folio;
768 /*
769 * Remember a shadow entry for reclaimed file cache in
770 * order to detect refaults, thus thrashing, later on.
771 *
772 * But don't store shadows in an address space that is
773 * already exiting. This is not just an optimization,
774 * inode reclaim needs to empty out the radix tree or
775 * the nodes are lost. Don't plant shadows behind its
776 * back.
777 *
778 * We also don't store shadows for DAX mappings because the
779 * only page cache folios found in these are zero pages
780 * covering holes, and because we don't want to mix DAX
781 * exceptional entries and shadow exceptional entries in the
782 * same address_space.
783 */
784 if (reclaimed && folio_is_file_lru(folio) &&
785 !mapping_exiting(mapping) && !dax_mapping(mapping))
786 shadow = workingset_eviction(folio, target_memcg);
787 __filemap_remove_folio(folio, shadow);
788 xa_unlock_irq(&mapping->i_pages);
789 if (mapping_shrinkable(mapping))
790 inode_lru_list_add(mapping->host);
791 spin_unlock(&mapping->host->i_lock);
792
793 if (free_folio)
794 free_folio(folio);
795 }
796
797 return 1;
798
799 cannot_free:
800 if (folio_test_swapcache(folio)) {
801 swap_cache_unlock_irq();
802 } else {
803 xa_unlock_irq(&mapping->i_pages);
804 spin_unlock(&mapping->host->i_lock);
805 }
806 return 0;
807 }
808
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
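These errors point the same way as the memcg stubs above: x86_64-allnoconfig
builds with CONFIG_SWAP disabled, yet mm/vmscan.c and mm/shmem.c call the new
helpers unconditionally, so mm/swap.h presumably needs !CONFIG_SWAP no-op
stubs. A minimal sketch, with the prototypes inferred from the call sites in
the report (an assumption, not the series' code):

	#ifdef CONFIG_SWAP
	void swap_cache_lock_irq(void);
	void swap_cache_unlock_irq(void);
	#else
	static inline void swap_cache_lock_irq(void) { }
	static inline void swap_cache_unlock_irq(void) { }
	#endif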
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-08 22:31 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
@ 2026-02-09 12:20 ` Chris Li
2026-02-10 2:36 ` Johannes Weiner
2026-02-10 18:00 ` Nhat Pham
0 siblings, 2 replies; 52+ messages in thread
From: Chris Li @ 2026-02-09 12:20 UTC (permalink / raw)
To: Nhat Pham
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
huang.ying.caritas, ryan.roberts, shikemeng, viro, baohua, bhe,
osalvador, christophe.leroy, pavel, linux-mm, kernel-team,
linux-kernel, cgroups, linux-pm, peterx, riel, joshua.hahnjy,
npache, gourry, axelrasmussen, yuanchu, weixugc, rafael, jannh,
pfalcato, zhengqi.arch
On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> My sincerest apologies - it seems like the cover letter (and just the
> cover letter) fails to be sent out, for some reason. I'm trying to figure
> out what happened - it works when I send the entire patch series to
> myself...
>
> Anyway, resending this (in-reply-to patch 1 of the series):
For the record I did receive your original V3 cover letter from the
linux-mm mailing list.
> Changelog:
> * RFC v2 -> v3:
> * Implement a cluster-based allocation algorithm for virtual swap
> slots, inspired by Kairui Song and Chris Li's implementation, as
> well as Johannes Weiner's suggestions. This eliminates the lock
> contention issues on the virtual swap layer.
> * Re-use swap table for the reverse mapping.
> * Remove CONFIG_VIRTUAL_SWAP.
> * Reducing the size of the swap descriptor from 48 bytes to 24
Is the per-swap-slot entry overhead 24 bytes in your implementation?
The current swap overhead is 3 bytes static + 8 bytes dynamic per slot;
your 24 bytes dynamic is a big jump. You can argue that 8 -> 24 is not a
big jump, but it is an unnecessary price compared to the alternative,
which is 8 dynamic + 4 (optional redirect).
> bytes, i.e another 50% reduction in memory overhead from v2.
> * Remove swap cache and zswap tree and use the swap descriptor
> for this.
> * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
> (one for allocated slots, and one for bad slots).
> * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
My git log shows 7d0a66e4bb9081d75c82ec4957c50034cb0ea449 is tag "v6.18".
> * Update cover letter to include new benchmark results and discussion
> on overhead in various cases.
> * RFC v1 -> RFC v2:
> * Use a single atomic type (swap_refs) for reference counting
> purpose. This brings the size of the swap descriptor from 64 B
> down to 48 B (25% reduction). Suggested by Yosry Ahmed.
> * Zeromap bitmap is removed in the virtual swap implementation.
> This saves one bit per physical swapfile slot.
> * Rearrange the patches and the code change to make things more
> reviewable. Suggested by Johannes Weiner.
> * Update the cover letter a bit.
>
> This patch series implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show
Ah, you need to mention that in the first line to Andrew. Spell out
that this series is not for Andrew to consume in the MM tree. It can't
be anyway, because it does not apply to mm-unstable or mm-stable.
BTW, I have the following compile error with this series (Fedora 43).
The same config compiles fine on v6.19.
In file included from ./include/linux/local_lock.h:5,
                 from ./include/linux/mmzone.h:24,
                 from ./include/linux/gfp.h:7,
                 from ./include/linux/mm.h:7,
                 from mm/vswap.c:7:
mm/vswap.c: In function ‘vswap_cpu_dead’:
./include/linux/percpu-defs.h:221:45: error: initialization from pointer to non-enclosed address space
  221 |         const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;  \
      |                                             ^
./include/linux/local_lock_internal.h:105:40: note: in definition of macro ‘__local_lock_acquire’
  105 |                 __l = (local_lock_t *)(lock);                          \
      |                                        ^~~~
./include/linux/local_lock.h:17:41: note: in expansion of macro ‘__local_lock’
   17 | #define local_lock(lock)        __local_lock(this_cpu_ptr(lock))
      |                                 ^~~~~~~~~~~~
./include/linux/percpu-defs.h:245:9: note: in expansion of macro ‘__verify_pcpu_ptr’
  245 |         __verify_pcpu_ptr(ptr);                                        \
      |         ^~~~~~~~~~~~~~~~~
./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
  256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
      |                           ^~~~~~~~~~~
./include/linux/local_lock.h:17:54: note: in expansion of macro ‘this_cpu_ptr’
   17 | #define local_lock(lock)        __local_lock(this_cpu_ptr(lock))
      |                                              ^~~~~~~~~~~~
mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
 1518 |         local_lock(&percpu_cluster->lock);
      |         ^~~~~~~~~~
> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
>
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is arguably the central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
> mobile and embedded devices), users cannot adopt zswap, and are forced
> to use zram. This is confusing for users, and creates extra burdens
> for developers, having to develop and maintain similar features for
> two separate swap backends (writeback, cgroup charging, THP support,
> etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
> we have swapfile in the order of tens to hundreds of GBs, which are
> mostly unused and only exist to enable zswap usage and zero-filled
> pages swap optimizations.
> * Tying zswap (and more generally, other in-memory swap backends) to
> the current physical swapfile infrastructure makes zswap implicitly
> statically sized. This does not make sense, as unlike disk swap, in
> which we consume a limited resource (disk space or swapfile space) to
> save another resource (memory), zswap consumes the same resource it is
> saving (memory). The more we zswap, the more memory we have available,
> not less. We are not rationing a limited resource when we limit
> the size of the zswap pool, but rather we are capping the resource
> (memory) saving potential of zswap. Under memory pressure, using
> more zswap is almost always better than the alternative (disk IOs, or
> even worse, OOMs), and dynamically sizing the zswap pool on demand
> allows the system to flexibly respond to these precarious scenarios.
> * Operationally, statically provisioning the swapfile for zswap poses
> significant challenges, because the sysadmin has to prescribe how
> much swap is needed a priori, for each combination of
> (memory size x disk space x workload usage). It is even more
> complicated when we take into account the variance of memory
> compression, which changes the reclaim dynamics (and as a result,
> swap space size requirement). The problem is further exacerbated for
> users who rely on swap utilization (and exhaustion) as an OOM signal.
>
> All of these factors make it very difficult to configure the swapfile
> for zswap: too small of a swapfile and we risk preventable OOMs and
> limit the memory saving potentials of zswap; too big of a swapfile
> and we waste disk space and memory due to swap metadata overhead.
> This dilemma becomes more drastic in high memory systems, which can
> have up to TBs worth of memory.
>
> Past attempts to decouple disk and compressed swap backends, namely the
> ghost swapfile approach (see [13]), as well as the alternative
> compressed swap backend zram, have mainly focused on eliminating the
> disk space usage of compressed backends. We want a solution that not
> only tackles that same problem, but also achieves the dynamicization of
> swap space, to maximize the memory saving potential while reducing
> operational and static memory overhead.
>
> Finally, any swap redesign should support efficient backend transfer,
> i.e without having to perform the expensive page table walk to
> update all the PTEs that refer to the swap entry:
> * The main motivation for this requirement is zswap writeback. To quote
> Johannes (from [14]): "Combining compression with disk swap is
> extremely powerful, because it dramatically reduces the worst aspects
> of both: it reduces the memory footprint of compression by shedding
> the coldest data to disk; it reduces the IO latencies and flash wear
> of disk swap through the writeback cache. In practice, this reduces
> *average event rates of the entire reclaim/paging/IO stack*."
> * Another motivation is to simplify swapoff, which is both complicated
> and expensive in the current design, precisely because we are storing
> an encoding of the backend positional information in the page table,
> and thus requires a full page table walk to remove these references.
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated,
> per-swap-entry descriptor:
>
> struct swp_desc {
> union {
> swp_slot_t slot; /* 0 8 */
> struct zswap_entry * zswap_entry; /* 0 8 */
> }; /* 0 8 */
> union {
> struct folio * swap_cache; /* 8 8 */
> void * shadow; /* 8 8 */
> }; /* 8 8 */
> unsigned int swap_count; /* 16 4 */
> unsigned short memcgid:16; /* 20: 0 2 */
> bool in_swapcache:1; /* 22: 0 1 */
>
> /* Bitfield combined with previous fields */
>
> enum swap_type type:2; /* 20:17 4 */
>
> /* size: 24, cachelines: 1, members: 6 */
> /* bit_padding: 13 bits */
> /* last cacheline: 24 bytes */
> };
>
> (output from pahole).
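A minimal sketch of what "resolving" a virtual slot means with this
descriptor, reusing names that appear in code excerpts elsewhere in the
thread (VSWAP_SWAPFILE, VSWAP_ZSWAP, VSWAP_FOLIO, desc->slot,
desc->zswap_entry, desc->swap_cache); the dispatch helper and the functions
it calls are hypothetical illustrations, not the series' code:

	/* Sketch: dispatch a swapin on the descriptor's backend type. */
	static void resolve_backing(struct swp_desc *desc)
	{
		switch (desc->type) {
		case VSWAP_SWAPFILE:
			/* backed by a physical slot on a swapfile */
			read_from_swapfile(desc->slot);
			break;
		case VSWAP_ZSWAP:
			/* backed by a compressed object in the zswap pool */
			load_from_zswap(desc->zswap_entry);
			break;
		case VSWAP_FOLIO:
			/* still backed by an in-memory folio (swap cache) */
			use_folio(desc->swap_cache);
			break;
		default:
			/* e.g. a zero-filled page: no backing I/O needed */
			break;
		}
	}

The page tables and swap data structures only ever see the virtual entry;
only this layer knows (and may change) which arm of the union is live.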
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> simply associate the virtual swap slot with one of the supported
> backends: a zswap entry, a zero-filled swap page, a slot on the
> swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in and
> have the virtual swap slot point to the page instead of the on-disk
> physical swap slot. No need to perform any page table walking.
>
> The size of the virtual swap descriptor is 24 bytes. Note that this is
> not all "new" overhead, as the swap descriptor will replace:
> * the swap_cgroup arrays (one per swap type) in the old design, which
> is a massive source of static memory overhead. With the new design,
> it is only allocated for used clusters.
> * the swap tables, which holds the swap cache and workingset shadows.
> * the zeromap bitmap, which is a bitmap of physical swap slots to
> indicate whether the swapped out page is zero-filled or not.
> * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> one for allocated slots, and one for bad slots, representing 3 possible
> states of a slot on the swapfile: allocated, free, and bad.
> * the zswap tree.
>
> So, in terms of additional memory overhead:
> * For zswap entries, the added memory overhead is rather minimal. The
> new indirection pointer neatly replaces the existing zswap tree.
> We really only incur less than one word of overhead for swap count
> blow up (since we no longer use swap continuation) and the swap type.
> * For physical swap entries, the new design will impose fewer than 3 words
> of memory overhead. However, as noted above, this overhead is only for
> actively used swap entries, whereas in the current design the overhead is
> static (including the swap cgroup array for example).
>
> The primary victim of this overhead will be zram users. However, as
> zswap now no longer takes up disk space, zram users can consider
> switching to zswap (which, as a bonus, has a lot of useful features
> out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> LRU-ordering writeback, etc.).
>
> For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> 8,388,608 swap entries), and we use zswap.
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 0.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 48.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 96.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 121.00 MB
> * Vswap total overhead: 144.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 153.00 MB
> * Vswap total overhead: 193.00 MB
>
> So even in the worst case scenario for virtual swap, i.e when we
> somehow have an oracle to correctly size the swapfile for zswap
> pool to 32 GB, the added overhead is only 40 MB, which is a mere
> 0.12% of the total swapfile :)
>
> In practice, the overhead will be closer to the 50-75% usage case, as
> systems tend to leave swap headroom for pathological events or sudden
> spikes in memory requirements. The added overhead in these cases is
> practically negligible. And in deployments where swapfiles for zswap
> were previously sparsely used, switching over to virtual swap will
> actually reduce memory overhead.
>
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
>
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
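Sanity-checking these figures against the stated 24 B descriptor (the
breakdown of the remainder below is an inference, not stated in the series):
8,388,608 entries x 24 B = 192 MB, which lines up with the 193 MB zswap
total at 100% usage; the larger 259 MB disk total presumably also includes
the reverse-mapping swap table (on the order of 8 B per physical slot,
roughly 64 MB) plus the two per-slot bitmaps (roughly 2 MB).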
>
> Please see the attached patches for more implementation details.
>
>
> III. Usage and Benchmarking
>
> This patch series introduces no new syscalls or userspace API. Existing
> userspace setups will work as-is, except we no longer have to create a
> swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> longer tied to physical swap. The zswap pool will be automatically and
> dynamically sized based on memory usage and reclaim dynamics.
>
> To measure the performance of the new implementation, I have run the
> following benchmarks:
>
> 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
>
> Using zswap as the backend:
>
> Baseline:
> real: mean: 185.2s, stdev: 0.93s
> sys: mean: 683.7s, stdev: 33.77s
>
> Vswap:
> real: mean: 184.88s, stdev: 0.57s
> sys: mean: 675.14s, stdev: 32.8s
Can you show your user space time as well to complete the picture?
How many runs do you have for stdev 32.8s?
>
> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and the virtual swap allocator is simpler than that of physical
> swap.
>
> Using SSD swap as the backend:
Please include zram swap test data as well. Android heavily uses zram
for swapping.
>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck-to-neck.
I strongly suspect there is some performance difference that hasn't
been covered by your test yet. Need more confirmation from others on
the performance measurement. Swap testing is tricky: you want to push
the stress to barely within the OOM limit. Need more data.
Chris
>
>
> IV. Future Use Cases
>
> While the patch series focuses on two applications (decoupling swap
> backends and swapoff optimization/simplification), this new,
> future-proof design also allows us to implement new swap features more
> easily and efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
> transferring (promotion/demotion) of pages across tiers (see [8] and
> [9]). Similar to swapoff, with the old design we would need to
> perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
> backing store of THPs, then you can dispatch each range of subpages
> to appropriate backend swapin handler.
> * Swapping a folio out with discontiguous physical swap slots
> (see [10]).
> * Zswap writeback optimization: The current architecture pre-reserves
> physical swap space for pages when they enter the zswap pool, giving
> the kernel no flexibility at writeback time. With the virtual swap
> implementation, the backends are decoupled, and physical swap space
> is allocated on-demand at writeback time, at which point we can make
> much smarter decisions: we can batch multiple zswap writeback
> operations into a single IO request, allocating contiguous physical
> swap slots for that request. We can even perform compressed writeback
> (i.e writing these pages without decompressing them) (see [12]).
>
>
> V. References
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
>
> Nhat Pham (20):
> mm/swap: decouple swap cache from physical swap infrastructure
> swap: rearrange the swap header file
> mm: swap: add an abstract API for locking out swapoff
> zswap: add new helpers for zswap entry operations
> mm/swap: add a new function to check if a swap entry is in swap
> cached.
> mm: swap: add a separate type for physical swap slots
> mm: create scaffolds for the new virtual swap implementation
> zswap: prepare zswap for swap virtualization
> mm: swap: allocate a virtual swap slot for each swapped out page
> swap: move swap cache to virtual swap descriptor
> zswap: move zswap entry management to the virtual swap descriptor
> swap: implement the swap_cgroup API using virtual swap
> swap: manage swap entry lifecycle at the virtual swap layer
> mm: swap: decouple virtual swap slot from backing store
> zswap: do not start zswap shrinker if there is no physical swap slots
> swap: do not unnecesarily pin readahead swap entries
> swapfile: remove zeromap bitmap
> memcg: swap: only charge physical swap slots
> swap: simplify swapoff using virtual swap
> swapfile: replace the swap map with bitmaps
>
> Documentation/mm/swap-table.rst | 69 --
> MAINTAINERS | 2 +
> include/linux/cpuhotplug.h | 1 +
> include/linux/mm_types.h | 16 +
> include/linux/shmem_fs.h | 7 +-
> include/linux/swap.h | 135 ++-
> include/linux/swap_cgroup.h | 13 -
> include/linux/swapops.h | 25 +
> include/linux/zswap.h | 17 +-
> kernel/power/swap.c | 6 +-
> mm/Makefile | 5 +-
> mm/huge_memory.c | 11 +-
> mm/internal.h | 12 +-
> mm/memcontrol-v1.c | 6 +
> mm/memcontrol.c | 142 ++-
> mm/memory.c | 101 +-
> mm/migrate.c | 13 +-
> mm/mincore.c | 15 +-
> mm/page_io.c | 83 +-
> mm/shmem.c | 215 +---
> mm/swap.h | 157 +--
> mm/swap_cgroup.c | 172 ---
> mm/swap_state.c | 306 +----
> mm/swap_table.h | 78 +-
> mm/swapfile.c | 1518 ++++-------------------
> mm/userfaultfd.c | 18 +-
> mm/vmscan.c | 28 +-
> mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
> mm/zswap.c | 142 +--
> 29 files changed, 2853 insertions(+), 2485 deletions(-)
> delete mode 100644 Documentation/mm/swap-table.rst
> delete mode 100644 mm/swap_cgroup.c
> create mode 100644 mm/vswap.c
>
>
> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> --
> 2.47.3
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page
2026-02-08 21:58 ` [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
@ 2026-02-09 17:12 ` kernel test robot
2026-02-11 13:42 ` kernel test robot
1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-09 17:12 UTC (permalink / raw)
To: Nhat Pham, linux-mm
Cc: oe-kbuild-all, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy
Hi Nhat,
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on v6.19]
[cannot apply to akpm-mm/mm-everything tj-cgroup/for-next tip/smp/core next-20260205]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/swap-rearrange-the-swap-header-file/20260209-065842
base: linus/master
patch link: https://lore.kernel.org/r/20260208215839.87595-10-nphamcs%40gmail.com
patch subject: [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page
config: s390-randconfig-r134-20260209 (https://download.01.org/0day-ci/archive/20260210/202602100110.Au8uHgc8-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 14.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260210/202602100110.Au8uHgc8-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602100110.Au8uHgc8-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> mm/vswap.c:653:9: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct local_lock_t * @@
mm/vswap.c:653:9: sparse: expected void const [noderef] __percpu *__vpp_verify
mm/vswap.c:653:9: sparse: got struct local_lock_t *
>> mm/vswap.c:653:9: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct local_lock_t * @@
mm/vswap.c:653:9: sparse: expected void const [noderef] __percpu *__vpp_verify
mm/vswap.c:653:9: sparse: got struct local_lock_t *
mm/vswap.c:665:9: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct local_lock_t * @@
mm/vswap.c:665:9: sparse: expected void const [noderef] __percpu *__vpp_verify
mm/vswap.c:665:9: sparse: got struct local_lock_t *
mm/vswap.c:665:9: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct local_lock_t * @@
mm/vswap.c:665:9: sparse: expected void const [noderef] __percpu *__vpp_verify
mm/vswap.c:665:9: sparse: got struct local_lock_t *
mm/vswap.c:182:36: sparse: sparse: context imbalance in 'vswap_iter' - unexpected unlock
mm/vswap.c:284:17: sparse: sparse: context imbalance in 'vswap_alloc' - different lock contexts for basic block
mm/vswap.c:413:19: sparse: sparse: context imbalance in 'vswap_free' - unexpected unlock
mm/vswap.c: note: in included file (through include/linux/rbtree.h, include/linux/mm_types.h, include/linux/mmzone.h, ...):
include/linux/rcupdate.h:897:25: sparse: sparse: context imbalance in 'folio_alloc_swap' - unexpected unlock
mm/vswap.c:570:9: sparse: sparse: context imbalance in 'swp_entry_to_swp_slot' - unexpected unlock
vim +653 mm/vswap.c
643
644 static int vswap_cpu_dead(unsigned int cpu)
645 {
646 struct percpu_vswap_cluster *percpu_cluster;
647 struct vswap_cluster *cluster;
648 int order;
649
650 percpu_cluster = per_cpu_ptr(&percpu_vswap_cluster, cpu);
651
652 rcu_read_lock();
> 653 local_lock(&percpu_cluster->lock);
654 for (order = 0; order < SWAP_NR_ORDERS; order++) {
655 cluster = percpu_cluster->clusters[order];
656 if (cluster) {
657 percpu_cluster->clusters[order] = NULL;
658 spin_lock(&cluster->lock);
659 cluster->cached = false;
660 if (refcount_dec_and_test(&cluster->refcnt))
661 vswap_cluster_free(cluster);
662 spin_unlock(&cluster->lock);
663 }
664 }
665 local_unlock(&percpu_cluster->lock);
666 rcu_read_unlock();
667
668 return 0;
669 }
670
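Both these sparse warnings and the gcc error Chris reported for the same
line point at one issue: as the macro expansion in that error shows,
local_lock() runs its argument through this_cpu_ptr(), so it expects the
address of the per-CPU local_lock_t itself, not a pointer that has already
been resolved with per_cpu_ptr(). And even if it compiled, this_cpu_ptr()
would resolve to the running CPU rather than the dead @cpu, which is
presumably not what a hotplug-dead callback wants. A sketch of the API
contract, reusing the names from the excerpt above; demo() and the "OK"
usage are illustrations, not the series' fix:

	struct percpu_vswap_cluster {
		local_lock_t lock;
		struct vswap_cluster *clusters[SWAP_NR_ORDERS];
	};
	static DEFINE_PER_CPU(struct percpu_vswap_cluster, percpu_vswap_cluster) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void demo(unsigned int cpu)
	{
		struct percpu_vswap_cluster *pc;

		/* OK: pass the per-CPU variable; local_lock() applies
		 * this_cpu_ptr() internally. */
		local_lock(&percpu_vswap_cluster.lock);
		local_unlock(&percpu_vswap_cluster.lock);

		/* Trips __verify_pcpu_ptr(): the address was already resolved
		 * for a specific CPU, so it is no longer __percpu. */
		pc = per_cpu_ptr(&percpu_vswap_cluster, cpu);
		local_lock(&pc->lock);
		local_unlock(&pc->lock);
	}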
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-09 12:20 ` Chris Li
@ 2026-02-10 2:36 ` Johannes Weiner
2026-02-10 21:24 ` Chris Li
2026-02-10 18:00 ` Nhat Pham
1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2026-02-10 2:36 UTC (permalink / raw)
To: Chris Li
Cc: Nhat Pham, akpm, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
huang.ying.caritas, ryan.roberts, shikemeng, viro, baohua, bhe,
osalvador, christophe.leroy, pavel, linux-mm, kernel-team,
linux-kernel, cgroups, linux-pm, peterx, riel, joshua.hahnjy,
npache, gourry, axelrasmussen, yuanchu, weixugc, rafael, jannh,
pfalcato, zhengqi.arch
Hi Chris,
On Mon, Feb 09, 2026 at 04:20:21AM -0800, Chris Li wrote:
> On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > My sincerest apologies - it seems like the cover letter (and just the
> > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > out what happened - it works when I send the entire patch series to
> > myself...
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
>
> For the record I did receive your original V3 cover letter from the
> linux-mm mailing list.
>
> > Changelog:
> > * RFC v2 -> v3:
> > * Implement a cluster-based allocation algorithm for virtual swap
> > slots, inspired by Kairui Song and Chris Li's implementation, as
> > well as Johannes Weiner's suggestions. This eliminates the lock
> > contention issues on the virtual swap layer.
> > * Re-use swap table for the reverse mapping.
> > * Remove CONFIG_VIRTUAL_SWAP.
> > * Reducing the size of the swap descriptor from 48 bytes to 24
>
> Is the per swap slot entry overhead 24 bytes in your implementation?
> The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
> big jump. You can argue that 8->24 is not a big jump . But it is an
> unnecessary price compared to the alternatives, which is 8 dynamic +
> 4(optional redirect).
No, this is not the net overhead.
The descriptor consolidates and eliminates several other data
structures.
Here is the more detailed breakdown:
> > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > not all "new" overhead, as the swap descriptor will replace:
> > * the swap_cgroup arrays (one per swap type) in the old design, which
> > is a massive source of static memory overhead. With the new design,
> > it is only allocated for used clusters.
> > * the swap tables, which holds the swap cache and workingset shadows.
> > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > indicate whether the swapped out page is zero-filled or not.
> > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > one for allocated slots, and one for bad slots, representing 3 possible
> > states of a slot on the swapfile: allocated, free, and bad.
> > * the zswap tree.
> >
> > So, in terms of additional memory overhead:
> > * For zswap entries, the added memory overhead is rather minimal. The
> > new indirection pointer neatly replaces the existing zswap tree.
> > We really only incur less than one word of overhead for swap count
> > blow up (since we no longer use swap continuation) and the swap type.
> > * For physical swap entries, the new design will impose fewer than 3 words
> > memory overhead. However, as noted above this overhead is only for
> > actively used swap entries, whereas in the current design the overhead is
> > static (including the swap cgroup array for example).
> >
> > The primary victim of this overhead will be zram users. However, as
> > zswap now no longer takes up disk space, zram users can consider
> > switching to zswap (which, as a bonus, has a lot of useful features
> > out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > LRU-ordering writeback, etc.).
> >
> > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > 8,388,608 swap entries), and we use zswap.
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 0.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 48.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 96.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 121.00 MB
> > * Vswap total overhead: 144.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 153.00 MB
> > * Vswap total overhead: 193.00 MB
> >
> > So even in the worst case scenario for virtual swap, i.e when we
> > somehow have an oracle to correctly size the swapfile for zswap
> > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > 0.12% of the total swapfile :)
> >
> > In practice, the overhead will be closer to the 50-75% usage case, as
> > systems tend to leave swap headroom for pathological events or sudden
> > spikes in memory requirements. The added overhead in these cases are
> > practically neglible. And in deployments where swapfiles for zswap
> > are previously sparsely used, switching over to virtual swap will
> > actually reduce memory overhead.
> >
> > Doing the same math for the disk swap, which is the worst case for
> > virtual swap in terms of swap backends:
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store
2026-02-08 21:58 ` [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store Nhat Pham
@ 2026-02-10 6:31 ` Dan Carpenter
0 siblings, 0 replies; 52+ messages in thread
From: Dan Carpenter @ 2026-02-10 6:31 UTC (permalink / raw)
To: oe-kbuild, Nhat Pham, linux-mm
Cc: lkp, oe-kbuild-all, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy
Hi Nhat,
kernel test robot noticed the following build warnings:
url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/mm-swap-decouple-swap-cache-from-physical-swap-infrastructure/20260209-120606
base: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
patch link: https://lore.kernel.org/r/20260208215839.87595-15-nphamcs%40gmail.com
patch subject: [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store
config: powerpc-randconfig-r073-20260209 (https://download.01.org/0day-ci/archive/20260209/202602092300.lZO4Ee4N-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 15.2.0
smatch version: v0.5.0-8994-gd50c5a4c
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202602092300.lZO4Ee4N-lkp@intel.com/
smatch warnings:
mm/vswap.c:733 vswap_alloc_swap_slot() warn: variable dereferenced before check 'folio' (see line 701)
vim +/folio +733 mm/vswap.c
19a5fe94e9aae4 Nhat Pham 2026-02-08 694 bool vswap_alloc_swap_slot(struct folio *folio)
19a5fe94e9aae4 Nhat Pham 2026-02-08 695 {
19a5fe94e9aae4 Nhat Pham 2026-02-08 696 int i, nr = folio_nr_pages(folio);
19a5fe94e9aae4 Nhat Pham 2026-02-08 697 struct vswap_cluster *cluster = NULL;
19a5fe94e9aae4 Nhat Pham 2026-02-08 698 struct swap_info_struct *si;
19a5fe94e9aae4 Nhat Pham 2026-02-08 699 struct swap_cluster_info *ci;
19a5fe94e9aae4 Nhat Pham 2026-02-08 700 swp_slot_t slot = { .val = 0 };
19a5fe94e9aae4 Nhat Pham 2026-02-08 @701 swp_entry_t entry = folio->swap;
folio dereference here
19a5fe94e9aae4 Nhat Pham 2026-02-08 702 struct swp_desc *desc;
19a5fe94e9aae4 Nhat Pham 2026-02-08 703 bool fallback = false;
19a5fe94e9aae4 Nhat Pham 2026-02-08 704
19a5fe94e9aae4 Nhat Pham 2026-02-08 705 /*
19a5fe94e9aae4 Nhat Pham 2026-02-08 706 * We might have already allocated a backing physical swap slot in past
19a5fe94e9aae4 Nhat Pham 2026-02-08 707 * attempts (for instance, when we disable zswap). If the entire range is
19a5fe94e9aae4 Nhat Pham 2026-02-08 708 * already swapfile-backed we can skip swapfile case.
19a5fe94e9aae4 Nhat Pham 2026-02-08 709 */
19a5fe94e9aae4 Nhat Pham 2026-02-08 710 if (vswap_swapfile_backed(entry, nr))
19a5fe94e9aae4 Nhat Pham 2026-02-08 711 return true;
19a5fe94e9aae4 Nhat Pham 2026-02-08 712
19a5fe94e9aae4 Nhat Pham 2026-02-08 713 if (swap_slot_alloc(&slot, folio_order(folio)))
and here
19a5fe94e9aae4 Nhat Pham 2026-02-08 714 return false;
19a5fe94e9aae4 Nhat Pham 2026-02-08 715
19a5fe94e9aae4 Nhat Pham 2026-02-08 716 if (!slot.val)
19a5fe94e9aae4 Nhat Pham 2026-02-08 717 return false;
19a5fe94e9aae4 Nhat Pham 2026-02-08 718
7f88e3ea20f231 Nhat Pham 2026-02-08 719 /* establish the virtual <-> physical swap slot linkages. */
7f88e3ea20f231 Nhat Pham 2026-02-08 720 si = __swap_slot_to_info(slot);
7f88e3ea20f231 Nhat Pham 2026-02-08 721 ci = swap_cluster_lock(si, swp_slot_offset(slot));
7f88e3ea20f231 Nhat Pham 2026-02-08 722 vswap_rmap_set(ci, slot, entry.val, nr);
7f88e3ea20f231 Nhat Pham 2026-02-08 723 swap_cluster_unlock(ci);
7f88e3ea20f231 Nhat Pham 2026-02-08 724
7f88e3ea20f231 Nhat Pham 2026-02-08 725 rcu_read_lock();
7f88e3ea20f231 Nhat Pham 2026-02-08 726 for (i = 0; i < nr; i++) {
7f88e3ea20f231 Nhat Pham 2026-02-08 727 desc = vswap_iter(&cluster, entry.val + i);
7f88e3ea20f231 Nhat Pham 2026-02-08 728 VM_WARN_ON(!desc);
7f88e3ea20f231 Nhat Pham 2026-02-08 729
19a5fe94e9aae4 Nhat Pham 2026-02-08 730 if (desc->type == VSWAP_FOLIO) {
19a5fe94e9aae4 Nhat Pham 2026-02-08 731 /* case 1: fallback from zswap store failure */
19a5fe94e9aae4 Nhat Pham 2026-02-08 732 fallback = true;
19a5fe94e9aae4 Nhat Pham 2026-02-08 @733 if (!folio)
So it can't be NULL here.
19a5fe94e9aae4 Nhat Pham 2026-02-08 734 folio = desc->swap_cache;
So we'll never do this assignment and it will never become NULL.
19a5fe94e9aae4 Nhat Pham 2026-02-08 735 else
19a5fe94e9aae4 Nhat Pham 2026-02-08 736 VM_WARN_ON(folio != desc->swap_cache);
19a5fe94e9aae4 Nhat Pham 2026-02-08 737 } else {
19a5fe94e9aae4 Nhat Pham 2026-02-08 738 /*
19a5fe94e9aae4 Nhat Pham 2026-02-08 739 * Case 2: zswap writeback.
19a5fe94e9aae4 Nhat Pham 2026-02-08 740 *
19a5fe94e9aae4 Nhat Pham 2026-02-08 741 * No need to free zswap entry here - it will be freed once zswap
19a5fe94e9aae4 Nhat Pham 2026-02-08 742 * writeback succeeds.
19a5fe94e9aae4 Nhat Pham 2026-02-08 743 */
19a5fe94e9aae4 Nhat Pham 2026-02-08 744 VM_WARN_ON(desc->type != VSWAP_ZSWAP);
19a5fe94e9aae4 Nhat Pham 2026-02-08 745 VM_WARN_ON(fallback);
19a5fe94e9aae4 Nhat Pham 2026-02-08 746 }
19a5fe94e9aae4 Nhat Pham 2026-02-08 747 desc->type = VSWAP_SWAPFILE;
7f88e3ea20f231 Nhat Pham 2026-02-08 748 desc->slot.val = slot.val + i;
7f88e3ea20f231 Nhat Pham 2026-02-08 749 }
7f88e3ea20f231 Nhat Pham 2026-02-08 750 spin_unlock(&cluster->lock);
7f88e3ea20f231 Nhat Pham 2026-02-08 751 rcu_read_unlock();
19a5fe94e9aae4 Nhat Pham 2026-02-08 752 return true;
7f88e3ea20f231 Nhat Pham 2026-02-08 753 }
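The warning boils down to the classic deref-before-NULL-check pattern:
folio is dereferenced unconditionally at lines 696, 701 and 713, so by line
733 it cannot be NULL and the "if (!folio)" arm is dead code. The two usual
resolutions, sketched here as assumptions rather than the series' intended
fix:

	/* A: folio can never be NULL for this caller; drop the dead branch
	 * and keep the invariant as an assertion. */
	if (desc->type == VSWAP_FOLIO) {
		fallback = true;
		VM_WARN_ON(folio != desc->swap_cache);
	}

	/* B: NULL is a legitimate input; then the folio_nr_pages(),
	 * folio->swap and folio_order() uses above must move below an early
	 * NULL check (or be derived from the descriptor instead). */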
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* [syzbot ci] Re: Virtual Swap Space
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
` (20 preceding siblings ...)
2026-02-08 22:51 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
@ 2026-02-10 15:45 ` syzbot ci
21 siblings, 0 replies; 52+ messages in thread
From: syzbot ci @ 2026-02-10 15:45 UTC (permalink / raw)
To: akpm, axelrasmussen, baohua, bhe, cgroups, chengming.zhou,
chrisl, christophe.leroy, gourry, hannes, huang.ying.caritas,
hughd, jannh, joshua.hahnjy, kasong, kernel-team, len.brown,
linux-kernel, linux-mm, linux-pm, lorenzo.stoakes, mhocko,
muchun.song, npache, nphamcs, osalvador, pavel, peterx, pfalcato,
rafael, riel, roman.gushchin, ryan.roberts, shakeel.butt,
shikemeng, viro, weixugc, yosry.ahmed, yuanchu, zhengqi.arch
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v3] Virtual Swap Space
https://lore.kernel.org/all/20260208215839.87595-1-nphamcs@gmail.com
* [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure
* [PATCH v3 02/20] swap: rearrange the swap header file
* [PATCH v3 03/20] mm: swap: add an abstract API for locking out swapoff
* [PATCH v3 04/20] zswap: add new helpers for zswap entry operations
* [PATCH v3 05/20] mm/swap: add a new function to check if a swap entry is in swap cached.
* [PATCH v3 06/20] mm: swap: add a separate type for physical swap slots
* [PATCH v3 07/20] mm: create scaffolds for the new virtual swap implementation
* [PATCH v3 08/20] zswap: prepare zswap for swap virtualization
* [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page
* [PATCH v3 10/20] swap: move swap cache to virtual swap descriptor
* [PATCH v3 11/20] zswap: move zswap entry management to the virtual swap descriptor
* [PATCH v3 12/20] swap: implement the swap_cgroup API using virtual swap
* [PATCH v3 13/20] swap: manage swap entry lifecycle at the virtual swap layer
* [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store
* [PATCH v3 15/20] zswap: do not start zswap shrinker if there is no physical swap slots
* [PATCH v3 16/20] swap: do not unnecesarily pin readahead swap entries
* [PATCH v3 17/20] swapfile: remove zeromap bitmap
* [PATCH v3 18/20] memcg: swap: only charge physical swap slots
* [PATCH v3 19/20] swap: simplify swapoff using virtual swap
* [PATCH v3 20/20] swapfile: replace the swap map with bitmaps
and found the following issue:
possible deadlock in vswap_iter
Full report is available here:
https://ci.syzbot.org/series/b9defda6-daec-4c41-bbf9-7d3b7fabd7cb
***
possible deadlock in vswap_iter
tree: bpf
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf.git
base: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/f444cfbe-4ce0-4917-94aa-3a8bd96ee376/config
C repro: https://ci.syzbot.org/findings/7b8c50b1-47d6-42e0-bcfc-814e7b3bb596/c_repro
syz repro: https://ci.syzbot.org/findings/7b8c50b1-47d6-42e0-bcfc-814e7b3bb596/syz_repro
loop0: detected capacity change from 0 to 764
============================================
WARNING: possible recursive locking detected
syzkaller #0 Not tainted
--------------------------------------------
syz-executor625/5806 is trying to acquire lock:
ffff88811884c018 (&cluster->lock){+.+.}-{3:3}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffff88811884c018 (&cluster->lock){+.+.}-{3:3}, at: vswap_iter+0xfa/0x1b0 mm/vswap.c:274
but task is already holding lock:
ffff88811884c018 (&cluster->lock){+.+.}-{3:3}, at: spin_lock_irq include/linux/spinlock.h:376 [inline]
ffff88811884c018 (&cluster->lock){+.+.}-{3:3}, at: swap_cache_lock_irq+0xe2/0x190 mm/vswap.c:1586
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&cluster->lock);
lock(&cluster->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by syz-executor625/5806:
#0: ffff888174bc2800 (&mm->mmap_lock){++++}-{4:4}, at: mmap_read_lock include/linux/mmap_lock.h:391 [inline]
#0: ffff888174bc2800 (&mm->mmap_lock){++++}-{4:4}, at: madvise_lock+0x152/0x2e0 mm/madvise.c:1789
#1: ffff88811884c018 (&cluster->lock){+.+.}-{3:3}, at: spin_lock_irq include/linux/spinlock.h:376 [inline]
#1: ffff88811884c018 (&cluster->lock){+.+.}-{3:3}, at: swap_cache_lock_irq+0xe2/0x190 mm/vswap.c:1586
#2: ffffffff8e55a360 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:331 [inline]
#2: ffffffff8e55a360 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:867 [inline]
#2: ffffffff8e55a360 (rcu_read_lock){....}-{1:3}, at: vswap_cgroup_record+0x40/0x290 mm/vswap.c:1925
stack backtrace:
***
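Condensed, the trace shows the same per-cluster spinlock being acquired twice on one CPU. A minimal sketch of the reported pattern (names are taken from the trace; the bodies and signatures are simplified guesses, not the actual mm/vswap.c code):

    #include <linux/spinlock.h>

    /* Illustrative reconstruction of the lockdep report above. */
    struct vswap_cluster {
            spinlock_t lock;
            /* ... */
    };

    static void swap_cache_lock_irq(struct vswap_cluster *cluster)
    {
            spin_lock_irq(&cluster->lock);      /* first acquisition (vswap.c:1586) */
    }

    static void vswap_iter(struct vswap_cluster *cluster)
    {
            spin_lock(&cluster->lock);          /* second acquisition (vswap.c:274) */
            /* ... walk descriptors ... */
            spin_unlock(&cluster->lock);
    }

    /*
     * Per the held-locks list: the madvise path takes cluster->lock via
     * swap_cache_lock_irq(), then calls vswap_cgroup_record() -> vswap_iter(),
     * which tries to take the same cluster->lock again -> self-deadlock.
     */

Presumably the fix needs vswap_iter() (or its caller) to avoid re-taking a cluster lock the swap-cache path already holds; that is only a reading of the trace, not of the actual code.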
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-08 22:26 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
@ 2026-02-10 17:59 ` Kairui Song
2026-02-10 18:52 ` Johannes Weiner
` (3 more replies)
0 siblings, 4 replies; 52+ messages in thread
From: Kairui Song @ 2026-02-10 17:59 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Anyway, resending this (in-reply-to patch 1 of the series):
Hi Nhat,
> Changelog:
> * RFC v2 -> v3:
> * Implement a cluster-based allocation algorithm for virtual swap
> slots, inspired by Kairui Song and Chris Li's implementation, as
> well as Johannes Weiner's suggestions. This eliminates the lock
> contention issues on the virtual swap layer.
> * Re-use swap table for the reverse mapping.
> * Remove CONFIG_VIRTUAL_SWAP.
I really do think we had better make this optional, not a replacement or
mandatory layer. There are many hard-to-evaluate effects, as this
fundamentally changes the swap workflow with a lot of behavior changes
at once. For example, it seems the folio will be reactivated instead of
split if the physical swap device is fragmented; the slot is allocated
at IO time and not at unmap time; and there may be many others. Just like
zswap is optional. Some common workloads would see an obvious performance
or memory usage regression with this design; see below.
> * Reducing the size of the swap descriptor from 48 bytes to 24
> bytes, i.e another 50% reduction in memory overhead from v2.
Honestly if you keep reducing that you might just end up
reimplementing the swap table format :)
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show
> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
Thanks for the effort!
> * Operationally, statically provisioning the swapfile for zswap poses
> significant challenges, because the sysadmin has to prescribe how
> much swap is needed a priori, for each combination of
> (memory size x disk space x workload usage). It is even more
> complicated when we take into account the variance of memory
> compression, which changes the reclaim dynamics (and as a result,
> swap space size requirement). The problem is further exacerbated for
> users who rely on swap utilization (and exhaustion) as an OOM signal.
So I thought about it again, and this one seems not to be an issue. In
most cases, having a 1:1 virtual swap setup is enough, and very soon
the static overhead will be really trivial. There won't be any
fragmentation issue either, since if the physical memory size is
identical to the swap space, you can always find a matching part. And
besides, dynamic growth of swap files is actually very doable and
useful: it would make physical swapfiles adjustable at runtime, so
users won't need to waste a swap type id to extend physical swap
space.
> * Another motivation is to simplify swapoff, which is both complicated
> and expensive in the current design, precisely because we are storing
> an encoding of the backend positional information in the page table,
> and thus requires a full page table walk to remove these references.
The swapoff here is not really a clean swapoff: minor faults will
still be triggered afterwards, and metadata is not released, so this
new swapoff cannot really guarantee the same performance as the old
one. And on the other hand, with the older design we can already just
read everything into the swap cache and skip the page table walk;
that's just not a clean swapoff either.
> struct swp_desc {
> union {
> swp_slot_t slot; /* 0 8 */
> struct zswap_entry * zswap_entry; /* 0 8 */
> }; /* 0 8 */
> union {
> struct folio * swap_cache; /* 8 8 */
> void * shadow; /* 8 8 */
> }; /* 8 8 */
> unsigned int swap_count; /* 16 4 */
> unsigned short memcgid:16; /* 20: 0 2 */
> bool in_swapcache:1; /* 22: 0 1 */
A standalone bit for the swap cache looks like the old SWAP_HAS_CACHE,
which caused many issues...
>
> /* Bitfield combined with previous fields */
>
> enum swap_type type:2; /* 20:17 4 */
>
> /* size: 24, cachelines: 1, members: 6 */
> /* bit_padding: 13 bits */
> /* last cacheline: 24 bytes */
> };
Having a struct larger than 8 bytes means you can't load it
atomically, which limits your lock design. About a year ago Chris
shared with me an idea to use CAS on swap entries once they are small
and unified; that's why the swap table uses atomic_long_t and has
helpers like __swap_table_xchg, though we are not making good use of
them yet. Meanwhile, we have already consolidated the lock scope to the
folio in many places, so holding the folio lock and then doing the CAS
without touching the cluster lock at all might soon be feasible for
many swap operations.
E.g. we already have a cluster-lockless version of swap check in swap table p3:
https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-11-fe0b67ef0215@tencent.com/
That might also greatly simplify the locking on IO and migration
performance between swap devices.
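For reference, a minimal sketch of the kind of cluster-lockless compare-and-swap update on an 8-byte swap table entry being described here (the helper name and entry encoding are illustrative assumptions, not the actual swap table API):

    #include <linux/atomic.h>

    /*
     * Illustrative only: atomically swing an 8-byte swap table entry from an
     * expected old value to a new one without holding the cluster lock. The
     * real helpers (e.g. __swap_table_xchg) may look different.
     */
    static inline bool swap_table_try_swing(atomic_long_t *ent, long old, long new)
    {
            /* atomic_long_cmpxchg() returns the value previously stored */
            return atomic_long_cmpxchg(ent, old, new) == old;
    }

Because the whole per-entry state then fits in one machine word, the folio lock plus a CAS like this could stand in for the cluster lock on many paths, which is the direction described above.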
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
Actually this worst case is a very common case... see below.
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
Hmm... With the swap table we will have a stable 8 bytes per slot in
all cases: current mm-stable uses 11 bytes (8 bytes dynamic and 3
bytes static), and the posted p3 already gets that down to 10 bytes (8
bytes dynamic and 2 bytes static). P4 or a follow-up was already
demonstrated last year with working code, and it makes everything
dynamic (8 bytes, fully dynamic; I'll rebase and send that once p3 is
merged).
So with mm-stable and follow-ups, for a 32G swap device:
0% usage, or 0/8,388,608 entries: 0.00 MB
* mm-stable total overhead: 25.50 MB (which is swap table p2)
* swap-table p3 overhead: 17.50 MB
* swap-table p4 overhead: 0.50 MB
* Vswap total overhead: 2.00 MB
100% usage, or 8,388,608/8,388,608 entries:
* mm-stable total overhead: 89.5 MB (which is swap table p2)
* swap-table p3 overhead: 81.5 MB
* swap-table p4 overhead: 64.5 MB
* Vswap total overhead: 259.00 MB
That's 3 - 4 times more memory usage, quite a trade-off. With a
128G device, which is not rare, it would be 1G of memory. Swap table
p3 / p4 is about 320M / 256M, and we do have a way to cut that down
to <1 byte or 3 bytes per page with swap table compaction, which was
discussed at LSFMM last year, or even to 1 bit, as once suggested by
Baolin, which would make it much smaller, down to <24MB. (This is just
an idea for now, but the compaction is very doable, as we already have
"LRU"s for swap clusters in the swap allocator.)
I don't think that looks good as a mandatory overhead. We do have a
huge user base of swap over many different kinds of devices; not long
ago, two new kernel Bugzilla issues / bug reports about swap over disk
were sent to the mailing list, and I'm still trying to investigate one
of them, which actually seems to be a page LRU issue rather than a
swap problem... OK, a little off topic. Anyway, I'm not saying we
don't want more features; as I mentioned above, it would just be
better if this could be optional and minimal. See more test info below.
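For reference, a back-of-the-envelope model in plain C that approximately reproduces the disk-swap overhead figures quoted above for a 32 GB (8,388,608-slot) device; the per-slot constants (25 MB vs 2 MB static, 8 B vs 32 B per used slot) are assumptions inferred from those figures, not measurements:

    #include <stdio.h>

    int main(void)
    {
            const double slots = 8388608.0;         /* 32 GB of 4 KiB slots */
            const double usage[] = { 0.0, 0.25, 0.50, 0.75, 1.0 };
            const double MB = 1024.0 * 1024.0;

            for (int i = 0; i < 5; i++) {
                    double used = slots * usage[i];
                    /* old design: ~25 MB static (swap_map + swap_cgroup + zeromap)
                     * plus ~8 B of swap table per used slot */
                    double old_mb = 25.0 + used * 8.0 / MB;
                    /* vswap, disk backend: ~2 MB static (two bitmaps) plus ~24 B
                     * descriptor + ~8 B reverse-map entry per used slot */
                    double vswap_mb = 2.0 + used * 32.0 / MB;
                    printf("%3.0f%% used: old ~%6.2f MB, vswap ~%6.2f MB\n",
                           usage[i] * 100.0, old_mb, vswap_mb);
            }
            return 0;
    }

The posted vswap figures run slightly higher than this model (e.g. 259 MB vs ~258 MB at 100% usage), presumably due to per-cluster or other bookkeeping the two constants here do not capture.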
> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and virtual swap allocator is simpler than that of physical
> swap.
Congrats! Yeah, I guess that's because vswap has a smaller lock scope
than zswap with a reduced callpath?
>
> Using SSD swap as the backend:
>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck-to-neck.
Thanks for the bench, but please also test under global pressure.
One mistake I made when working on the prototype of swap tables was
focusing only on cgroup memory pressure, which is really not how
everyone uses Linux; that's why I spent a long time reworking the RCU
allocation / freeing of swap table pages so there won't be any
regression even for low-end machines and global pressure. That's kind
of critical for devices like Android.
I did an overnight bench on this with global pressure, comparing
against mainline 6.19 and swap table p3 (I include such a test for
each swap table series; p2 / p3 are close, so I just rebased the
latest p3 on top of your base commit to be fair, and that's easier
for me too), and it doesn't look that good.
Test machine setup for vm-scalability:
# lscpu | grep "Model name"
Model name: AMD EPYC 7K62 48-Core Processor
# free -m
total used free shared buff/cache available
Mem: 31582 909 26388 8 4284 29989
Swap: 40959 41 40918
The swap setup follows the recommendation from Huang
(https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).
Test (average of 18 test runs):
vm-scalability/usemem --init-time -O -y -x -n 1 56G
6.19:
Throughput: 618.49 MB/s (stdev 31.3)
Free latency: 5754780.50us (stdev 69542.7)
swap-table-p3 (3.8%, 0.5% better):
Throughput: 642.02 MB/s (stdev 25.1)
Free latency: 5728544.16us (stdev 48592.51)
vswap (3.2%, 244% worse):
Throughput: 598.67 MB/s (stdev 25.1)
Free latency: 13987175.66us (stdev 125148.57)
That's a huge regression in freeing. I have a vm-scalability test
matrix; not every setup shows such a significant >200% regression, but
on average the freeing time is at least about 15 - 50% slower (for
example, with /data/vm-scalability/usemem --init-time -O -y -x -n 32
1536M the regression is about 2583221.62us vs 2153735.59us).
Throughput is lower across the board too.
Freeing is important, as it was causing many problems before; it's the
reason we had a swap slot freeing cache years ago (we later removed it,
since the freeing cache caused more problems and the swap allocator
already improved things beyond what the cache offered). People even
tried to optimize that:
https://lore.kernel.org/linux-mm/20250909065349.574894-1-liulei.rjpt@vivo.com/
(this seems to be an already-fixed downstream issue, solved by the swap
allocator or swap table). Some workloads may greatly amplify the free
latency and cause serious lags, as shown above.
Another thing I personally care about is how swap works on my daily
laptop :). Building the kernel in a 2G test VM using NVMe as swap is a
very practical workload I run every day, and the result is also not
good (average of 8 test runs, make -j12):
#free -m
total used free shared buff/cache available
Mem: 1465 216 1026 0 300 1248
Swap: 4095 36 4059
6.19 systime:
109.6s
swap-table p3:
108.9s
vswap systime:
118.7s
On a build server, it's also slower (make -j48 in a 4G-memory VM with
NVMe swap, average of 10 test runs):
# free -m
total used free shared buff/cache available
Mem: 3877 1444 2019 737 1376 2432
Swap: 32767 1886 30881
# lscpu | grep "Model name"
Model name: Intel(R) Xeon(R) Platinum
8255C CPU @ 2.50GHz
6.19 systime:
435.601s
swap-table p3:
432.793s
vswap systime:
455.652s
In conclusion, it's about 4.3 - 8.3% slower for common workloads under
global pressure, and there is an up-to-200% regression on freeing. ZRAM
shows an even larger regression, but I'll skip that part since your
series is focusing on zswap now. Redis is also ~20% slower compared to
mm-stable (327515.00 RPS vs 405827.81 RPS); that's mostly due to
swap-table-p2 in mm-stable, so I didn't do further comparisons.
So if that's not a bug in this series, I think the double free or the
decoupling of swap entries from their underlying slots might be the
cause of the freeing regression shown above. That's a really serious
issue, and the global-pressure behavior might be a critical issue too,
as the metadata is much larger and is already causing regressions for
very common workloads. Low-end users could easily hit the min watermark
and see serious jitter or allocation failures.
That's part of the issues I've found, so I really do think we need a
flexible way to implement this rather than a mandatory layer. After
swap table P4 we should be able to figure out a way to fit all needs,
with a cleanly defined set of swap APIs, metadata and layers, as was
discussed at LSFMM last year.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-09 12:20 ` Chris Li
2026-02-10 2:36 ` Johannes Weiner
@ 2026-02-10 18:00 ` Nhat Pham
2026-02-10 23:17 ` Chris Li
1 sibling, 1 reply; 52+ messages in thread
From: Nhat Pham @ 2026-02-10 18:00 UTC (permalink / raw)
To: Chris Li
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
huang.ying.caritas, ryan.roberts, shikemeng, viro, baohua, bhe,
osalvador, christophe.leroy, pavel, linux-mm, kernel-team,
linux-kernel, cgroups, linux-pm, peterx, riel, joshua.hahnjy,
npache, gourry, axelrasmussen, yuanchu, weixugc, rafael, jannh,
pfalcato, zhengqi.arch
On Mon, Feb 9, 2026 at 4:20 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > My sincerest apologies - it seems like the cover letter (and just the
> > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > out what happened - it works when I send the entire patch series to
> > myself...
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
>
> For the record I did receive your original V3 cover letter from the
> linux-mm mailing list.
I have no idea what happened, to be honest. It did not show up on lore
for a couple of hours, and my coworkers did not receive the cover
letter email initially. I did not receive any error message or logs
either - git send-email returned Success, and when I checked on the
web Gmail client (since I used a Gmail account), the whole series was
there.
I tried re-sending a couple of times, to no avail. Then, a couple of
hours later, all of these attempts showed up.
Anyway, this is my bad - I'll be more patient next time. If it does
not show up for a couple of hours, then I'll do some more digging.
>
> > Changelog:
> > * RFC v2 -> v3:
> > * Implement a cluster-based allocation algorithm for virtual swap
> > slots, inspired by Kairui Song and Chris Li's implementation, as
> > well as Johannes Weiner's suggestions. This eliminates the lock
> > contention issues on the virtual swap layer.
> > * Re-use swap table for the reverse mapping.
> > * Remove CONFIG_VIRTUAL_SWAP.
> > * Reducing the size of the swap descriptor from 48 bytes to 24
>
> Is the per-swap-slot entry overhead 24 bytes in your implementation?
> The current swap overhead is 3 bytes static + 8 bytes dynamic; your 24
> bytes dynamic is a big jump. You can argue that 8 -> 24 is not a big
> jump, but it is an unnecessary price compared to the alternatives,
> which are 8 bytes dynamic + 4 bytes (optional redirect).
It depends on the case - you can check the memory overhead discussion below :)
>
> > bytes, i.e another 50% reduction in memory overhead from v2.
> > * Remove swap cache and zswap tree and use the swap descriptor
> > for this.
> > * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
> > (one for allocated slots, and one for bad slots).
> > * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
>
> My git log shows 7d0a66e4bb9081d75c82ec4957c50034cb0ea449 is tag "v6.18".
Oh yeah I forgot to update that. That was from an old cover letter of
an old version that never got sent out - I'll correct that in future
versions
(if you scroll down to the bottom of the cover letter you should see
the correct base, which should be 6.19).
>
> > * Update cover letter to include new benchmark results and discussion
> > on overhead in various cases.
> > * RFC v1 -> RFC v2:
> > * Use a single atomic type (swap_refs) for reference counting
> > purpose. This brings the size of the swap descriptor from 64 B
> > down to 48 B (25% reduction). Suggested by Yosry Ahmed.
> > * Zeromap bitmap is removed in the virtual swap implementation.
> > This saves one bit per physical swapfile slot.
> > * Rearrange the patches and the code change to make things more
> > reviewable. Suggested by Johannes Weiner.
> > * Update the cover letter a bit.
> >
> > This patch series implements the virtual swap space idea, based on Yosry's
> > proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> > inputs from Johannes Weiner. The same idea (with different
> > implementation details) has been floated by Rik van Riel since at least
> > 2011 (see [8]).
> >
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in the mm-stable branch that I would need to
> > coordinate with, but I would like to send this out as an update, to show
>
> Ah, you need to mention that in the first line to Andrew. Spell out
> that this series is not for Andrew to consume into the MM tree. It
> can't be anyway, because it does not apply to mm-unstable or mm-stable.
Fair - I'll make sure to move this paragraph to above the changelog next time :)
>
> BTW, I have the following compile error with this series (fedora 43).
> Same config compile fine on v6.19.
>
> In file included from ./include/linux/local_lock.h:5,
> from ./include/linux/mmzone.h:24,
> from ./include/linux/gfp.h:7,
> from ./include/linux/mm.h:7,
> from mm/vswap.c:7:
> mm/vswap.c: In function ‘vswap_cpu_dead’:
> ./include/linux/percpu-defs.h:221:45: error: initialization from
> pointer to non-enclosed address space
> 221 | const void __percpu *__vpp_verify = (typeof((ptr) +
> 0))NULL; \
> | ^
> ./include/linux/local_lock_internal.h:105:40: note: in definition of
> macro ‘__local_lock_acquire’
> 105 | __l = (local_lock_t *)(lock);
> \
> | ^~~~
> ./include/linux/local_lock.h:17:41: note: in expansion of macro
> ‘__local_lock’
> 17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
> | ^~~~~~~~~~~~
> ./include/linux/percpu-defs.h:245:9: note: in expansion of macro
> ‘__verify_pcpu_ptr’
> 245 | __verify_pcpu_ptr(ptr);
> \
> | ^~~~~~~~~~~~~~~~~
> ./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
> 256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
> | ^~~~~~~~~~~
> ./include/linux/local_lock.h:17:54: note: in expansion of macro
> ‘this_cpu_ptr’
> 17 | #define local_lock(lock)
> __local_lock(this_cpu_ptr(lock))
> |
> ^~~~~~~~~~~~
> mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
> 1518 | local_lock(&percpu_cluster->lock);
> | ^~~~~~~~~~
Ah, that's strange. It compiled on all of my setups (I tested with a
couple of different ones), but I must have missed some cases. Would you
mind sharing your configs so that I can reproduce this compilation error?
(Although I'm sure the kernel test robot will scream at me soon, and its
reports usually include the configs that cause the compilation issue.)
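For what it's worth, the error message suggests local_lock() is being handed a pointer that is not a __percpu address. A minimal sketch of the idiomatic pattern on a per-CPU structure (the names here are hypothetical, and whether this matches what mm/vswap.c intends is an assumption):

    #include <linux/local_lock.h>
    #include <linux/percpu.h>

    /* Hypothetical per-CPU cluster cache, for illustration only */
    struct vswap_pcp_cluster {
            local_lock_t    lock;
            unsigned int    next;
    };

    static DEFINE_PER_CPU(struct vswap_pcp_cluster, vswap_pcp) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    static void vswap_pcp_example(void)
    {
            /*
             * local_lock() wants the per-CPU (__percpu) address of the lock
             * and resolves this_cpu_ptr() internally; passing &ptr->lock,
             * where ptr was already obtained via this_cpu_ptr()/per_cpu_ptr(),
             * trips __verify_pcpu_ptr() on builds with strict per-CPU address
             * space checking, which matches the error above.
             */
            local_lock(&vswap_pcp.lock);
            this_cpu_inc(vswap_pcp.next);
            local_unlock(&vswap_pcp.lock);
    }

If mm/vswap.c already declares percpu_cluster as a per-CPU variable, the fix is likely just to pass the per-CPU address rather than a previously resolved pointer; either way, this is only a guess from the diagnostic.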
>
> > that the lock contention issues that plagued earlier versions have been
> > resolved and performance on the kernel build benchmark is now on-par with
> > baseline. Furthermore, memory overhead has been substantially reduced
> > compared to the last RFC version.
> >
> >
> > I. Motivation
> >
> > Currently, when an anon page is swapped out, a slot in a backing swap
> > device is allocated and stored in the page table entries that refer to
> > the original page. This slot is also used as the "key" to find the
> > swapped out content, as well as the index to swap data structures, such
> > as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> > backing slot in this way is performant and efficient when swap is purely
> > just disk space, and swapoff is rare.
> >
> > However, the advent of many swap optimizations has exposed major
> > drawbacks of this design. The first problem is that we occupy a physical
> > slot in the swap space, even for pages that are NEVER expected to hit
> > the disk: pages compressed and stored in the zswap pool, zero-filled
> > pages, or pages rejected by both of these optimizations when zswap
> > writeback is disabled. This is arguably the central shortcoming of
> > zswap:
> > * In deployments where no disk space can be afforded for swap (such as
> > mobile and embedded devices), users cannot adopt zswap, and are forced
> > to use zram. This is confusing for users, and creates extra burdens
> > for developers, having to develop and maintain similar features for
> > two separate swap backends (writeback, cgroup charging, THP support,
> > etc.). For instance, see the discussion in [4].
> > * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
> > we have swapfiles on the order of tens to hundreds of GBs, which are
> > mostly unused and only exist to enable zswap usage and zero-filled
> > pages swap optimizations.
> > * Tying zswap (and more generally, other in-memory swap backends) to
> > the current physical swapfile infrastructure makes zswap implicitly
> > statically sized. This does not make sense, as unlike disk swap, in
> > which we consume a limited resource (disk space or swapfile space) to
> > save another resource (memory), zswap consumes the same resource it is
> > saving (memory). The more we zswap, the more memory we have available,
> > not less. We are not rationing a limited resource when we limit
> > the size of the zswap pool, but rather we are capping the resource
> > (memory) saving potential of zswap. Under memory pressure, using
> > more zswap is almost always better than the alternative (disk IOs, or
> > even worse, OOMs), and dynamically sizing the zswap pool on demand
> > allows the system to flexibly respond to these precarious scenarios.
> > * Operationally, statically provisioning the swapfile for zswap poses
> > significant challenges, because the sysadmin has to prescribe how
> > much swap is needed a priori, for each combination of
> > (memory size x disk space x workload usage). It is even more
> > complicated when we take into account the variance of memory
> > compression, which changes the reclaim dynamics (and as a result,
> > swap space size requirement). The problem is further exacerbated for
> > users who rely on swap utilization (and exhaustion) as an OOM signal.
> >
> > All of these factors make it very difficult to configure the swapfile
> > for zswap: too small of a swapfile and we risk preventable OOMs and
> > limit the memory saving potentials of zswap; too big of a swapfile
> > and we waste disk space and memory due to swap metadata overhead.
> > This dilemma becomes more drastic in high memory systems, which can
> > have up to TBs worth of memory.
> >
> > Past attempts to decouple disk and compressed swap backends, namely the
> > ghost swapfile approach (see [13]), as well as the alternative
> > compressed swap backend zram, have mainly focused on eliminating the
> > disk space usage of compressed backends. We want a solution that not
> > only tackles that same problem, but also achieves the dynamicization of
> > swap space to maximize the memory saving potentials while reducing
> > operational and static memory overhead.
> >
> > Finally, any swap redesign should support efficient backend transfer,
> > i.e without having to perform the expensive page table walk to
> > update all the PTEs that refer to the swap entry:
> > * The main motivation for this requirement is zswap writeback. To quote
> > Johannes (from [14]): "Combining compression with disk swap is
> > extremely powerful, because it dramatically reduces the worst aspects
> > of both: it reduces the memory footprint of compression by shedding
> > the coldest data to disk; it reduces the IO latencies and flash wear
> > of disk swap through the writeback cache. In practice, this reduces
> > *average event rates of the entire reclaim/paging/IO stack*."
> > * Another motivation is to simplify swapoff, which is both complicated
> > and expensive in the current design, precisely because we are storing
> > an encoding of the backend positional information in the page table,
> > and thus requires a full page table walk to remove these references.
> >
> >
> > II. High Level Design Overview
> >
> > To fix the aforementioned issues, we need an abstraction that separates
> > a swap entry from its physical backing storage. IOW, we need to
> > “virtualize” the swap space: swap clients will work with a dynamically
> > allocated virtual swap slot, storing it in page table entries, and
> > using it to index into various swap-related data structures. The
> > backing storage is decoupled from the virtual swap slot, and the newly
> > introduced layer will “resolve” the virtual swap slot to the actual
> > storage. This layer also manages other metadata of the swap entry, such
> > as its lifetime information (swap count), via a dynamically allocated,
> > per-swap-entry descriptor:
> >
> > struct swp_desc {
> > union {
> > swp_slot_t slot; /* 0 8 */
> > struct zswap_entry * zswap_entry; /* 0 8 */
> > }; /* 0 8 */
> > union {
> > struct folio * swap_cache; /* 8 8 */
> > void * shadow; /* 8 8 */
> > }; /* 8 8 */
> > unsigned int swap_count; /* 16 4 */
> > unsigned short memcgid:16; /* 20: 0 2 */
> > bool in_swapcache:1; /* 22: 0 1 */
> >
> > /* Bitfield combined with previous fields */
> >
> > enum swap_type type:2; /* 20:17 4 */
> >
> > /* size: 24, cachelines: 1, members: 6 */
> > /* bit_padding: 13 bits */
> > /* last cacheline: 24 bytes */
> > };
> >
> > (output from pahole).
> >
> > This design allows us to:
> > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> > simply associate the virtual swap slot with one of the supported
> > backends: a zswap entry, a zero-filled swap page, a slot on the
> > swapfile, or an in-memory page.
> > * Simplify and optimize swapoff: we only have to fault the page in and
> > have the virtual swap slot point to the page instead of the on-disk
> > physical swap slot. No need to perform any page table walking.
> >
> > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > not all "new" overhead, as the swap descriptor will replace:
> > * the swap_cgroup arrays (one per swap type) in the old design, which
> > is a massive source of static memory overhead. With the new design,
> > it is only allocated for used clusters.
> > * the swap tables, which holds the swap cache and workingset shadows.
> > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > indicate whether the swapped out page is zero-filled or not.
> > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > one for allocated slots, and one for bad slots, representing 3 possible
> > states of a slot on the swapfile: allocated, free, and bad.
> > * the zswap tree.
> >
> > So, in terms of additional memory overhead:
> > * For zswap entries, the added memory overhead is rather minimal. The
> > new indirection pointer neatly replaces the existing zswap tree.
> > We really only incur less than one word of overhead for swap count
> > blow up (since we no longer use swap continuation) and the swap type.
> > * For physical swap entries, the new design will impose fewer than 3 words
> > of memory overhead. However, as noted above, this overhead is only for
> > actively used swap entries, whereas in the current design the overhead is
> > static (including the swap cgroup array for example).
> >
> > The primary victim of this overhead will be zram users. However, as
> > zswap now no longer takes up disk space, zram users can consider
> > switching to zswap (which, as a bonus, has a lot of useful features
> > out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > LRU-ordering writeback, etc.).
> >
> > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > 8,388,608 swap entries), and we use zswap.
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 0.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 48.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 96.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 121.00 MB
> > * Vswap total overhead: 144.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 153.00 MB
> > * Vswap total overhead: 193.00 MB
> >
> > So even in the worst case scenario for virtual swap, i.e when we
> > somehow have an oracle to correctly size the swapfile for zswap
> > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > 0.12% of the total swapfile :)
> >
> > In practice, the overhead will be closer to the 50-75% usage case, as
> > systems tend to leave swap headroom for pathological events or sudden
> > spikes in memory requirements. The added overhead in these cases are
> > practically negligible. And in deployments where swapfiles for zswap
> > were previously sparsely used, switching over to virtual swap will
> > actually reduce memory overhead.
> >
> > Doing the same math for the disk swap, which is the worst case for
> > virtual swap in terms of swap backends:
> >
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.
> >
> > Please see the attached patches for more implementation details.
> >
> >
> > III. Usage and Benchmarking
> >
> > This patch series introduce no new syscalls or userspace API. Existing
> > userspace setups will work as-is, except we no longer have to create a
> > swapfile or set memory.swap.max if we want to use zswap, as zswap is no
> > longer tied to physical swap. The zswap pool will be automatically and
> > dynamically sized based on memory usage and reclaim dynamics.
> >
> > To measure the performance of the new implementation, I have run the
> > following benchmarks:
> >
> > 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
> >
> > Using zswap as the backend:
> >
> > Baseline:
> > real: mean: 185.2s, stdev: 0.93s
> > sys: mean: 683.7s, stdev: 33.77s
> >
> > Vswap:
> > real: mean: 184.88s, stdev: 0.57s
> > sys: mean: 675.14s, stdev: 32.8s
>
> Can you show your user space time as well to complete the picture?
Will do next time! I used to include user time as well, but I noticed
that folks (see e.g. [1]) only include systime, not even real time,
so I figured nobody cares about user time :)
(I still include real time because some of my past work improves sys
time but regresses real time, so I figure that's relevant).
[1]: https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-0-fe0b67ef0215@tencent.com/
But yeah no big deal. I'll dig through my logs to see if I still have
the numbers, but if not I'll include it in the next version.
>
> How many runs do you have for stdev 32.8s?
5 runs! I averaged the results of 5 runs.
>
> >
> > We actually see a slight improvement in systime (by 1.5%) :) This is
> > likely because we no longer have to perform swap charging for zswap
> > entries, and virtual swap allocator is simpler than that of physical
> > swap.
> >
> > Using SSD swap as the backend:
> Please include zram swap test data as well. Android heavily uses zram
> for swapping.
> >
> > Baseline:
> > real: mean: 200.3s, stdev: 2.33s
> > sys: mean: 489.88s, stdev: 9.62s
> >
> > Vswap:
> > real: mean: 201.47s, stdev: 2.98s
> > sys: mean: 487.36s, stdev: 5.53s
> >
> > The performance is neck-to-neck.
>
> I strongly suspect there is some performance difference that hasn't
> been covered by your test yet. Need more confirmation from others on
> the performance measurement. Swap testing is tricky: you want to push
> the stress to barely within the OOM limit. Need more data.
Very fair point :) I will say though - the kernel build test, with the
memory.max limit set, does generate a sizable amount of swapping, and
does OOM if you don't set up swap. Take my word for it for now, but I
will try to include average per-run (z)swap activity stats (zswpout,
zswpin, etc.) in future versions if you're interested :)
I've been trying to run more stress tests to trigger crashes and
performance regressions. One of the big reasons I haven't sent
anything until now is to fix obvious performance issues (the
aforementioned lock contention) and bugs. It's a complicated piece of
work.
As always, I would love to receive code/design feedback from you (and
Kairui, and the other swap reviewers), and I would very much appreciate
it if other swap folks could play with the patch series on their setups
as well for performance testing, or let me know if there is any
particular case they're interested in :)
Thanks for your review, Chris!
>
> Chris
>
> >
> >
> > IV. Future Use Cases
> >
> > While the patch series focus on two applications (decoupling swap
> > backends and swapoff optimization/simplification), this new,
> > future-proof design also allows us to implement new swap features more
> > easily and efficiently:
> >
> > * Multi-tier swapping (as mentioned in [5]), with transparent
> > transferring (promotion/demotion) of pages across tiers (see [8] and
> > [9]). Similar to swapoff, with the old design we would need to
> > perform the expensive page table walk.
> > * Swapfile compaction to alleviate fragmentation (as proposed by Ying
> > Huang in [6]).
> > * Mixed backing THP swapin (see [7]): Once you have pinned down the
> > backing store of THPs, then you can dispatch each range of subpages
> > to appropriate backend swapin handler.
> > * Swapping a folio out with discontiguous physical swap slots
> > (see [10]).
> > * Zswap writeback optimization: The current architecture pre-reserves
> > physical swap space for pages when they enter the zswap pool, giving
> > the kernel no flexibility at writeback time. With the virtual swap
> > implementation, the backends are decoupled, and physical swap space
> > is allocated on-demand at writeback time, at which point we can make
> > much smarter decisions: we can batch multiple zswap writeback
> > operations into a single IO request, allocating contiguous physical
> > swap slots for that request. We can even perform compressed writeback
> > (i.e writing these pages without decompressing them) (see [12]).
> >
> >
> > V. References
> >
> > [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> > [2]: https://lwn.net/Articles/932077/
> > [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> > [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> > [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> > [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> > [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> > [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> > [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
> > [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
> > [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
> > [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
> > [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
> > [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
> >
> > Nhat Pham (20):
> > mm/swap: decouple swap cache from physical swap infrastructure
> > swap: rearrange the swap header file
> > mm: swap: add an abstract API for locking out swapoff
> > zswap: add new helpers for zswap entry operations
> > mm/swap: add a new function to check if a swap entry is in swap
> > cached.
> > mm: swap: add a separate type for physical swap slots
> > mm: create scaffolds for the new virtual swap implementation
> > zswap: prepare zswap for swap virtualization
> > mm: swap: allocate a virtual swap slot for each swapped out page
> > swap: move swap cache to virtual swap descriptor
> > zswap: move zswap entry management to the virtual swap descriptor
> > swap: implement the swap_cgroup API using virtual swap
> > swap: manage swap entry lifecycle at the virtual swap layer
> > mm: swap: decouple virtual swap slot from backing store
> > zswap: do not start zswap shrinker if there is no physical swap slots
> > swap: do not unnecesarily pin readahead swap entries
> > swapfile: remove zeromap bitmap
> > memcg: swap: only charge physical swap slots
> > swap: simplify swapoff using virtual swap
> > swapfile: replace the swap map with bitmaps
> >
> > Documentation/mm/swap-table.rst | 69 --
> > MAINTAINERS | 2 +
> > include/linux/cpuhotplug.h | 1 +
> > include/linux/mm_types.h | 16 +
> > include/linux/shmem_fs.h | 7 +-
> > include/linux/swap.h | 135 ++-
> > include/linux/swap_cgroup.h | 13 -
> > include/linux/swapops.h | 25 +
> > include/linux/zswap.h | 17 +-
> > kernel/power/swap.c | 6 +-
> > mm/Makefile | 5 +-
> > mm/huge_memory.c | 11 +-
> > mm/internal.h | 12 +-
> > mm/memcontrol-v1.c | 6 +
> > mm/memcontrol.c | 142 ++-
> > mm/memory.c | 101 +-
> > mm/migrate.c | 13 +-
> > mm/mincore.c | 15 +-
> > mm/page_io.c | 83 +-
> > mm/shmem.c | 215 +---
> > mm/swap.h | 157 +--
> > mm/swap_cgroup.c | 172 ---
> > mm/swap_state.c | 306 +----
> > mm/swap_table.h | 78 +-
> > mm/swapfile.c | 1518 ++++-------------------
> > mm/userfaultfd.c | 18 +-
> > mm/vmscan.c | 28 +-
> > mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
> > mm/zswap.c | 142 +--
> > 29 files changed, 2853 insertions(+), 2485 deletions(-)
> > delete mode 100644 Documentation/mm/swap-table.rst
> > delete mode 100644 mm/swap_cgroup.c
> > create mode 100644 mm/vswap.c
> >
> >
> > base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
> > --
> > 2.47.3
> >
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 17:59 ` Kairui Song
@ 2026-02-10 18:52 ` Johannes Weiner
2026-02-10 19:11 ` Nhat Pham
` (2 subsequent siblings)
3 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2026-02-10 18:52 UTC (permalink / raw)
To: Kairui Song
Cc: Nhat Pham, linux-mm, akpm, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Hello Kairui,
On Wed, Feb 11, 2026 at 01:59:34AM +0800, Kairui Song wrote:
> On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > * Reducing the size of the swap descriptor from 48 bytes to 24
> > bytes, i.e another 50% reduction in memory overhead from v2.
>
> Honestly if you keep reducing that you might just end up
> reimplementing the swap table format :)
Yeah, it turns out we need the same data points to describe and track
a swapped out page ;)
> > * Operationally, statically provisioning the swapfile for zswap poses
> > significant challenges, because the sysadmin has to prescribe how
> > much swap is needed a priori, for each combination of
> > (memory size x disk space x workload usage). It is even more
> > complicated when we take into account the variance of memory
> > compression, which changes the reclaim dynamics (and as a result,
> > swap space size requirement). The problem is further exacerbated for
> > users who rely on swap utilization (and exhaustion) as an OOM signal.
>
> So I thought about it again, this one seems not to be an issue. In
> most cases, having a 1:1 virtual swap setup is enough, and very soon
> the static overhead will be really trivial. There won't even be any
> fragmentation issue either, since if the physical memory size is
> identical to swap space, then you can always find a matching part. And
> besides, dynamic growth of swap files is actually very doable and
> useful, that will make physical swap files adjustable at runtime, so
> users won't need to waste a swap type id to extend physical swap
> space.
The issue is address space separation. We don't want things inside the
compressed pool to consume disk space; nor do we want entries that
live on disk to take usable space away from the compressed pool.
The regression reports are fair, thanks for highlighting those. And
whether to make this optional is also a fair discussion.
But some of the numbers comparisons really strike me as apples to
oranges comparisons. It seems to miss the core issue this series is
trying to address.
> > * Another motivation is to simplify swapoff, which is both complicated
> > and expensive in the current design, precisely because we are storing
> > an encoding of the backend positional information in the page table,
> > and thus requires a full page table walk to remove these references.
>
> The swapoff here is not really a clean swapoff, minor faults will
> still be triggered afterwards, and metadata is not released. So this
> new swapoff cannot really guarantee the same performance as the old
> swapoff.
That seems very academic to me. The goal is to relinquish disk space,
and these patches make that a lot faster.
Let's put it the other way round: if today we had a fast swapoff read
sequence with lazy minor faults to resolve page tables, would we
accept patches that implement the expensive try_to_unuse() scans and
make it mandatory? Considering the worst-case runtime it can cause?
I don't think so. We have this scan because the page table references
are pointing to disk slots, and this is the only way to free them.
> And on the other hand we can already just read everything
> into the swap cache then ignore the page table walk with the older
> design too, that's just not a clean swapoff.
How can you relinquish the disk slot as long as the swp_entry_t is in
circulation?
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 17:59 ` Kairui Song
2026-02-10 18:52 ` Johannes Weiner
@ 2026-02-10 19:11 ` Nhat Pham
2026-02-10 19:23 ` Nhat Pham
` (2 more replies)
2026-02-10 21:58 ` Chris Li
2026-02-20 21:05 ` [PATCH] vswap: fix poor batching behavior of vswap free path Nhat Pham
3 siblings, 3 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-10 19:11 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
On Tue, Feb 10, 2026 at 10:00 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Anyway, resending this (in-reply-to patch 1 of the series):
>
> Hi Nhat,
>
> > Changelog:
> > * RFC v2 -> v3:
> > * Implement a cluster-based allocation algorithm for virtual swap
> > slots, inspired by Kairui Song and Chris Li's implementation, as
> > well as Johannes Weiner's suggestions. This eliminates the lock
> > contention issues on the virtual swap layer.
> > * Re-use swap table for the reverse mapping.
> > * Remove CONFIG_VIRTUAL_SWAP.
>
> I really do think we better make this optional, not a replacement or
> mandatory. There are many hard to evaluate effects as this
> fundamentally changes the swap workflow with a lot of behavior changes
> at once. e.g. it seems the folio will be reactivated instead of
> splitted if the physical swap device is fragmented; slot is allocated
> at IO and not at unmap, and maybe many others. Just like zswap is
> optional. Some common workloads would see an obvious performance or
> memory usage regression following this design, see below.
Ideally, if we can close the performance gap and have only one
version, that would be best :)
The problem with making it optional, i.e. maintaining effectively two
swap implementations, is that it will make the patch series unreadable
and unreviewable, and the code base unmaintainable :) You'll have 2x
the amount of code to reason about and test, and many more merge
conflicts at rebase and cherry-pick time. And any improvement to one
version takes extra work to graft onto the other.
>
> > * Reducing the size of the swap descriptor from 48 bytes to 24
> > bytes, i.e another 50% reduction in memory overhead from v2.
>
> Honestly if you keep reducing that you might just end up
> reimplementing the swap table format :)
There's nothing wrong with that ;)
I like the swap table format (and your cluster-based swap allocator) a
lot. This patch series does not aim to remove that design - I just
want to separate the address space of physical and virtual swaps to
enable new use cases...
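To make that separation concrete, here is a rough sketch of what resolving a virtual slot through the descriptor could look like (only VSWAP_SWAPFILE appears in the posted code; VSWAP_ZSWAP, VSWAP_ZERO and the helper functions are hypothetical illustrations, not the series' actual API):

    /*
     * Hypothetical sketch of dispatching on the backend recorded in the
     * per-entry descriptor. Only VSWAP_SWAPFILE is taken from the posted
     * code; the other enum values and helpers are made up here.
     */
    static bool vswap_load(struct swp_desc *desc, struct folio *folio)
    {
            switch (desc->type) {
            case VSWAP_SWAPFILE:
                    /* backed by a physical slot on a swapfile: do the disk read */
                    return vswap_read_from_slot(desc->slot, folio);
            case VSWAP_ZSWAP:
                    /* backed by a compressed zswap entry: decompress in place */
                    return vswap_load_zswap(desc->zswap_entry, folio);
            case VSWAP_ZERO:
                    /* zero-filled page: no backing store to consult at all */
                    folio_zero_range(folio, 0, folio_size(folio));
                    return true;
            default:
                    return false;
            }
    }

Since the page tables only ever hold the virtual slot, the backend behind a descriptor can change without a page table walk.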
>
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in the mm-stable branch that I would need to
> > coordinate with, but I would like to send this out as an update, to show
> > that the lock contention issues that plagued earlier versions have been
> > resolved and performance on the kernel build benchmark is now on-par with
> > baseline. Furthermore, memory overhead has been substantially reduced
> > compared to the last RFC version.
>
> Thanks for the effort!
>
> > * Operationally, statically provisioning the swapfile for zswap poses
> > significant challenges, because the sysadmin has to prescribe how
> > much swap is needed a priori, for each combination of
> > (memory size x disk space x workload usage). It is even more
> > complicated when we take into account the variance of memory
> > compression, which changes the reclaim dynamics (and as a result,
> > swap space size requirement). The problem is further exacerbated for
> > users who rely on swap utilization (and exhaustion) as an OOM signal.
>
> So I thought about it again, this one seems not to be an issue. In
I mean, it is a real production issue :) We have a variety of server
machines and services. Each of the former has its own memory and drive
size. Each of the latter has its own access characteristics,
compressibility and latency tolerance (and hence would prefer a
different swapping solution - zswap, disk swap, or zswap x disk swap).
Coupled with the fact that multiple services can now co-occur on one
host, and one service can be deployed on different kinds of hosts,
statically sizing the swapfile becomes operationally impossible and
leaves a lot of wins on the table. So swap space has to be dynamic.
> most cases, having a 1:1 virtual swap setup is enough, and very soon
> the static overhead will be really trivial. There won't even be any
> fragmentation issue either, since if the physical memory size is
> identical to swap space, then you can always find a matching part. And
> besides, dynamic growth of swap files is actually very doable and
> useful, that will make physical swap files adjustable at runtime, so
> users won't need to waste a swap type id to extend physical swap
> space.
By "dynamic growth of swap files", do you mean dynamically adjusting
the size of the swapfile? then that capacity does not exist right now,
and I don't see a good design laid out for it... At the very least,
the swap allocator needs to be dynamic in nature. I assume it's going
to look something very similar to vswap's current attempt, which
relies on a tree structure (radix tree i.e xarray). Sounds familiar?
;)
I feel like each of the problems I mention in this cover letter can be
partially solved with some amount of hacks, but none of them will
solve it all. And once you slap all the hacks together, you just get
virtual swap, potentially shoved inside a specific backend codebase
(zswap or zram). That's not... ideal.
>
> > * Another motivation is to simplify swapoff, which is both complicated
> > and expensive in the current design, precisely because we are storing
> > an encoding of the backend positional information in the page table,
> > and thus requires a full page table walk to remove these references.
>
> The swapoff here is not really a clean swapoff, minor faults will
> still be triggered afterwards, and metadata is not released. So this
> new swapoff cannot really guarantee the same performance as the old
> swapoff. And on the other hand we can already just read everything
> into the swap cache then ignore the page table walk with the older
> design too, that's just not a clean swapoff.
I don't understand your point regarding "reading everything into the
swap cache". Yes, you can do that, but you would still lock the swap
device in place, because the page table entries still refer to slots
on the physical swap device - you cannot free the swap device, or the
space on disk, or even the swapfile's metadata (especially since the
swap cache is now intertwined with the physical swap layer).
>
> > struct swp_desc {
> > union {
> > swp_slot_t slot; /* 0 8 */
> > struct zswap_entry * zswap_entry; /* 0 8 */
> > }; /* 0 8 */
> > union {
> > struct folio * swap_cache; /* 8 8 */
> > void * shadow; /* 8 8 */
> > }; /* 8 8 */
> > unsigned int swap_count; /* 16 4 */
> > unsigned short memcgid:16; /* 20: 0 2 */
> > bool in_swapcache:1; /* 22: 0 1 */
>
> A standalone bit for swapcache looks like the old SWAP_HAS_CACHE that
> causes many issues...
Yeah, this was based on 6.19, which did not have your swap cache change yet :)
I have taken a look at your latest swap table work in mm-stable, and I
think most of it can conceptually be incorporated into this line of
work as well.
Chiefly, the new swap cache synchronization scheme (i.e. whoever puts
the folio in the swap cache first gets exclusive rights) still works in
the virtual swap world (and hence so does the removal of the swap cache
pin, which is one bit in the virtual swap descriptor).
Similarly, is there a reason we cannot hold the folio lock in place of
the cluster lock in the virtual swap world? The same goes for a lot of
the memory overhead reduction tricks (such as using the shadow for the
cgroup id instead of a separate swap_cgroup unsigned short field). I
think comparing the two this way is a bit apples-to-oranges (especially
given the new features enabled by vswap).
[...]
> That 3 - 4 times more memory usage, quite a trade off. With a
> 128G device, which is not something rare, it would be 1G of memory.
> Swap table p3 / p4 is about 320M / 256M, and we do have a way to cut
> that down close to be <1 byte or 3 byte per page with swap table
> compaction, which was discussed in LSFMM last year, or even 1 bit
> which was once suggested by Baolin, that would make it much smaller
> down to <24MB (This is just an idea for now, but the compaction is
> very doable as we already have "LRU"s for swap clusters in swap
> allocator).
>
> I don't think it looks good as a mandatory overhead. We do have a huge
> user base of swap over many different kinds of devices, it was not
> long ago two new kernel bugzilla issue or bug reported was sent to
> the maillist about swap over disk, and I'm still trying to investigate
> one of them which seems to be actually a page LRU issue and not swap
> problem.. OK a little off topic, anyway, I'm not saying that we don't
> want more features, as I mentioned above, it would be better if this
> can be optional and minimal. See more test info below.
Side note - I might have missed this. If it's still ongoing, I would
love to help debug it :)
>
> > We actually see a slight improvement in systime (by 1.5%) :) This is
> > likely because we no longer have to perform swap charging for zswap
> > entries, and virtual swap allocator is simpler than that of physical
> > swap.
>
> Congrats! Yeah, I guess that's because vswap has a smaller lock scope
> than zswap with a reduced callpath?
Ah yeah, that too. I neglected to mention this, but with vswap you can
merge several swap operations in the zswap code path and no longer have
to release and then reacquire the swap locks, since zswap entries live
in the same lock scope as swap cache entries.
It's more of a side note either way, because my main goal with this
patch series is to enable new features. Getting a performance win is
always nice, of course :)
>
> >
> > Using SSD swap as the backend:
> >
> > Baseline:
> > real: mean: 200.3s, stdev: 2.33s
> > sys: mean: 489.88s, stdev: 9.62s
> >
> > Vswap:
> > real: mean: 201.47s, stdev: 2.98s
> > sys: mean: 487.36s, stdev: 5.53s
> >
> > The performance is neck-to-neck.
>
> Thanks for the bench, but please also test with global pressure too.
Do you mean using memory to the point where it triggers the global watermarks?
> One mistake I made when working on the prototype of swap tables was
> only focusing on cgroup memory pressure, which is really not how
> everyone uses Linux, and that's why I reworked it for a long time to
> tweak the RCU allocation / freeing of swap table pages so there won't
> be any regression even for lowend and global pressure. That's kind of
> critical for devices like Android.
>
> I did an overnight bench on this with global pressure, comparing to
> mainline 6.19 and swap table p3 (I do include such test for each swap
> table series, p2 / p3 is close so I just rebased the latest p3 on top of
> your base commit just to be fair and that's easier for me too) and it
> doesn't look that good.
>
> Test machine setup for vm-scalability:
> # lscpu | grep "Model name"
> Model name: AMD EPYC 7K62 48-Core Processor
>
> # free -m
> total used free shared buff/cache available
> Mem: 31582 909 26388 8 4284 29989
> Swap: 40959 41 40918
>
> The swap setup follows the recommendation from Huang
> (https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).
>
> Test (average of 18 test run):
> vm-scalability/usemem --init-time -O -y -x -n 1 56G
>
> 6.19:
> Throughput: 618.49 MB/s (stdev 31.3)
> Free latency: 5754780.50us (stdev 69542.7)
>
> swap-table-p3 (3.8%, 0.5% better):
> Throughput: 642.02 MB/s (stdev 25.1)
> Free latency: 5728544.16us (stdev 48592.51)
>
> vswap (3.2%, 244% worse):
> Throughput: 598.67 MB/s (stdev 25.1)
> Free latency: 13987175.66us (stdev 125148.57)
>
> That's a huge regression with freeing. I have a vm-scalability test
> matrix, not every setup has such significant >200% regression, but on
> average the freeing time is about at least 15 - 50% slower (for
> example /data/vm-scalability/usemem --init-time -O -y -x -n 32 1536M
> the regression is about 2583221.62us vs 2153735.59us). Throughput is
> all lower too.
>
> Freeing is important as it was causing many problems before, it's the
> reason why we had a swap slot freeing cache years ago (and later we
> removed that since the freeing cache causes more problems and swap
> allocator already improved it better than having the cache). People
> even tried to optimize that:
> https://lore.kernel.org/linux-mm/20250909065349.574894-1-liulei.rjpt@vivo.com/
> (This seems to be an already fixed downstream issue, solved by swap allocator
> or swap table). Some workloads might amplify the free latency greatly
> and cause serious lags as shown above.
>
> Another thing I personally care about is how swap works on my daily
> laptop :), building the kernel in a 2G test VM using NVME as swap,
> which is a very practical workload I do everyday, the result is also
> not good (average of 8 test run, make -j12):
Hmm this one I don't think I can reproduce without your laptop ;)
Jokes aside, I did try to run the kernel build with disk swapping, and
the performance is on par with baseline. Swap performance with NVME
swap tends to be dominated by IO work in my experiments. Do you think
I missed something here? Maybe it's the concurrency difference (since
I always run with -j$(nproc), i.e the number of workers == the number
of processors).
> #free -m
> total used free shared buff/cache available
> Mem: 1465 216 1026 0 300 1248
> Swap: 4095 36 4059
>
> 6.19 systime:
> 109.6s
> swap-table p3:
> 108.9s
> vswap systime:
> 118.7s
>
> On a build server, it's also slower (make -j48 with 4G memory VM and
> NVME swap, average of 10 testrun):
> # free -m
> total used free shared buff/cache available
> Mem: 3877 1444 2019 737 1376 2432
> Swap: 32767 1886 30881
>
> # lscpu | grep "Model name"
> Model name: Intel(R) Xeon(R) Platinum
> 8255C CPU @ 2.50GHz
>
> 6.19 systime:
> 435.601s
> swap-table p3:
> 432.793s
> vswap systime:
> 455.652s
>
> In conclusion it's about 4.3 - 8.3% slower for common workloads under
> global pressure, and there is an up to 200% regression on freeing. ZRAM
> shows an even larger workload regression but I'll skip that part since
> your series is focusing on zswap now. Redis is also ~20% slower
> compared to mm-stable (327515.00 RPS vs 405827.81 RPS), that's mostly
> due to swap-table-p2 in mm-stable so I didn't do further comparisons.
I'll see if I can reproduce the issues! I'll start with the usemem one
first, as that seems easier to reproduce...
>
> So if that's not a bug with this series, I think the double free or
It could be a non-crashing bug that subtly regresses certain swap
operations, but yeah let me study your test case first!
> decoupling of swap / underlying slots might be the problem with the
> freeing regression shown above. That's really a serious issue, and the
> global pressure might be a critical issue too as the metadata is much
> larger, and is already causing regressions for very common workloads.
> Low end users could hit the min watermark easily and could have
> serious jitters or allocation failures.
>
> That's part of the issue I've found, so I really do think we need a
> flexible way to implement that and not have a mandatory layer. After
> swap table P4 we should be able to figure out a way to fit all needs,
> with a clean defined set of swap API, metadata and layers, as was
> discussed at LSFMM last year.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 19:11 ` Nhat Pham
@ 2026-02-10 19:23 ` Nhat Pham
2026-02-12 5:07 ` Chris Li
2026-02-17 23:36 ` Nhat Pham
2 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-10 19:23 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
On Tue, Feb 10, 2026 at 11:11 AM Nhat Pham <nphamcs@gmail.com> wrote:
>>
> Hmm this one I don't think I can reproduce without your laptop ;)
>
> Jokes aside, I did try to run the kernel build with disk swapping, and
> the performance is on par with baseline. Swap performance with NVME
> swap tends to be dominated by IO work in my experiments. Do you think
> I missed something here? Maybe it's the concurrency difference (since
> I always run with -j$(nproc), i.e the number of workers == the number
> of processors).
Ah, I just noticed that your numbers include only systime. Ignore my IO
comments then.
(I still think that in a real production system with disk swapping
enabled, IO wait time is going to be really important. If you're going
to use disk swap, then this affects real time just as much as, if not
more than, kernel CPU overhead.)
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 2:36 ` Johannes Weiner
@ 2026-02-10 21:24 ` Chris Li
2026-02-10 23:01 ` Johannes Weiner
0 siblings, 1 reply; 52+ messages in thread
From: Chris Li @ 2026-02-10 21:24 UTC (permalink / raw)
To: Johannes Weiner
Cc: Nhat Pham, akpm, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
huang.ying.caritas, ryan.roberts, shikemeng, viro, baohua, bhe,
osalvador, christophe.leroy, pavel, linux-mm, kernel-team,
linux-kernel, cgroups, linux-pm, peterx, riel, joshua.hahnjy,
npache, gourry, axelrasmussen, yuanchu, weixugc, rafael, jannh,
pfalcato, zhengqi.arch
Hi Johannes,
On Mon, Feb 9, 2026 at 6:36 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Chris,
>
> On Mon, Feb 09, 2026 at 04:20:21AM -0800, Chris Li wrote:
> > Is the per swap slot entry overhead 24 bytes in your implementation?
> > The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
> > big jump. You can argue that 8->24 is not a big jump . But it is an
> > unnecessary price compared to the alternatives, which is 8 dynamic +
> > 4(optional redirect).
>
> No, this is not the net overhead.
I am talking about the total metadata overhead per swap entry. Not net.
> The descriptor consolidates and eliminates several other data
> structures.
It adds members that were previously not there and makes some members
bigger along the way. For example, the swap_map count goes from 1 byte
to 4 bytes.
>
> Here is the more detailed breakdown:
It seems you did not finish your sentence before sending your reply.
Anyway, I saw the total per-swap-entry overhead bump to 24 bytes
dynamic. Let me know the correct number for VS if you disagree.
Chris
> > > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > > not all "new" overhead, as the swap descriptor will replace:
> > > * the swap_cgroup arrays (one per swap type) in the old design, which
> > > is a massive source of static memory overhead. With the new design,
> > > it is only allocated for used clusters.
> > > * the swap tables, which holds the swap cache and workingset shadows.
> > > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > > indicate whether the swapped out page is zero-filled or not.
> > > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > > one for allocated slots, and one for bad slots, representing 3 possible
> > > states of a slot on the swapfile: allocated, free, and bad.
> > > * the zswap tree.
> > >
> > > So, in terms of additional memory overhead:
> > > * For zswap entries, the added memory overhead is rather minimal. The
> > > new indirection pointer neatly replaces the existing zswap tree.
> > > We really only incur less than one word of overhead for swap count
> > > blow up (since we no longer use swap continuation) and the swap type.
> > > * For physical swap entries, the new design will impose fewer than 3 words
> > > memory overhead. However, as noted above this overhead is only for
> > > actively used swap entries, whereas in the current design the overhead is
> > > static (including the swap cgroup array for example).
> > >
> > > The primary victim of this overhead will be zram users. However, as
> > > zswap now no longer takes up disk space, zram users can consider
> > > switching to zswap (which, as a bonus, has a lot of useful features
> > > out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > > LRU-ordering writeback, etc.).
> > >
> > > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > > 8,388,608 swap entries), and we use zswap.
> > >
> > > 0% usage, or 0 entries: 0.00 MB
> > > * Old design total overhead: 25.00 MB
> > > * Vswap total overhead: 0.00 MB
> > >
> > > 25% usage, or 2,097,152 entries:
> > > * Old design total overhead: 57.00 MB
> > > * Vswap total overhead: 48.25 MB
> > >
> > > 50% usage, or 4,194,304 entries:
> > > * Old design total overhead: 89.00 MB
> > > * Vswap total overhead: 96.50 MB
> > >
> > > 75% usage, or 6,291,456 entries:
> > > * Old design total overhead: 121.00 MB
> > > * Vswap total overhead: 144.75 MB
> > >
> > > 100% usage, or 8,388,608 entries:
> > > * Old design total overhead: 153.00 MB
> > > * Vswap total overhead: 193.00 MB
> > >
> > > So even in the worst case scenario for virtual swap, i.e when we
> > > somehow have an oracle to correctly size the swapfile for zswap
> > > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > > 0.12% of the total swapfile :)
> > >
> > > In practice, the overhead will be closer to the 50-75% usage case, as
> > > systems tend to leave swap headroom for pathological events or sudden
> > > spikes in memory requirements. The added overhead in these cases is
> > > practically negligible. And in deployments where swapfiles for zswap
> > > are previously sparsely used, switching over to virtual swap will
> > > actually reduce memory overhead.
> > >
> > > Doing the same math for the disk swap, which is the worst case for
> > > virtual swap in terms of swap backends:
> > >
> > > 0% usage, or 0 entries: 0.00 MB
> > > * Old design total overhead: 25.00 MB
> > > * Vswap total overhead: 2.00 MB
> > >
> > > 25% usage, or 2,097,152 entries:
> > > * Old design total overhead: 41.00 MB
> > > * Vswap total overhead: 66.25 MB
> > >
> > > 50% usage, or 4,194,304 entries:
> > > * Old design total overhead: 57.00 MB
> > > * Vswap total overhead: 130.50 MB
> > >
> > > 75% usage, or 6,291,456 entries:
> > > * Old design total overhead: 73.00 MB
> > > * Vswap total overhead: 194.75 MB
> > >
> > > 100% usage, or 8,388,608 entries:
> > > * Old design total overhead: 89.00 MB
> > > * Vswap total overhead: 259.00 MB
> > >
> > > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > > again in the worst case when we have a sizing oracle.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 17:59 ` Kairui Song
2026-02-10 18:52 ` Johannes Weiner
2026-02-10 19:11 ` Nhat Pham
@ 2026-02-10 21:58 ` Chris Li
2026-02-20 21:05 ` [PATCH] vswap: fix poor batching behavior of vswap free path Nhat Pham
3 siblings, 0 replies; 52+ messages in thread
From: Chris Li @ 2026-02-10 21:58 UTC (permalink / raw)
To: Kairui Song
Cc: Nhat Pham, linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, huang.ying.caritas, ryan.roberts, shikemeng,
viro, baohua, bhe, osalvador, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
Hi Kairui,
Thank you so much for the performance test.
I will only comment on the performance numbers in this sub-thread.
On Tue, Feb 10, 2026 at 10:00 AM Kairui Song <ryncsn@gmail.com> wrote:
> Actually this worst case is a very common case... see below.
>
> > 0% usage, or 0 entries: 0.00 MB
> > * Old design total overhead: 25.00 MB
> > * Vswap total overhead: 2.00 MB
> >
> > 25% usage, or 2,097,152 entries:
> > * Old design total overhead: 41.00 MB
> > * Vswap total overhead: 66.25 MB
> >
> > 50% usage, or 4,194,304 entries:
> > * Old design total overhead: 57.00 MB
> > * Vswap total overhead: 130.50 MB
> >
> > 75% usage, or 6,291,456 entries:
> > * Old design total overhead: 73.00 MB
> > * Vswap total overhead: 194.75 MB
> >
> > 100% usage, or 8,388,608 entries:
> > * Old design total overhead: 89.00 MB
> > * Vswap total overhead: 259.00 MB
> >
> > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > again in the worst case when we have a sizing oracle.
>
> Hmm.. With the swap table we will have a stable 8 bytes per slot in
> all cases, in current mm-stable we use 11 bytes (8 bytes dyn and 3
> bytes static), and in the posted p3 we already get 10 bytes (8 bytes
> dyn and 2 bytes static). P4 or follow up was already demonstrated
> last year with working code, and it makes everything dynamic
> (8 bytes fully dyn, I'll rebase and send that once p3 is merged).
>
> So with mm-stable and follow up, for 32G swap device:
>
> 0% usage, or 0/8,388,608 entries: 0.00 MB
> * mm-stable total overhead: 25.50 MB (which is swap table p2)
> * swap-table p3 overhead: 17.50 MB
> * swap-table p4 overhead: 0.50 MB
> * Vswap total overhead: 2.00 MB
>
> 100% usage, or 8,388,608/8,388,608 entries:
> * mm-stable total overhead: 89.5 MB (which is swap table p2)
> * swap-table p3 overhead: 81.5 MB
> * swap-table p4 overhead: 64.5 MB
> * Vswap total overhead: 259.00 MB
>
> That's 3 - 4 times more memory usage, quite a trade-off. With a
Agree. My main complaint about VS has been the per-swap-entry metadata
overhead. This VS series reverted the swap table, but its memory and
CPU performance are worse than the swap table's.
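For reference, the 100%-usage row in the table above can be reproduced
from rough per-slot constants: about 24 B of descriptor plus 8 B of
reverse-map table per used vswap slot, versus a flat 8 B per slot for
swap table p4, with the small static bitmaps approximated so the totals
line up (a back-of-the-envelope sketch, not exact accounting):

#include <stdio.h>

int main(void)
{
	const double MB = 1024.0 * 1024.0;
	const long slots = 8388608;	/* 32 GiB of swap / 4 KiB pages */

	/* vswap: 24 B descriptor + 8 B reverse map per used slot,
	 * plus ~3 MiB of static bitmaps / cluster metadata (assumed). */
	double vswap = slots * (24 + 8) / MB + 3.0;

	/* swap table p4: 8 B per slot, ~0.5 MiB static. */
	double p4 = slots * 8 / MB + 0.5;

	printf("vswap ~%.1f MiB, swap-table p4 ~%.1f MiB\n", vswap, p4);
	return 0;	/* prints roughly 259.0 vs 64.5 MiB */
}

Scaling the same constants to a 128G device gives roughly 1 GiB versus
256 MiB, matching the figures quoted below.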
> 128G device, which is not something rare, it would be 1G of memory.
> Swap table p3 / p4 is about 320M / 256M, and we do have a way to cut
> that down close to be <1 byte or 3 byte per page with swap table
> compaction, which was discussed in LSFMM last year, or even 1 bit
> which was once suggested by Baolin, that would make it much smaller
> down to <24MB (This is just an idea for now, but the compaction is
> very doable as we already have "LRU"s for swap clusters in swap
> allocator).
>
> I don't think it looks good as a mandatory overhead. We do have a huge
> user base of swap over many different kinds of devices, it was not
> long ago that two new kernel bugzilla issues or bug reports were sent to
> the mailing list about swap over disk, and I'm still trying to investigate
> one of them which seems to be actually a page LRU issue and not swap
> problem.. OK a little off topic, anyway, I'm not saying that we don't
> want more features, as I mentioned above, it would be better if this
> can be optional and minimal. See more test info below.
>
> > We actually see a slight improvement in systime (by 1.5%) :) This is
> > likely because we no longer have to perform swap charging for zswap
> > entries, and virtual swap allocator is simpler than that of physical
> > swap.
>
> Congrats! Yeah, I guess that's because vswap has a smaller lock scope
> than zswap with a reduced callpath?
The whole series is too zswap-centric and punishes other swap backends.
>
> >
> > Using SSD swap as the backend:
> >
> > Baseline:
> > real: mean: 200.3s, stdev: 2.33s
> > sys: mean: 489.88s, stdev: 9.62s
> >
> > Vswap:
> > real: mean: 201.47s, stdev: 2.98s
> > sys: mean: 487.36s, stdev: 5.53s
> >
> > The performance is neck-to-neck.
>
> Thanks for the bench, but please also test with global pressure too.
> One mistake I made when working on the prototype of swap tables was
> only focusing on cgroup memory pressure, which is really not how
> everyone uses Linux, and that's why I reworked it for a long time to
> tweak the RCU allocation / freeing of swap table pages so there won't
> be any regression even for lowend and global pressure. That's kind of
> critical for devices like Android.
>
> I did an overnight bench on this with global pressure, comparing to
> mainline 6.19 and swap table p3 (I do include such test for each swap
> table series, p2 / p3 is close so I just rebased the latest p3 on top of
> your base commit just to be fair and that's easier for me too) and it
> doesn't look that good.
>
> Test machine setup for vm-scalability:
> # lscpu | grep "Model name"
> Model name: AMD EPYC 7K62 48-Core Processor
>
> # free -m
> total used free shared buff/cache available
> Mem: 31582 909 26388 8 4284 29989
> Swap: 40959 41 40918
>
> The swap setup follows the recommendation from Huang
> (https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).
>
> Test (average of 18 test run):
> vm-scalability/usemem --init-time -O -y -x -n 1 56G
>
> 6.19:
> Throughput: 618.49 MB/s (stdev 31.3)
> Free latency: 5754780.50us (stdev 69542.7)
>
> swap-table-p3 (3.8%, 0.5% better):
> Throughput: 642.02 MB/s (stdev 25.1)
> Free latency: 5728544.16us (stdev 48592.51)
>
> vswap (3.2%, 244% worse):
Now that is a deal breaker for me - not the roughly similar performance
with baseline or swap table P3.
> Throughput: 598.67 MB/s (stdev 25.1)
> Free latency: 13987175.66us (stdev 125148.57)
>
> That's a huge regression with freeing. I have a vm-scalability test
> matrix, not every setup has such significant >200% regression, but on
> average the freeing time is about at least 15 - 50% slower (for
> example /data/vm-scalability/usemem --init-time -O -y -x -n 32 1536M
> the regression is about 2583221.62us vs 2153735.59us). Throughput is
> all lower too.
>
> Freeing is important as it was causing many problems before, it's the
> reason why we had a swap slot freeing cache years ago (and later we
> removed that since the freeing cache causes more problems and swap
> allocator already improved it better than having the cache). People
> even tried to optimize that:
> https://lore.kernel.org/linux-mm/20250909065349.574894-1-liulei.rjpt@vivo.com/
> (This seems to be an already fixed downstream issue, solved by swap allocator
> or swap table). Some workloads might amplify the free latency greatly
> and cause serious lags as shown above.
>
> Another thing I personally care about is how swap works on my daily
> laptop :), building the kernel in a 2G test VM using NVME as swap,
> which is a very practical workload I do everyday, the result is also
> not good (average of 8 test run, make -j12):
> #free -m
> total used free shared buff/cache available
> Mem: 1465 216 1026 0 300 1248
> Swap: 4095 36 4059
>
> 6.19 systime:
> 109.6s
> swap-table p3:
> 108.9s
> vswap systime:
> 118.7s
>
> On a build server, it's also slower (make -j48 with 4G memory VM and
> NVME swap, average of 10 testrun):
> # free -m
> total used free shared buff/cache available
> Mem: 3877 1444 2019 737 1376 2432
> Swap: 32767 1886 30881
>
> # lscpu | grep "Model name"
> Model name: Intel(R) Xeon(R) Platinum
> 8255C CPU @ 2.50GHz
>
> 6.19 systime:
> 435.601s
> swap-table p3:
> 432.793s
> vswap systime:
> 455.652s
>
> In conclusion it's about 4.3 - 8.3% slower for common workloads under
At 4-8% I would consider it a statistically significant performance
regression, in favor of the swap table implementations.
> global pressure, and there is an up to 200% regression on freeing. ZRAM
> shows an even larger workload regression but I'll skip that part since
> your series is focusing on zswap now. Redis is also ~20% slower
> compared to mm-stable (327515.00 RPS vs 405827.81 RPS), that's mostly
> due to swap-table-p2 in mm-stable so I didn't do further comparisons.
>
> So if that's not a bug with this series, I think the double free or
> decoupling of swap / underlying slots might be the problem with the
> freeing regression shown above. That's really a serious issue, and the
> global pressure might be a critical issue too as the metadata is much
> larger, and is already causing regressions for very common workloads.
> Low end users could hit the min watermark easily and could have
> serious jitters or allocation failures.
>
> That's part of the issue I've found, so I really do think we need a
> flexible way to implement that and not have a mandatory layer. After
> swap table P4 we should be able to figure out a way to fit all needs,
> with a clean defined set of swap API, metadata and layers, as was
> discussed at LSFMM last year.
Agree. That matches my view: get the fundamental swap infrastructure
right first (the swap table), then do the fancier feature enhancements
like online growing of the swapfile size.
Chris
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 21:24 ` Chris Li
@ 2026-02-10 23:01 ` Johannes Weiner
0 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2026-02-10 23:01 UTC (permalink / raw)
To: Chris Li
Cc: Nhat Pham, akpm, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
huang.ying.caritas, ryan.roberts, shikemeng, viro, baohua, bhe,
osalvador, christophe.leroy, pavel, linux-mm, kernel-team,
linux-kernel, cgroups, linux-pm, peterx, riel, joshua.hahnjy,
npache, gourry, axelrasmussen, yuanchu, weixugc, rafael, jannh,
pfalcato, zhengqi.arch
Hi Chris,
On Tue, Feb 10, 2026 at 01:24:03PM -0800, Chris Li wrote:
> Hi Johannes,
> On Mon, Feb 9, 2026 at 6:36 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Here is the more detailed breakdown:
>
> It seems you did not finish your sentence before sending your reply.
I did. I trimmed the quote of Nhat's cover letter to the parts
addressing your questions. If you use gmail, click the three dots:
> > > > The size of the virtual swap descriptor is 24 bytes. Note that this is
> > > > not all "new" overhead, as the swap descriptor will replace:
> > > > * the swap_cgroup arrays (one per swap type) in the old design, which
> > > > is a massive source of static memory overhead. With the new design,
> > > > it is only allocated for used clusters.
> > > > * the swap tables, which holds the swap cache and workingset shadows.
> > > > * the zeromap bitmap, which is a bitmap of physical swap slots to
> > > > indicate whether the swapped out page is zero-filled or not.
> > > > * huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
> > > > one for allocated slots, and one for bad slots, representing 3 possible
> > > > states of a slot on the swapfile: allocated, free, and bad.
> > > > * the zswap tree.
> > > >
> > > > So, in terms of additional memory overhead:
> > > > * For zswap entries, the added memory overhead is rather minimal. The
> > > > new indirection pointer neatly replaces the existing zswap tree.
> > > > We really only incur less than one word of overhead for swap count
> > > > blow up (since we no longer use swap continuation) and the swap type.
> > > > * For physical swap entries, the new design will impose fewer than 3 words
> > > > memory overhead. However, as noted above this overhead is only for
> > > > actively used swap entries, whereas in the current design the overhead is
> > > > static (including the swap cgroup array for example).
> > > >
> > > > The primary victim of this overhead will be zram users. However, as
> > > > zswap now no longer takes up disk space, zram users can consider
> > > > switching to zswap (which, as a bonus, has a lot of useful features
> > > > out of the box, such as cgroup tracking, dynamic zswap pool sizing,
> > > > LRU-ordering writeback, etc.).
> > > >
> > > > For a more concrete example, suppose we have a 32 GB swapfile (i.e.
> > > > 8,388,608 swap entries), and we use zswap.
> > > >
> > > > 0% usage, or 0 entries: 0.00 MB
> > > > * Old design total overhead: 25.00 MB
> > > > * Vswap total overhead: 0.00 MB
> > > >
> > > > 25% usage, or 2,097,152 entries:
> > > > * Old design total overhead: 57.00 MB
> > > > * Vswap total overhead: 48.25 MB
> > > >
> > > > 50% usage, or 4,194,304 entries:
> > > > * Old design total overhead: 89.00 MB
> > > > * Vswap total overhead: 96.50 MB
> > > >
> > > > 75% usage, or 6,291,456 entries:
> > > > * Old design total overhead: 121.00 MB
> > > > * Vswap total overhead: 144.75 MB
> > > >
> > > > 100% usage, or 8,388,608 entries:
> > > > * Old design total overhead: 153.00 MB
> > > > * Vswap total overhead: 193.00 MB
> > > >
> > > > So even in the worst case scenario for virtual swap, i.e when we
> > > > somehow have an oracle to correctly size the swapfile for zswap
> > > > pool to 32 GB, the added overhead is only 40 MB, which is a mere
> > > > 0.12% of the total swapfile :)
> > > >
> > > > In practice, the overhead will be closer to the 50-75% usage case, as
> > > > systems tend to leave swap headroom for pathological events or sudden
> > > > spikes in memory requirements. The added overhead in these cases is
> > > > practically negligible. And in deployments where swapfiles for zswap
> > > > are previously sparsely used, switching over to virtual swap will
> > > > actually reduce memory overhead.
> > > >
> > > > Doing the same math for the disk swap, which is the worst case for
> > > > virtual swap in terms of swap backends:
> > > >
> > > > 0% usage, or 0 entries: 0.00 MB
> > > > * Old design total overhead: 25.00 MB
> > > > * Vswap total overhead: 2.00 MB
> > > >
> > > > 25% usage, or 2,097,152 entries:
> > > > * Old design total overhead: 41.00 MB
> > > > * Vswap total overhead: 66.25 MB
> > > >
> > > > 50% usage, or 4,194,304 entries:
> > > > * Old design total overhead: 57.00 MB
> > > > * Vswap total overhead: 130.50 MB
> > > >
> > > > 75% usage, or 6,291,456 entries:
> > > > * Old design total overhead: 73.00 MB
> > > > * Vswap total overhead: 194.75 MB
> > > >
> > > > 100% usage, or 8,388,608 entries:
> > > > * Old design total overhead: 89.00 MB
> > > > * Vswap total overhead: 259.00 MB
> > > >
> > > > The added overhead is 170MB, which is 0.5% of the total swapfile size,
> > > > again in the worst case when we have a sizing oracle.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 18:00 ` Nhat Pham
@ 2026-02-10 23:17 ` Chris Li
0 siblings, 0 replies; 52+ messages in thread
From: Chris Li @ 2026-02-10 23:17 UTC (permalink / raw)
To: Nhat Pham
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
huang.ying.caritas, ryan.roberts, shikemeng, viro, baohua, bhe,
osalvador, christophe.leroy, pavel, linux-mm, kernel-team,
linux-kernel, cgroups, linux-pm, peterx, riel, joshua.hahnjy,
npache, gourry, axelrasmussen, yuanchu, weixugc, rafael, jannh,
pfalcato, zhengqi.arch
[-- Attachment #1: Type: text/plain, Size: 7726 bytes --]
On Tue, Feb 10, 2026 at 10:00 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Feb 9, 2026 at 4:20 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Sun, Feb 8, 2026 at 4:15 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > My sincerest apologies - it seems like the cover letter (and just the
> > > cover letter) fails to be sent out, for some reason. I'm trying to figure
> > > out what happened - it works when I send the entire patch series to
> > > myself...
> > >
> > > Anyway, resending this (in-reply-to patch 1 of the series):
> >
> > For the record I did receive your original V3 cover letter from the
> > linux-mm mailing list.
>
> I have no idea what happened to be honest. It did not show up on lore
> for a couple of hours, and my coworkers did not receive the cover
> letter email initially. I did not receive any error message or logs
> either - git send-email returns Success to me, and when I checked on
> the web gmail client (since I used a gmail email account), the whole
> series is there.
>
> I tried re-sending a couple times, to no avail. Then, in a couple of
> hours, all of these attempts showed up.
>
> Anyway, this is my bad - I'll be more patient next time. If it does
> not show up for a couple of hours then I'll do some more digging.
No problem. Just want to provide more data points if that helps you
debug your email issue.
> > > Changelog:
> > > * RFC v2 -> v3:
> > > * Implement a cluster-based allocation algorithm for virtual swap
> > > slots, inspired by Kairui Song and Chris Li's implementation, as
> > > well as Johannes Weiner's suggestions. This eliminates the lock
> > > contention issues on the virtual swap layer.
> > > * Re-use swap table for the reverse mapping.
> > > * Remove CONFIG_VIRTUAL_SWAP.
> > > * Reducing the size of the swap descriptor from 48 bytes to 24
> >
> > Is the per swap slot entry overhead 24 bytes in your implementation?
> > The current swap overhead is 3 static +8 dynamic, your 24 dynamic is a
> > big jump. You can argue that 8->24 is not a big jump . But it is an
> > unnecessary price compared to the alternatives, which is 8 dynamic +
> > 4(optional redirect).
>
> It depends in cases - you can check the memory overhead discussion below :)
I think the "24B dynamic" sums up the VS memory overhead pretty well
without going into the detail tables. You can drive from case
discussion from that.
> > BTW, I have the following compile error with this series (fedora 43).
> > Same config compile fine on v6.19.
> >
> > In file included from ./include/linux/local_lock.h:5,
> > from ./include/linux/mmzone.h:24,
> > from ./include/linux/gfp.h:7,
> > from ./include/linux/mm.h:7,
> > from mm/vswap.c:7:
> > mm/vswap.c: In function ‘vswap_cpu_dead’:
> > ./include/linux/percpu-defs.h:221:45: error: initialization from
> > pointer to non-enclosed address space
> > 221 | const void __percpu *__vpp_verify = (typeof((ptr) +
> > 0))NULL; \
> > | ^
> > ./include/linux/local_lock_internal.h:105:40: note: in definition of
> > macro ‘__local_lock_acquire’
> > 105 | __l = (local_lock_t *)(lock);
> > \
> > | ^~~~
> > ./include/linux/local_lock.h:17:41: note: in expansion of macro
> > ‘__local_lock’
> > 17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
> > | ^~~~~~~~~~~~
> > ./include/linux/percpu-defs.h:245:9: note: in expansion of macro
> > ‘__verify_pcpu_ptr’
> > 245 | __verify_pcpu_ptr(ptr);
> > \
> > | ^~~~~~~~~~~~~~~~~
> > ./include/linux/percpu-defs.h:256:27: note: in expansion of macro ‘raw_cpu_ptr’
> > 256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
> > | ^~~~~~~~~~~
> > ./include/linux/local_lock.h:17:54: note: in expansion of macro
> > ‘this_cpu_ptr’
> > 17 | #define local_lock(lock)
> > __local_lock(this_cpu_ptr(lock))
> > |
> > ^~~~~~~~~~~~
> > mm/vswap.c:1518:9: note: in expansion of macro ‘local_lock’
> > 1518 | local_lock(&percpu_cluster->lock);
> > | ^~~~~~~~~~
>
> Ah that's strange. It compiled on all of my setups (I tested with a couple
> different ones), but I must have missed some cases. Would you mind
> sharing your configs so that I can reproduce this compilation error?
See attached config.gz. It is also possible the newer gcc version
contributes to that error. Anyway, that is preventing me from stress
testing your series on my setup.
>
> >
> > > 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
> > >
> > > Using zswap as the backend:
> > >
> > > Baseline:
> > > real: mean: 185.2s, stdev: 0.93s
> > > sys: mean: 683.7s, stdev: 33.77s
> > >
> > > Vswap:
> > > real: mean: 184.88s, stdev: 0.57s
> > > sys: mean: 675.14s, stdev: 32.8s
> >
> > Can you show your user space time as well to complete the picture?
>
> Will do next time! I used to include user time as well, but I noticed
> that folks (for e.g see [1]) only include systime, not even real time,
> so I figure nobody cares about user time :)
>
> (I still include real time because some of my past work improves sys
> time but regresses real time, so I figure that's relevant).
>
> [1]: https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-0-fe0b67ef0215@tencent.com/
>
> But yeah no big deal. I'll dig through my logs to see if I still have
> the numbers, but if not I'll include it in next version.
Mostly I want to get an impression of how hard you push swap in your test cases.
>
> >
> > How many runs do you have for stdev 32.8s?
>
> 5 runs! I average out the result of 5 runs.
The stddev is 33 seconds. Measuring 5 times and averaging is not enough
samples to get you to 1.5% resolution (8 seconds), which falls within
the range of noise.
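To put a rough number on it (treating the five runs per kernel as
independent samples; a back-of-the-envelope estimate, not a formal
test):

  \mathrm{SE}_{\Delta} = \sqrt{\frac{s_1^2 + s_2^2}{n}}
                       = \sqrt{\frac{33.77^2 + 32.8^2}{5}} \approx 21\ \mathrm{s}

so the ~8.6 s systime difference between the two means is well within
one standard error of that difference.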
> > I strongly suspect there is some performance difference that hasn't
> > been covered by your test yet. Need more conformation by others on the
> > performance measurement. The swap testing is tricky. You want to push
> > to stress barely within the OOM limit. Need more data.
>
> Very fair point :) I will say though - the kernel build test, with
> memory.max limit set, does generate a sizable amount of swapping, and
> does OOM if you don't set up swap. Take my word for now, but I will
> try to include average per-run (z)swap activity stats (zswpout zswpin
> etc.) in future versions if you're interested :)
Including the user space time will help determine the level of swap
pressure as well. I don't need the absolute zswpout count just yet.
> I've been trying to run more stress tests to trigger crashes and
> performance regression. One of the big reasons why I haven't sent
> anything til now is to fix obvious performance issues (the
> aforementioned lock contention) and bugs. It's a complicated piece of
> work.
>
> As always, would love to receive code/design feedback from you (and
> Kairui, and other swap reviewers), and I would appreciate very much if
> other swap folks can play with the patch series on their setup as well
> for performance testing, or let me know if there is any particular
> case that they're interested in :)
I understand Kairui has some measurements that show regressions.
If you can fix the compile error I can do some stress testing myself
to provide more data points.
Thanks
Chris
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 41914 bytes --]
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page
2026-02-08 21:58 ` [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
2026-02-09 17:12 ` kernel test robot
@ 2026-02-11 13:42 ` kernel test robot
1 sibling, 0 replies; 52+ messages in thread
From: kernel test robot @ 2026-02-11 13:42 UTC (permalink / raw)
To: Nhat Pham, linux-mm
Cc: oe-kbuild-all, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy
Hi Nhat,
kernel test robot noticed the following build errors:
[auto build test ERROR on 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b]
url: https://github.com/intel-lab-lkp/linux/commits/Nhat-Pham/mm-swap-decouple-swap-cache-from-physical-swap-infrastructure/20260209-120606
base: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
patch link: https://lore.kernel.org/r/20260208215839.87595-10-nphamcs%40gmail.com
patch subject: [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20260211/202602111445.rP38hmwx-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260211/202602111445.rP38hmwx-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602111445.rP38hmwx-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from include/linux/local_lock.h:5,
from include/linux/mmzone.h:24,
from include/linux/gfp.h:7,
from include/linux/mm.h:7,
from mm/vswap.c:7:
mm/vswap.c: In function 'vswap_cpu_dead':
>> include/linux/percpu-defs.h:221:45: error: initialization from pointer to non-enclosed address space
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:105:40: note: in definition of macro '__local_lock_acquire'
105 | __l = (local_lock_t *)(lock); \
| ^~~~
include/linux/local_lock.h:17:41: note: in expansion of macro '__local_lock'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:17:54: note: in expansion of macro 'this_cpu_ptr'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:653:9: note: in expansion of macro 'local_lock'
653 | local_lock(&percpu_cluster->lock);
| ^~~~~~~~~~
include/linux/percpu-defs.h:221:45: note: expected 'const __seg_gs void *' but pointer is of type 'local_lock_t *'
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:105:40: note: in definition of macro '__local_lock_acquire'
105 | __l = (local_lock_t *)(lock); \
| ^~~~
include/linux/local_lock.h:17:41: note: in expansion of macro '__local_lock'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:17:54: note: in expansion of macro 'this_cpu_ptr'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:653:9: note: in expansion of macro 'local_lock'
653 | local_lock(&percpu_cluster->lock);
| ^~~~~~~~~~
>> include/linux/percpu-defs.h:221:45: error: initialization from pointer to non-enclosed address space
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:107:27: note: in definition of macro '__local_lock_acquire'
107 | _Generic((lock), \
| ^~~~
include/linux/local_lock.h:17:41: note: in expansion of macro '__local_lock'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:17:54: note: in expansion of macro 'this_cpu_ptr'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:653:9: note: in expansion of macro 'local_lock'
653 | local_lock(&percpu_cluster->lock);
| ^~~~~~~~~~
include/linux/percpu-defs.h:221:45: note: expected 'const __seg_gs void *' but pointer is of type 'local_lock_t *'
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:107:27: note: in definition of macro '__local_lock_acquire'
107 | _Generic((lock), \
| ^~~~
include/linux/local_lock.h:17:41: note: in expansion of macro '__local_lock'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:17:54: note: in expansion of macro 'this_cpu_ptr'
17 | #define local_lock(lock) __local_lock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:653:9: note: in expansion of macro 'local_lock'
653 | local_lock(&percpu_cluster->lock);
| ^~~~~~~~~~
>> include/linux/percpu-defs.h:221:45: error: initialization from pointer to non-enclosed address space
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:176:40: note: in definition of macro '__local_lock_release'
176 | __l = (local_lock_t *)(lock); \
| ^~~~
include/linux/local_lock.h:38:41: note: in expansion of macro '__local_unlock'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:38:56: note: in expansion of macro 'this_cpu_ptr'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:665:9: note: in expansion of macro 'local_unlock'
665 | local_unlock(&percpu_cluster->lock);
| ^~~~~~~~~~~~
include/linux/percpu-defs.h:221:45: note: expected 'const __seg_gs void *' but pointer is of type 'local_lock_t *'
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:176:40: note: in definition of macro '__local_lock_release'
176 | __l = (local_lock_t *)(lock); \
| ^~~~
include/linux/local_lock.h:38:41: note: in expansion of macro '__local_unlock'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:38:56: note: in expansion of macro 'this_cpu_ptr'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:665:9: note: in expansion of macro 'local_unlock'
665 | local_unlock(&percpu_cluster->lock);
| ^~~~~~~~~~~~
>> include/linux/percpu-defs.h:221:45: error: initialization from pointer to non-enclosed address space
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:179:27: note: in definition of macro '__local_lock_release'
179 | _Generic((lock), \
| ^~~~
include/linux/local_lock.h:38:41: note: in expansion of macro '__local_unlock'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:38:56: note: in expansion of macro 'this_cpu_ptr'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:665:9: note: in expansion of macro 'local_unlock'
665 | local_unlock(&percpu_cluster->lock);
| ^~~~~~~~~~~~
include/linux/percpu-defs.h:221:45: note: expected 'const __seg_gs void *' but pointer is of type 'local_lock_t *'
221 | const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
| ^
include/linux/local_lock_internal.h:179:27: note: in definition of macro '__local_lock_release'
179 | _Generic((lock), \
| ^~~~
include/linux/local_lock.h:38:41: note: in expansion of macro '__local_unlock'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~~~
include/linux/percpu-defs.h:245:9: note: in expansion of macro '__verify_pcpu_ptr'
245 | __verify_pcpu_ptr(ptr); \
| ^~~~~~~~~~~~~~~~~
include/linux/percpu-defs.h:256:27: note: in expansion of macro 'raw_cpu_ptr'
256 | #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
| ^~~~~~~~~~~
include/linux/local_lock.h:38:56: note: in expansion of macro 'this_cpu_ptr'
38 | #define local_unlock(lock) __local_unlock(this_cpu_ptr(lock))
| ^~~~~~~~~~~~
mm/vswap.c:665:9: note: in expansion of macro 'local_unlock'
665 | local_unlock(&percpu_cluster->lock);
| ^~~~~~~~~~~~
vim +221 include/linux/percpu-defs.h
62fde54123fb64 Tejun Heo 2014-06-17 207
9c28278a24c01c Tejun Heo 2014-06-17 208 /*
6fbc07bbe2b5a8 Tejun Heo 2014-06-17 209 * __verify_pcpu_ptr() verifies @ptr is a percpu pointer without evaluating
6fbc07bbe2b5a8 Tejun Heo 2014-06-17 210 * @ptr and is invoked once before a percpu area is accessed by all
6fbc07bbe2b5a8 Tejun Heo 2014-06-17 211 * accessors and operations. This is performed in the generic part of
6fbc07bbe2b5a8 Tejun Heo 2014-06-17 212 * percpu and arch overrides don't need to worry about it; however, if an
6fbc07bbe2b5a8 Tejun Heo 2014-06-17 213 * arch wants to implement an arch-specific percpu accessor or operation,
6fbc07bbe2b5a8 Tejun Heo 2014-06-17 214 * it may use __verify_pcpu_ptr() to verify the parameters.
9c28278a24c01c Tejun Heo 2014-06-17 215 *
9c28278a24c01c Tejun Heo 2014-06-17 216 * + 0 is required in order to convert the pointer type from a
9c28278a24c01c Tejun Heo 2014-06-17 217 * potential array type to a pointer to a single item of the array.
9c28278a24c01c Tejun Heo 2014-06-17 218 */
eba117889ac444 Tejun Heo 2014-06-17 219 #define __verify_pcpu_ptr(ptr) \
eba117889ac444 Tejun Heo 2014-06-17 220 do { \
9c28278a24c01c Tejun Heo 2014-06-17 @221 const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
9c28278a24c01c Tejun Heo 2014-06-17 222 (void)__vpp_verify; \
9c28278a24c01c Tejun Heo 2014-06-17 223 } while (0)
9c28278a24c01c Tejun Heo 2014-06-17 224
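For what it's worth, this error pattern typically means a pointer that
has already been resolved with this_cpu_ptr() is being handed to
local_lock(), which expects the address of a __percpu variable so it can
do the this_cpu_ptr() itself. A sketch of the failing versus working
shape, with made-up struct and variable names (not the actual
mm/vswap.c declarations):

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct vswap_pcp {
	local_lock_t	lock;
	unsigned int	offset;
};
static DEFINE_PER_CPU(struct vswap_pcp, vswap_pcp) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void broken(void)
{
	struct vswap_pcp *pcp = this_cpu_ptr(&vswap_pcp);

	/* &pcp->lock is a plain pointer, not __percpu, so with named
	 * address spaces (__seg_gs) __verify_pcpu_ptr() rejects it;
	 * this reproduces the error above. */
	local_lock(&pcp->lock);
	local_unlock(&pcp->lock);
}

static void working(void)
{
	/* Pass the __percpu address; local_lock() resolves the
	 * this-CPU instance internally. */
	local_lock(&vswap_pcp.lock);
	local_unlock(&vswap_pcp.lock);
}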
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 19:11 ` Nhat Pham
2026-02-10 19:23 ` Nhat Pham
@ 2026-02-12 5:07 ` Chris Li
2026-02-17 23:36 ` Nhat Pham
2 siblings, 0 replies; 52+ messages in thread
From: Chris Li @ 2026-02-12 5:07 UTC (permalink / raw)
To: Nhat Pham
Cc: Kairui Song, linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, huang.ying.caritas, ryan.roberts, shikemeng,
viro, baohua, bhe, osalvador, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch, minchan
On Tue, Feb 10, 2026 at 11:11 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Feb 10, 2026 at 10:00 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Anyway, resending this (in-reply-to patch 1 of the series):
> >
> > Hi Nhat,
> >
> > > Changelog:
> > > * RFC v2 -> v3:
> > > * Implement a cluster-based allocation algorithm for virtual swap
> > > slots, inspired by Kairui Song and Chris Li's implementation, as
> > > well as Johannes Weiner's suggestions. This eliminates the lock
> > > contention issues on the virtual swap layer.
> > > * Re-use swap table for the reverse mapping.
> > > * Remove CONFIG_VIRTUAL_SWAP.
> >
> > I really do think we better make this optional, not a replacement or
> > mandatory. There are many hard to evaluate effects as this
> > fundamentally changes the swap workflow with a lot of behavior changes
> > at once. e.g. it seems the folio will be reactivated instead of
> > split if the physical swap device is fragmented; slot is allocated
> > at IO and not at unmap, and maybe many others. Just like zswap is
> > optional. Some common workloads would see an obvious performance or
> > memory usage regression following this design, see below.
>
> Ideally, if we can close the performance gap and have only one
> version, then that would be the best :)
>
> The problem with making it optional, or maintaining effectively two swap
> implementations, is that it will make the patch series unreadable and
> unreviewable, and the code base unmaintainable :) You'll have x2 the
> amount of code to reason about and test, much more merge conflicts at
> rebase and cherry-pick time. And any improvement to one version takes
> extra work to graft onto the other version.
I second that this should be runtime-optional for other types of swap.
It should not be mandatory for swap backends that do not benefit from
it, e.g. zram.
Chris
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-08 22:51 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
@ 2026-02-12 12:23 ` David Hildenbrand (Arm)
2026-02-12 17:29 ` Nhat Pham
0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-12 12:23 UTC (permalink / raw)
To: Nhat Pham, linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, shikemeng, viro,
baohua, bhe, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
On 2/8/26 23:51, Nhat Pham wrote:
> On Sun, Feb 8, 2026 at 1:58 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>
>> Changelog:
>> * RFC v2 -> v3:
>> * Implement a cluster-based allocation algorithm for virtual swap
>> slots, inspired by Kairui Song and Chris Li's implementation, as
>> well as Johannes Weiner's suggestions. This eliminates the lock
>> contention issues on the virtual swap layer.
>> * Re-use swap table for the reverse mapping.
>> * Remove CONFIG_VIRTUAL_SWAP.
>> * Reducing the size of the swap descriptor from 48 bytes to 24
>> bytes, i.e another 50% reduction in memory overhead from v2.
>> * Remove swap cache and zswap tree and use the swap descriptor
>> for this.
>> * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
>> (one for allocated slots, and one for bad slots).
>> * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449)
>> * Update cover letter to include new benchmark results and discussion
>> on overhead in various cases.
>> * RFC v1 -> RFC v2:
>> * Use a single atomic type (swap_refs) for reference counting
>> purpose. This brings the size of the swap descriptor from 64 B
>> down to 48 B (25% reduction). Suggested by Yosry Ahmed.
>> * Zeromap bitmap is removed in the virtual swap implementation.
>> This saves one bit per phyiscal swapfile slot.
>> * Rearrange the patches and the code change to make things more
>> reviewable. Suggested by Johannes Weiner.
>> * Update the cover letter a bit.
>>
>> This patch series implements the virtual swap space idea, based on Yosry's
>> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
>> inputs from Johannes Weiner. The same idea (with different
>> implementation details) has been floated by Rik van Riel since at least
>> 2011 (see [8]).
>>
>> This patch series is based on 6.19. There are a couple more
>> swap-related changes in the mm-stable branch that I would need to
>> coordinate with, but I would like to send this out as an update, to show
>> that the lock contention issues that plagued earlier versions have been
>> resolved and performance on the kernel build benchmark is now on-par with
>> baseline. Furthermore, memory overhead has been substantially reduced
>> compared to the last RFC version.
>>
>>
>> I. Motivation
>>
>> Currently, when an anon page is swapped out, a slot in a backing swap
>> device is allocated and stored in the page table entries that refer to
>> the original page. This slot is also used as the "key" to find the
>> swapped out content, as well as the index to swap data structures, such
>> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
>> backing slot in this way is performant and efficient when swap is purely
>> just disk space, and swapoff is rare.
>>
>> However, the advent of many swap optimizations has exposed major
>> drawbacks of this design. The first problem is that we occupy a physical
>> slot in the swap space, even for pages that are NEVER expected to hit
>> the disk: pages compressed and stored in the zswap pool, zero-filled
>> pages, or pages rejected by both of these optimizations when zswap
>> writeback is disabled. This is the arguably central shortcoming of
>> zswap:
>> * In deployments when no disk space can be afforded for swap (such as
>> mobile and embedded devices), users cannot adopt zswap, and are forced
>> to use zram. This is confusing for users, and creates extra burdens
>> for developers, having to develop and maintain similar features for
>> two separate swap backends (writeback, cgroup charging, THP support,
>> etc.). For instance, see the discussion in [4].
>> * Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
>> we have swapfile in the order of tens to hundreds of GBs, which are
>> mostly unused and only exist to enable zswap usage and zero-filled
>> pages swap optimizations.
>> * Tying zswap (and more generally, other in-memory swap backends) to
>> the current physical swapfile infrastructure makes zswap implicitly
>> statically sized. This does not make sense, as unlike disk swap, in
>> which we consume a limited resource (disk space or swapfile space) to
>> save another resource (memory), zswap consumes the same resource it is
>> saving (memory). The more we zswap, the more memory we have available,
>> not less. We are not rationing a limited resource when we limit
>> the size of the zswap pool, but rather we are capping the resource
>> (memory) saving potential of zswap. Under memory pressure, using
>> more zswap is almost always better than the alternative (disk IOs, or
>> even worse, OOMs), and dynamically sizing the zswap pool on demand
>> allows the system to flexibly respond to these precarious scenarios.
>> * Operationally, statically provisioning the swapfile for zswap poses
>> significant challenges, because the sysadmin has to prescribe how
>> much swap is needed a priori, for each combination of
>> (memory size x disk space x workload usage). It is even more
>> complicated when we take into account the variance of memory
>> compression, which changes the reclaim dynamics (and as a result,
>> swap space size requirement). The problem is further exacerbated for
>> users who rely on swap utilization (and exhaustion) as an OOM signal.
>>
>> All of these factors make it very difficult to configure the swapfile
>> for zswap: too small of a swapfile and we risk preventable OOMs and
>> limit the memory saving potentials of zswap; too big of a swapfile
>> and we waste disk space and memory due to swap metadata overhead.
>> This dilemma becomes more drastic in high memory systems, which can
>> have up to TBs worth of memory.
>>
>> Past attempts to decouple disk and compressed swap backends, namely the
>> ghost swapfile approach (see [13]), as well as the alternative
>> compressed swap backend zram, have mainly focused on eliminating the
>> disk space usage of compressed backends. We want a solution that not
>> only tackles that same problem, but also achieve the dyamicization of
>> swap space to maximize the memory saving potentials while reducing
>> operational and static memory overhead.
>>
>> Finally, any swap redesign should support efficient backend transfer,
>> i.e without having to perform the expensive page table walk to
>> update all the PTEs that refer to the swap entry:
>> * The main motivation for this requirement is zswap writeback. To quote
>> Johannes (from [14]): "Combining compression with disk swap is
>> extremely powerful, because it dramatically reduces the worst aspects
>> of both: it reduces the memory footprint of compression by shedding
>> the coldest data to disk; it reduces the IO latencies and flash wear
>> of disk swap through the writeback cache. In practice, this reduces
>> *average event rates of the entire reclaim/paging/IO stack*."
>> * Another motivation is to simplify swapoff, which is both complicated
>> and expensive in the current design, precisely because we are storing
>> an encoding of the backend positional information in the page table,
>> and removing these references thus requires a full page table walk.
>>
>>
>> II. High Level Design Overview
>>
>> To fix the aforementioned issues, we need an abstraction that separates
>> a swap entry from its physical backing storage. IOW, we need to
>> “virtualize” the swap space: swap clients will work with a dynamically
>> allocated virtual swap slot, storing it in page table entries, and
>> using it to index into various swap-related data structures. The
>> backing storage is decoupled from the virtual swap slot, and the newly
>> introduced layer will “resolve” the virtual swap slot to the actual
>> storage. This layer also manages other metadata of the swap entry, such
>> as its lifetime information (swap count), via a dynamically allocated,
>> per-swap-entry descriptor:
>>
>> struct swp_desc {
>> union {
>> swp_slot_t slot; /* 0 8 */
>> struct zswap_entry * zswap_entry; /* 0 8 */
>> }; /* 0 8 */
>> union {
>> struct folio * swap_cache; /* 8 8 */
>> void * shadow; /* 8 8 */
>> }; /* 8 8 */
>> unsigned int swap_count; /* 16 4 */
>> unsigned short memcgid:16; /* 20: 0 2 */
>> bool in_swapcache:1; /* 22: 0 1 */
>>
>> /* Bitfield combined with previous fields */
>>
>> enum swap_type type:2; /* 20:17 4 */
>>
>> /* size: 24, cachelines: 1, members: 6 */
>> /* bit_padding: 13 bits */
>> /* last cacheline: 24 bytes */
>> };
>>
>> (output from pahole).
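For illustration, here is a minimal userspace sketch of how a swap client
could resolve such a descriptor to one of the supported backends. The
VSWAP_* type names match the ones used later in the series' mm/vswap.c
changes; the struct layout and the resolve_backing() helper are simplified
stand-ins of mine, not code from the series.

#include <stdio.h>

enum swap_type { VSWAP_SWAPFILE, VSWAP_ZSWAP, VSWAP_ZERO, VSWAP_FOLIO };

struct swp_desc_sketch {
	union {
		unsigned long slot;	/* offset into a physical swapfile */
		void *zswap_entry;	/* compressed object in the zswap pool */
	};
	void *swap_cache;		/* in-memory folio, if any */
	unsigned int swap_count;
	enum swap_type type;
};

void resolve_backing(const struct swp_desc_sketch *desc)
{
	switch (desc->type) {
	case VSWAP_SWAPFILE:
		printf("read the page from physical slot %lu\n", desc->slot);
		break;
	case VSWAP_ZSWAP:
		printf("decompress zswap entry %p\n", desc->zswap_entry);
		break;
	case VSWAP_ZERO:
		printf("zero-fill the page; no backing store needed\n");
		break;
	case VSWAP_FOLIO:
		printf("copy from the in-memory folio at %p\n", desc->swap_cache);
		break;
	}
}

The point of the sketch is simply that page tables and swap data structures
only ever see the virtual slot; the backend is an internal detail of the
descriptor and can change without touching the PTEs.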
>>
>> This design allows us to:
>> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>> simply associate the virtual swap slot with one of the supported
>> backends: a zswap entry, a zero-filled swap page, a slot on the
>> swapfile, or an in-memory page.
>> * Simplify and optimize swapoff: we only have to fault the page in and
>> have the virtual swap slot point to the page instead of the on-disk
>> physical swap slot. No need to perform any page table walking.
>>
>> The size of the virtual swap descriptor is 24 bytes. Note that this is
>> not all "new" overhead, as the swap descriptor will replace:
>> * the swap_cgroup arrays (one per swap type) in the old design, which
>> are a massive source of static memory overhead. With the new design,
>> this information is only allocated for used clusters.
>> * the swap tables, which hold the swap cache and workingset shadows.
>> * the zeromap bitmap, which is a bitmap of physical swap slots to
>> indicate whether the swapped out page is zero-filled or not.
>> * a huge chunk of the swap_map. The swap_map is now replaced by 2 bitmaps,
>> one for allocated slots and one for bad slots, representing the 3 possible
>> states of a slot on the swapfile: allocated, free, and bad (see the sketch
>> after this list).
>> * the zswap tree.
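As referenced in the swap_map bullet above, here is a small userspace sketch
of how two per-slot bitmaps can encode the three slot states. The helper
names and layout are hypothetical, not the series' code:

#include <stdbool.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

enum slot_state { SLOT_FREE, SLOT_ALLOCATED, SLOT_BAD };

static inline bool test_slot_bit(const unsigned long *map, unsigned long off)
{
	return map[off / SKETCH_BITS_PER_LONG] &
	       (1UL << (off % SKETCH_BITS_PER_LONG));
}

static inline enum slot_state slot_state(const unsigned long *alloc_map,
					 const unsigned long *bad_map,
					 unsigned long off)
{
	/* a bad slot is never handed out, so check it first */
	if (test_slot_bit(bad_map, off))
		return SLOT_BAD;
	return test_slot_bit(alloc_map, off) ? SLOT_ALLOCATED : SLOT_FREE;
}

Two bits per slot are enough because the per-entry state that the old
swap_map bytemap tracked (swap counts, cache flags) now lives in the
virtual swap descriptor instead.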
>>
>> So, in terms of additional memory overhead:
>> * For zswap entries, the added memory overhead is rather minimal. The
>> new indirection pointer neatly replaces the existing zswap tree.
>> We really only incur less than one word of overhead for the swap count
>> blow-up (since we no longer use swap continuation) and the swap type.
>> * For physical swap entries, the new design will impose fewer than 3 words
>> of memory overhead per entry. However, as noted above, this overhead is only
>> for actively used swap entries, whereas in the current design the overhead
>> is static (including the swap cgroup array, for example).
>>
>> The primary victim of this overhead will be zram users. However, as
>> zswap now no longer takes up disk space, zram users can consider
>> switching to zswap (which, as a bonus, has a lot of useful features
>> out of the box, such as cgroup tracking, dynamic zswap pool sizing,
>> LRU-ordering writeback, etc.).
>>
>> For a more concrete example, suppose we have a 32 GB swapfile (i.e.
>> 8,388,608 swap entries), and we use zswap.
>>
>> 0% usage, or 0 entries: 0.00 MB
>> * Old design total overhead: 25.00 MB
>> * Vswap total overhead: 0.00 MB
>>
>> 25% usage, or 2,097,152 entries:
>> * Old design total overhead: 57.00 MB
>> * Vswap total overhead: 48.25 MB
>>
>> 50% usage, or 4,194,304 entries:
>> * Old design total overhead: 89.00 MB
>> * Vswap total overhead: 96.50 MB
>>
>> 75% usage, or 6,291,456 entries:
>> * Old design total overhead: 121.00 MB
>> * Vswap total overhead: 144.75 MB
>>
>> 100% usage, or 8,388,608 entries:
>> * Old design total overhead: 153.00 MB
>> * Vswap total overhead: 193.00 MB
>>
>> So even in the worst case scenario for virtual swap, i.e. when we
>> somehow have an oracle to correctly size the swapfile for the zswap
>> pool at 32 GB, the added overhead is only 40 MB, which is a mere
>> 0.12% of the total swapfile :)
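For what it's worth, the vswap column above follows almost entirely from the
24-byte descriptor being allocated per *used* entry. A quick
back-of-the-envelope check (userspace sketch of mine; ignoring the small
fixed remainder is an assumption, not a figure from the series):

#include <stdio.h>

int main(void)
{
	const double desc_bytes = 24.0;		/* size of struct swp_desc */
	const long total_entries = 8388608;	/* 32 GB of 4 KB pages */

	for (int pct = 0; pct <= 100; pct += 25) {
		long used = total_entries * pct / 100;
		printf("%3d%% usage: ~%.0f MB\n",
		       pct, used * desc_bytes / (1024 * 1024));
	}
	return 0;
}

This prints ~0/48/96/144/192 MB, i.e. the vswap figures above minus the
small remainder.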
>>
>> In practice, the overhead will be closer to the 50-75% usage case, as
>> systems tend to leave swap headroom for pathological events or sudden
>> spikes in memory requirements. The added overhead in these cases is
>> practically negligible. And in deployments where swapfiles for zswap
>> were previously sparsely used, switching over to virtual swap will
>> actually reduce memory overhead.
>>
>> Doing the same math for the disk swap, which is the worst case for
>> virtual swap in terms of swap backends:
>>
>> 0% usage, or 0 entries: 0.00 MB
>> * Old design total overhead: 25.00 MB
>> * Vswap total overhead: 2.00 MB
>>
>> 25% usage, or 2,097,152 entries:
>> * Old design total overhead: 41.00 MB
>> * Vswap total overhead: 66.25 MB
>>
>> 50% usage, or 4,194,304 entries:
>> * Old design total overhead: 57.00 MB
>> * Vswap total overhead: 130.50 MB
>>
>> 75% usage, or 6,291,456 entries:
>> * Old design total overhead: 73.00 MB
>> * Vswap total overhead: 194.75 MB
>>
>> 100% usage, or 8,388,608 entries:
>> * Old design total overhead: 89.00 MB
>> * Vswap total overhead: 259.00 MB
>>
>> The added overhead is 170 MB, which is 0.5% of the total swapfile size,
>> again in the worst case when we have a sizing oracle.
>>
>> Please see the attached patches for more implementation details.
>>
>>
>> III. Usage and Benchmarking
>>
>> This patch series introduces no new syscalls or userspace APIs. Existing
>> userspace setups will work as-is, except we no longer have to create a
>> swapfile or set memory.swap.max if we want to use zswap, as zswap is no
>> longer tied to physical swap. The zswap pool will be automatically and
>> dynamically sized based on memory usage and reclaim dynamics.
>>
>> To measure the performance of the new implementation, I have run the
>> following benchmarks:
>>
>> 1. Kernel building: 52 workers (one per processor), memory.max = 3G.
>>
>> Using zswap as the backend:
>>
>> Baseline:
>> real: mean: 185.2s, stdev: 0.93s
>> sys: mean: 683.7s, stdev: 33.77s
>>
>> Vswap:
>> real: mean: 184.88s, stdev: 0.57s
>> sys: mean: 675.14s, stdev: 32.8s
>>
>> We actually see a slight improvement in systime (by 1.5%) :) This is
>> likely because we no longer have to perform swap charging for zswap
>> entries, and the virtual swap allocator is simpler than that of physical
>> swap.
>>
>> Using SSD swap as the backend:
>>
>> Baseline:
>> real: mean: 200.3s, stdev: 2.33s
>> sys: mean: 489.88s, stdev: 9.62s
>>
>> Vswap:
>> real: mean: 201.47s, stdev: 2.98s
>> sys: mean: 487.36s, stdev: 5.53s
>>
>> The performance is neck and neck.
>>
>>
>> IV. Future Use Cases
>>
>> While the patch series focuses on two applications (decoupling swap
>> backends and swapoff optimization/simplification), this new,
>> future-proof design also allows us to implement new swap features more
>> easily and efficiently:
>>
>> * Multi-tier swapping (as mentioned in [5]), with transparent
>> transferring (promotion/demotion) of pages across tiers (see [8] and
>> [9]). Similar to swapoff, with the old design we would need to
>> perform the expensive page table walk.
>> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>> Huang in [6]).
>> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>> backing stores of a THP, you can dispatch each range of subpages
>> to the appropriate backend swapin handler.
>> * Swapping a folio out with discontiguous physical swap slots
>> (see [10]).
>> * Zswap writeback optimization: The current architecture pre-reserves
>> physical swap space for pages when they enter the zswap pool, giving
>> the kernel no flexibility at writeback time. With the virtual swap
>> implementation, the backends are decoupled, and physical swap space
>> is allocated on-demand at writeback time, at which point we can make
>> much smarter decisions: we can batch multiple zswap writeback
>> operations into a single IO request, allocating contiguous physical
>> swap slots for that request. We can even perform compressed writeback
>> (i.e. writing these pages without decompressing them) (see [12]).
>>
>>
>> V. References
>>
>> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
>> [2]: https://lwn.net/Articles/932077/
>> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
>> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
>> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
>> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
>> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
>> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
>> [10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
>> [11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
>> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
>> [13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
>> [14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
>>
>> Nhat Pham (20):
>> mm/swap: decouple swap cache from physical swap infrastructure
>> swap: rearrange the swap header file
>> mm: swap: add an abstract API for locking out swapoff
>> zswap: add new helpers for zswap entry operations
>> mm/swap: add a new function to check if a swap entry is in swap
>> cached.
>> mm: swap: add a separate type for physical swap slots
>> mm: create scaffolds for the new virtual swap implementation
>> zswap: prepare zswap for swap virtualization
>> mm: swap: allocate a virtual swap slot for each swapped out page
>> swap: move swap cache to virtual swap descriptor
>> zswap: move zswap entry management to the virtual swap descriptor
>> swap: implement the swap_cgroup API using virtual swap
>> swap: manage swap entry lifecycle at the virtual swap layer
>> mm: swap: decouple virtual swap slot from backing store
>> zswap: do not start zswap shrinker if there is no physical swap slots
>> swap: do not unnecesarily pin readahead swap entries
>> swapfile: remove zeromap bitmap
>> memcg: swap: only charge physical swap slots
>> swap: simplify swapoff using virtual swap
>> swapfile: replace the swap map with bitmaps
>>
>> Documentation/mm/swap-table.rst | 69 --
>> MAINTAINERS | 2 +
>> include/linux/cpuhotplug.h | 1 +
>> include/linux/mm_types.h | 16 +
>> include/linux/shmem_fs.h | 7 +-
>> include/linux/swap.h | 135 ++-
>> include/linux/swap_cgroup.h | 13 -
>> include/linux/swapops.h | 25 +
>> include/linux/zswap.h | 17 +-
>> kernel/power/swap.c | 6 +-
>> mm/Makefile | 5 +-
>> mm/huge_memory.c | 11 +-
>> mm/internal.h | 12 +-
>> mm/memcontrol-v1.c | 6 +
>> mm/memcontrol.c | 142 ++-
>> mm/memory.c | 101 +-
>> mm/migrate.c | 13 +-
>> mm/mincore.c | 15 +-
>> mm/page_io.c | 83 +-
>> mm/shmem.c | 215 +---
>> mm/swap.h | 157 +--
>> mm/swap_cgroup.c | 172 ---
>> mm/swap_state.c | 306 +----
>> mm/swap_table.h | 78 +-
>> mm/swapfile.c | 1518 ++++-------------------
>> mm/userfaultfd.c | 18 +-
>> mm/vmscan.c | 28 +-
>> mm/vswap.c | 2025 +++++++++++++++++++++++++++++++
>> mm/zswap.c | 142 +--
>> 29 files changed, 2853 insertions(+), 2485 deletions(-)
>> delete mode 100644 Documentation/mm/swap-table.rst
>> delete mode 100644 mm/swap_cgroup.c
>> create mode 100644 mm/vswap.c
>>
>>
>> base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
>> --
>> 2.47.3
>
> Weirdly, it seems like the cover letter (and only the cover letter) is
> not being delivered...
>
> I'm trying to figure out what's going on :( My apologies for the
> inconvenience...
>
Are you CCing all maintainers that get_maintainers.pl suggests you to cc?
--
Cheers,
David
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-12 12:23 ` David Hildenbrand (Arm)
@ 2026-02-12 17:29 ` Nhat Pham
2026-02-12 17:39 ` Nhat Pham
2026-02-12 17:41 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-12 17:29 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy, npache, gourry,
axelrasmussen, yuanchu, weixugc, rafael, jannh, pfalcato,
zhengqi.arch
On Thu, Feb 12, 2026 at 4:23 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>>
> Are you CCing all maintainers that get_maintainers.pl suggests you to cc?
>
> --
> Cheers,
>
> David
I hope so... did I miss someone? If so, my apologies - I manually add
them one at a time to be completely honest. The list is huge...
I'll probably use a script to convert that huge output next time into "--cc".
(Or are you suggesting I should not send it out to everyone? I can try
to trim the list, but tbh it touches areas that I'm not familiar with,
so I figure I should just cc everyone).
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-12 17:29 ` Nhat Pham
@ 2026-02-12 17:39 ` Nhat Pham
2026-02-12 20:11 ` David Hildenbrand (Arm)
2026-02-12 17:41 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 52+ messages in thread
From: Nhat Pham @ 2026-02-12 17:39 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy, npache, gourry,
axelrasmussen, yuanchu, weixugc, rafael, jannh, pfalcato,
zhengqi.arch, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan,
Michal Hocko, Jonathan Corbet, tglx, Peter Zijlstra, Baolin Wang,
lenb, Zi Yan, dev.jain, lance.yang, matthew.brost, rakie.kim,
byungchul, Huang, Ying, apopple, linux-doc
On Thu, Feb 12, 2026 at 9:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Feb 12, 2026 at 4:23 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >>
> > Are you CCing all maintainers that get_maintainers.pl suggests you to cc?
> >
> > --
> > Cheers,
> >
> > David
>
> I hope so... did I miss someone? If so, my apologies - I manually add
> them one at a time to be completely honest. The list is huge...
>
> I'll probably use a script to convert that huge output next time into "--cc".
>
Ok let's try... this :) Probably should have done it from the start,
but better late than never...
Not sure who was missing from the first run - my apologies if I did
that.... I'll be more careful with huge cc list next time and just
scriptify it.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-12 17:29 ` Nhat Pham
2026-02-12 17:39 ` Nhat Pham
@ 2026-02-12 17:41 ` David Hildenbrand (Arm)
2026-02-12 17:45 ` Nhat Pham
1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-12 17:41 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy, npache, gourry,
axelrasmussen, yuanchu, weixugc, rafael, jannh, pfalcato,
zhengqi.arch
On 2/12/26 18:29, Nhat Pham wrote:
> On Thu, Feb 12, 2026 at 4:23 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>>
>> Are you CCing all maintainers that get_maintainers.pl suggests you to cc?
>>
>> --
>> Cheers,
>>
>> David
>
> I hope so... did I miss someone? If so, my apologies - I manually add
> them one at a time to be completely honest. The list is huge...
>
I stumbled over this patch set while scrolling through the mailing list
after a while (now that my inbox is "mostly" cleaned up) and wondered
why no revision ended up in my inbox :)
> I'll probably use a script to convert that huge output next time into "--cc".
I usually add them as
Cc:
to the cover letter and then use something like "--cc-cover " with git
send-email.
--
Cheers,
David
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-12 17:41 ` David Hildenbrand (Arm)
@ 2026-02-12 17:45 ` Nhat Pham
0 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-12 17:45 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy, npache, gourry,
axelrasmussen, yuanchu, weixugc, rafael, jannh, pfalcato,
zhengqi.arch
On Thu, Feb 12, 2026 at 9:41 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 2/12/26 18:29, Nhat Pham wrote:
> > On Thu, Feb 12, 2026 at 4:23 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> >>>
> >> Are you CCing all maintainers that get_maintainers.pl suggests you to cc?
> >>
> >> --
> >> Cheers,
> >>
> >> David
> >
> > I hope so... did I miss someone? If so, my apologies - I manually add
> > them one at a time to be completely honest. The list is huge...
> >
>
> I stumbled over this patch set while scrolling through the mailing list
> after a while (now that my inbox is "mostly" cleaned up) and wondered
> why no revision ended in my inbox :)
>
> > I'll probably use a script to convert that huge output next time into "--cc".
>
> I usually add them as
>
> Cc:
>
> to the cover letter and then use something like "--cc-cover " with git
> send-email.
Oh TIL. Thanks, David!
Yeah this is the biggest patch series I've ever sent out. Most of my
past patches are contained in one or two files, so usually only the
maintainers and contributors are pulled in, and the cc list never
exceeds 15-20 cc's. So I've been getting away with just manually
preparing a send command, doing a quick eyeball check, then sending things
out.
That system breaks down hard in this case (the email debacle aside, which
I still haven't figured out - still looking at gmail as the prime
suspect...).
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-12 17:39 ` Nhat Pham
@ 2026-02-12 20:11 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-12 20:11 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, lorenzo.stoakes,
christophe.leroy, pavel, kernel-team, linux-kernel, cgroups,
linux-pm, peterx, riel, joshua.hahnjy, npache, gourry,
axelrasmussen, yuanchu, weixugc, rafael, jannh, pfalcato,
zhengqi.arch, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan,
Michal Hocko, Jonathan Corbet, tglx, Peter Zijlstra, Baolin Wang,
lenb, Zi Yan, dev.jain, lance.yang, matthew.brost, rakie.kim,
byungchul, Huang, Ying, apopple, linux-doc
On 2/12/26 18:39, Nhat Pham wrote:
> On Thu, Feb 12, 2026 at 9:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
>>
>> On Thu, Feb 12, 2026 at 4:23 AM David Hildenbrand (Arm)
>> <david@kernel.org> wrote:
>>> Are you CCing all maintainers that get_maintainers.pl suggests you to cc?
>>>
>>> --
>>> Cheers,
>>>
>>> David
>>
>> I hope so... did I miss someone? If so, my apologies - I manually add
>> them one at a time to be completely honest. The list is huge...
>>
>> I'll probably use a script to convert that huge output next time into "--cc".
>>
>
> Ok let's try... this :) Probably should have done it from the start,
> but better late than never...
>
It's usually not as easy as copying the output to the cover letter via Cc:.
Sometimes you want to CC all maintainers+reviewers of some subsystem,
sometimes only the maintainers (heads-up, mostly simplistic unrelated
changes that don't need any real subsystem-specific review).
Fine line between flooding people with patches or annoying people with
patches :)
> Not sure who was missing from the first run - my apologies if I did
> that.... I'll be more careful with huge cc list next time and just
> scriptify it.
No worries, I was just surprised to spot a v3 already!
--
Cheers,
David
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 00/20] Virtual Swap Space
2026-02-10 19:11 ` Nhat Pham
2026-02-10 19:23 ` Nhat Pham
2026-02-12 5:07 ` Chris Li
@ 2026-02-17 23:36 ` Nhat Pham
2 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-17 23:36 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts,
shikemeng, viro, baohua, bhe, osalvador, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, riel,
joshua.hahnjy, npache, gourry, axelrasmussen, yuanchu, weixugc,
rafael, jannh, pfalcato, zhengqi.arch
On Tue, Feb 10, 2026 at 11:11 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Feb 10, 2026 at 10:00 AM Kairui Song <ryncsn@gmail.com> wrote:
> > # free -m
> > total used free shared buff/cache available
> > Mem: 31582 909 26388 8 4284 29989
> > Swap: 40959 41 40918
> >
> > The swap setup follows the recommendation from Huang
> > (https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).
> >
> > Test (average of 18 test run):
> > vm-scalability/usemem --init-time -O -y -x -n 1 56G
> >
> > 6.19:
> > Throughput: 618.49 MB/s (stdev 31.3)
> > Free latency: 5754780.50us (stdev 69542.7)
> >
> > swap-table-p3 (3.8%, 0.5% better):
> > Throughput: 642.02 MB/s (stdev 25.1)
> > Free latency: 5728544.16us (stdev 48592.51)
> >
> > vswap (3.2%, 244% worse):
> > Throughput: 598.67 MB/s (stdev 25.1)
> > Free latency: 13987175.66us (stdev 125148.57)
> >
> > That's a huge regression with freeing. I have a vm-scalability test
> > matrix, not every setup has such significant >200% regression, but on
> > average the freeing time is about at least 15 - 50% slower (for
> > example /data/vm-scalability/usemem --init-time -O -y -x -n 32 1536M
> > the regression is about 2583221.62us vs 2153735.59us). Throughput is
> > all lower too.
Hi Kairui - a quick update.
Took me a while to get a host that matches your memory spec:
free -m
total used free shared buff/cache available
Mem: 31609 5778 7634 20 18664 25831
Swap: 65535 1 65534
I think I managed to reproduce your observations (average over 5 runs):
Baseline (6.19)
real: mean: 191.19s, stdev: 4.53s
user: mean: 46.98s, stdev: 0.15s
sys: mean: 127.97s, stdev: 3.95s
average throughput: 382057 KB/s
average free time: 8179978 usecs
Vswap:
real: mean: 199.85s, stdev: 6.09s
user: mean: 46.51s, stdev: 0.25s
sys: mean: 137.24s, stdev: 6.46s
average throughput: 367437 KB/s
average free time: 9887107.6 usecs
(command is time ./usemem --init-time -w -O -s 10 -n 1 56g)
I think I figured out where the bulk of the regression lies - it's in
the PTE zapping path. In a nutshell, we're not batching in the case
where these PTEs are backed by virtual swap entries with zswap
backends (even though there is no good reason not to batch), and we
perform unnecessary xarray lookups to resolve the backend for some
superfluous checks (2 xarray lookups for every PTE, which is wasted
work because, as noted earlier, we end up not batching anyway).
Just by fixing this issue, the gap is much smaller:
real: mean: 192.24s, stdev: 4.82s
user: mean: 46.42s, stdev: 0.27s
sys: mean: 129.84s, stdev: 4.59s
average throughput: 380670 KB/s
average free time: 8583381.4 usecs
I also discovered a couple more inefficiencies in the vswap free path.
Hopefully once we fix those, the gap will be non-existent.
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH] vswap: fix poor batching behavior of vswap free path
2026-02-10 17:59 ` Kairui Song
` (2 preceding siblings ...)
2026-02-10 21:58 ` Chris Li
@ 2026-02-20 21:05 ` Nhat Pham
3 siblings, 0 replies; 52+ messages in thread
From: Nhat Pham @ 2026-02-20 21:05 UTC (permalink / raw)
To: kasong
Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy,
lance.yang, lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
nphamcs, pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
zhengqi.arch, ziy, kernel-team, riel
Kairui, could you apply this patch on top of the vswap series and run it
on your test suite? It runs fairly well on my system (I actually reran
the benchmark on a different host to double check as well), but I'd love
to get some data from your end as well.
If there are serious discrepancies, could you also include your build
config etc.? There might be differences in our setups, but since I
managed to reproduce the free time regression on my first try, I figured
I should just fix it first :)
---------------
Fix two issues that make the swap free path inefficient:
1. At the PTE zapping step, we are unnecessarily resolving the backends
and falling back to a batch size of 1, even though the virtual swap
infrastructure now already supports freeing mixed-backend ranges
(as long as the PTEs contain virtually contiguous swap slots).
2. Optimize vswap_free() by batching consecutive free operations and
avoiding releasing locks unnecessarily (most notably, when we release
non-disk-swap backends).
Per a report from Kairui Song ([1]), I have run the following benchmark:
free -m
total used free shared buff/cache available
Mem: 31596 5094 11667 19 15302 26502
Swap: 65535 33 65502
Running the usemem benchmark with n = 1, 56G for 5 times, and average
out the result:
Baseline (6.19):
real: mean: 190.93s, stdev: 5.09s
user: mean: 46.62s, stdev: 0.27s
sys: mean: 128.51s, stdev: 5.17s
throughput: mean: 382093 KB/s, stdev: 11173.6 KB/s
free time: mean: 7916690.2 usecs, stdev: 88923.0 usecs
VSS without this patch:
real: mean: 194.59s, stdev: 7.61s
user: mean: 46.71s, stdev: 0.46s
sys: mean: 131.97s, stdev: 7.93s
throughput: mean: 379236.4 KB/s, stdev: 15912.26 KB/s
free time: mean: 10115572.2 usecs, stdev: 108318.35 usecs
VSS with this patch:
real: mean: 187.66s, stdev: 5.67s
user: mean: 46.5s, stdev: 0.16s
sys: mean: 125.3s, stdev: 5.58s
throughput: mean: 387506.4 KB/s, stdev: 12556.56 KB/s
free time: mean: 7029733.8 usecs, stdev: 124661.34 usecs
[1]: https://lore.kernel.org/linux-mm/CAMgjq7AQNGK-a=AOgvn4-V+zGO21QMbMTVbrYSW_R2oDSLoC+A@mail.gmail.com/
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/memcontrol.h | 6 +
mm/internal.h | 18 ++-
mm/madvise.c | 2 +-
mm/memcontrol.c | 2 +-
mm/memory.c | 8 +-
mm/vswap.c | 294 ++++++++++++++++++-------------------
6 files changed, 165 insertions(+), 165 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0651865a4564f..0f7f5489e1675 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -827,6 +827,7 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
return memcg->id.id;
}
struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
+void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
#ifdef CONFIG_SHRINKER_DEBUG
static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
@@ -1289,6 +1290,11 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
return NULL;
}
+static inline void mem_cgroup_id_put_many(struct mem_cgroup *memcg,
+ unsigned int n)
+{
+}
+
#ifdef CONFIG_SHRINKER_DEBUG
static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
{
diff --git a/mm/internal.h b/mm/internal.h
index cfe97501e4885..df991f601702c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -327,8 +327,6 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
return (swp_entry_t) { entry.val + n };
}
-swp_entry_t swap_move(swp_entry_t entry, long delta);
-
/**
* pte_move_swp_offset - Move the swap entry offset field of a swap pte
* forward or backward by delta
@@ -342,7 +340,7 @@ swp_entry_t swap_move(swp_entry_t entry, long delta);
static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
softleaf_t entry = softleaf_from_pte(pte);
- pte_t new = swp_entry_to_pte(swap_move(entry, delta));
+ pte_t new = swp_entry_to_pte(swap_nth(entry, delta));
if (pte_swp_soft_dirty(pte))
new = pte_swp_mksoft_dirty(new);
@@ -372,6 +370,7 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
* @start_ptep: Page table pointer for the first entry.
* @max_nr: The maximum number of table entries to consider.
* @pte: Page table entry for the first entry.
+ * @free_batch: Whether the batch will be passed to free_swap_and_cache_nr().
*
* Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
* containing swap entries all with consecutive offsets and targeting the same
@@ -382,13 +381,15 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
*
* Return: the number of table entries in the batch.
*/
-static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte,
+ bool free_batch)
{
pte_t expected_pte = pte_next_swp_offset(pte);
const pte_t *end_ptep = start_ptep + max_nr;
const softleaf_t entry = softleaf_from_pte(pte);
pte_t *ptep = start_ptep + 1;
unsigned short cgroup_id;
+ int nr;
VM_WARN_ON(max_nr < 1);
VM_WARN_ON(!softleaf_is_swap(entry));
@@ -408,7 +409,14 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
ptep++;
}
- return ptep - start_ptep;
+ nr = ptep - start_ptep;
+ /*
+ * free_swap_and_cache_nr can handle mixed backends, as long as virtual
+ * swap entries backing these PTEs are contiguous.
+ */
+ if (!free_batch && !vswap_can_swapin_thp(entry, nr))
+ return 1;
+ return nr;
}
#endif /* CONFIG_MMU */
diff --git a/mm/madvise.c b/mm/madvise.c
index b617b1be0f535..441da03c5d2b9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -692,7 +692,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (softleaf_is_swap(entry)) {
max_nr = (end - addr) / PAGE_SIZE;
- nr = swap_pte_batch(pte, max_nr, ptent);
+ nr = swap_pte_batch(pte, max_nr, ptent, true);
nr_swap -= nr;
free_swap_and_cache_nr(entry, nr);
clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50be8066bebec..bfa25eaffa12a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3597,7 +3597,7 @@ void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg,
refcount_add(n, &memcg->id.ref);
}
-static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
+void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
{
if (refcount_sub_and_test(n, &memcg->id.ref)) {
mem_cgroup_id_remove(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index a16bf84ebaaf9..59645ad238e22 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1742,7 +1742,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
if (!should_zap_cows(details))
return 1;
- nr = swap_pte_batch(pte, max_nr, ptent);
+ nr = swap_pte_batch(pte, max_nr, ptent, true);
rss[MM_SWAPENTS] -= nr;
free_swap_and_cache_nr(entry, nr);
} else if (softleaf_is_migration(entry)) {
@@ -4491,7 +4491,7 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
return false;
entry = softleaf_from_pte(pte);
- if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+ if (swap_pte_batch(ptep, nr_pages, pte, false) != nr_pages)
return false;
/*
@@ -4877,7 +4877,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t folio_pte = ptep_get(folio_ptep);
if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
- swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+ swap_pte_batch(folio_ptep, nr, folio_pte, false) != nr)
goto out_nomap;
page_idx = idx;
@@ -4906,7 +4906,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_ptep = vmf->pte - idx;
folio_pte = ptep_get(folio_ptep);
if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
- swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+ swap_pte_batch(folio_ptep, nr, folio_pte, false) != nr)
goto check_folio;
page_idx = idx;
diff --git a/mm/vswap.c b/mm/vswap.c
index 2a071d5ae173c..047c6476ef23c 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -481,18 +481,18 @@ static void vswap_cluster_free(struct vswap_cluster *cluster)
kvfree_rcu(cluster, rcu);
}
-static inline void release_vswap_slot(struct vswap_cluster *cluster,
- unsigned long index)
+static inline void release_vswap_slot_nr(struct vswap_cluster *cluster,
+ unsigned long index, int nr)
{
unsigned long slot_index = VSWAP_IDX_WITHIN_CLUSTER_VAL(index);
VM_WARN_ON(!spin_is_locked(&cluster->lock));
- cluster->count--;
+ cluster->count -= nr;
- bitmap_clear(cluster->bitmap, slot_index, 1);
+ bitmap_clear(cluster->bitmap, slot_index, nr);
/* we only free uncached empty clusters */
- if (refcount_dec_and_test(&cluster->refcnt))
+ if (refcount_sub_and_test(nr, &cluster->refcnt))
vswap_cluster_free(cluster);
else if (cluster->full && cluster_is_alloc_candidate(cluster)) {
cluster->full = false;
@@ -505,7 +505,7 @@ static inline void release_vswap_slot(struct vswap_cluster *cluster,
}
}
- atomic_dec(&vswap_used);
+ atomic_sub(nr, &vswap_used);
}
/*
@@ -527,23 +527,29 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
}
/*
- * Caller needs to handle races with other operations themselves.
+ * release_backing - release the backend storage for a given range of virtual
+ * swap slots.
+ *
+ * Entered with the cluster locked, but might drop the lock in between.
+ * This is because several operations, such as releasing physical swap slots
+ * (i.e swap_slot_free_nr()) require the cluster to be unlocked to avoid
+ * deadlocks.
*
- * Specifically, this function is safe to be called in contexts where the swap
- * entry has been added to the swap cache and the associated folio is locked.
- * We cannot race with other accessors, and the swap entry is guaranteed to be
- * valid the whole time (since swap cache implies one refcount).
+ * This is safe, because:
+ *
+ * 1. The swap entry to be freed has refcnt (swap count and swapcache pin)
+ * down to 0, so no one can change its internal state
*
- * We cannot assume that the backends will be of the same type,
- * contiguous, etc. We might have a large folio coalesced from subpages with
- * mixed backend, which is only rectified when it is reclaimed.
+ * 2. The swap entry to be freed still holds a refcnt to the cluster, keeping
+ * the cluster itself valid.
+ *
+ * We will exit the function with the cluster re-locked.
*/
- static void release_backing(swp_entry_t entry, int nr)
+static void release_backing(struct vswap_cluster *cluster, swp_entry_t entry,
+ int nr)
{
- struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
unsigned long flush_nr, phys_swap_start = 0, phys_swap_end = 0;
- unsigned long phys_swap_released = 0;
unsigned int phys_swap_type = 0;
bool need_flushing_phys_swap = false;
swp_slot_t flush_slot;
@@ -551,9 +557,8 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
VM_WARN_ON(!entry.val);
- rcu_read_lock();
for (i = 0; i < nr; i++) {
- desc = vswap_iter(&cluster, entry.val + i);
+ desc = __vswap_iter(cluster, entry.val + i);
VM_WARN_ON(!desc);
/*
@@ -573,7 +578,6 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
if (desc->type == VSWAP_ZSWAP && desc->zswap_entry) {
zswap_entry_free(desc->zswap_entry);
} else if (desc->type == VSWAP_SWAPFILE) {
- phys_swap_released++;
if (!phys_swap_start) {
/* start a new contiguous range of phys swap */
phys_swap_start = swp_slot_offset(desc->slot);
@@ -589,56 +593,49 @@ void vswap_rmap_set(struct swap_cluster_info *ci, swp_slot_t slot,
if (need_flushing_phys_swap) {
spin_unlock(&cluster->lock);
- cluster = NULL;
swap_slot_free_nr(flush_slot, flush_nr);
+ mem_cgroup_uncharge_swap(entry, flush_nr);
+ spin_lock(&cluster->lock);
need_flushing_phys_swap = false;
}
}
- if (cluster)
- spin_unlock(&cluster->lock);
- rcu_read_unlock();
/* Flush any remaining physical swap range */
if (phys_swap_start) {
flush_slot = swp_slot(phys_swap_type, phys_swap_start);
flush_nr = phys_swap_end - phys_swap_start;
+ spin_unlock(&cluster->lock);
swap_slot_free_nr(flush_slot, flush_nr);
+ mem_cgroup_uncharge_swap(entry, flush_nr);
+ spin_lock(&cluster->lock);
}
+}
- if (phys_swap_released)
- mem_cgroup_uncharge_swap(entry, phys_swap_released);
- }
+static void __vswap_swap_cgroup_clear(struct vswap_cluster *cluster,
+ swp_entry_t entry, unsigned int nr_ents);
/*
- * Entered with the cluster locked, but might unlock the cluster.
- * This is because several operations, such as releasing physical swap slots
- * (i.e swap_slot_free_nr()) require the cluster to be unlocked to avoid
- * deadlocks.
- *
- * This is safe, because:
- *
- * 1. The swap entry to be freed has refcnt (swap count and swapcache pin)
- * down to 0, so no one can change its internal state
- *
- * 2. The swap entry to be freed still holds a refcnt to the cluster, keeping
- * the cluster itself valid.
- *
- * We will exit the function with the cluster re-locked.
+ * Entered with the cluster locked. We will exit the function with the cluster
+ * still locked.
*/
-static void vswap_free(struct vswap_cluster *cluster, struct swp_desc *desc,
- swp_entry_t entry)
+static void vswap_free_nr(struct vswap_cluster *cluster, swp_entry_t entry,
+ int nr)
{
- /* Clear shadow if present */
- if (xa_is_value(desc->shadow))
- desc->shadow = NULL;
- spin_unlock(&cluster->lock);
+ struct swp_desc *desc;
+ int i;
- release_backing(entry, 1);
- mem_cgroup_clear_swap(entry, 1);
+ for (i = 0; i < nr; i++) {
+ desc = __vswap_iter(cluster, entry.val + i);
+ /* Clear shadow if present */
+ if (xa_is_value(desc->shadow))
+ desc->shadow = NULL;
+ }
- /* erase forward mapping and release the virtual slot for reallocation */
- spin_lock(&cluster->lock);
- release_vswap_slot(cluster, entry.val);
+ release_backing(cluster, entry, nr);
+ __vswap_swap_cgroup_clear(cluster, entry, nr);
+
+ /* erase forward mapping and release the virtual slots for reallocation */
+ release_vswap_slot_nr(cluster, entry.val, nr);
}
/**
@@ -820,18 +817,32 @@ static bool vswap_free_nr_any_cache_only(swp_entry_t entry, int nr)
struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
bool ret = false;
- int i;
+ swp_entry_t free_start;
+ int i, free_nr = 0;
+ free_start.val = 0;
rcu_read_lock();
for (i = 0; i < nr; i++) {
+ /* flush pending free batch at cluster boundary */
+ if (free_nr && !VSWAP_IDX_WITHIN_CLUSTER_VAL(entry.val)) {
+ vswap_free_nr(cluster, free_start, free_nr);
+ free_nr = 0;
+ }
desc = vswap_iter(&cluster, entry.val);
VM_WARN_ON(!desc);
ret |= (desc->swap_count == 1 && desc->in_swapcache);
desc->swap_count--;
- if (!desc->swap_count && !desc->in_swapcache)
- vswap_free(cluster, desc, entry);
+ if (!desc->swap_count && !desc->in_swapcache) {
+ if (!free_nr++)
+ free_start = entry;
+ } else if (free_nr) {
+ vswap_free_nr(cluster, free_start, free_nr);
+ free_nr = 0;
+ }
entry.val++;
}
+ if (free_nr)
+ vswap_free_nr(cluster, free_start, free_nr);
if (cluster)
spin_unlock(&cluster->lock);
rcu_read_unlock();
@@ -954,19 +965,33 @@ void swapcache_clear(swp_entry_t entry, int nr)
{
struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
- int i;
+ swp_entry_t free_start;
+ int i, free_nr = 0;
if (!nr)
return;
+ free_start.val = 0;
rcu_read_lock();
for (i = 0; i < nr; i++) {
+ /* flush pending free batch at cluster boundary */
+ if (free_nr && !VSWAP_IDX_WITHIN_CLUSTER_VAL(entry.val)) {
+ vswap_free_nr(cluster, free_start, free_nr);
+ free_nr = 0;
+ }
desc = vswap_iter(&cluster, entry.val);
desc->in_swapcache = false;
- if (!desc->swap_count)
- vswap_free(cluster, desc, entry);
+ if (!desc->swap_count) {
+ if (!free_nr++)
+ free_start = entry;
+ } else if (free_nr) {
+ vswap_free_nr(cluster, free_start, free_nr);
+ free_nr = 0;
+ }
entry.val++;
}
+ if (free_nr)
+ vswap_free_nr(cluster, free_start, free_nr);
if (cluster)
spin_unlock(&cluster->lock);
rcu_read_unlock();
@@ -1107,11 +1132,13 @@ void vswap_store_folio(swp_entry_t entry, struct folio *folio)
VM_BUG_ON(!folio_test_locked(folio));
VM_BUG_ON(folio->swap.val != entry.val);
- release_backing(entry, nr);
-
rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ VM_WARN_ON(!desc);
+ release_backing(cluster, entry, nr);
+
for (i = 0; i < nr; i++) {
- desc = vswap_iter(&cluster, entry.val + i);
+ desc = __vswap_iter(cluster, entry.val + i);
VM_WARN_ON(!desc);
desc->type = VSWAP_FOLIO;
desc->swap_cache = folio;
@@ -1136,11 +1163,13 @@ void swap_zeromap_folio_set(struct folio *folio)
VM_BUG_ON(!folio_test_locked(folio));
VM_BUG_ON(!entry.val);
- release_backing(entry, nr);
-
rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ VM_WARN_ON(!desc);
+ release_backing(cluster, entry, nr);
+
for (i = 0; i < nr; i++) {
- desc = vswap_iter(&cluster, entry.val + i);
+ desc = __vswap_iter(cluster, entry.val + i);
VM_WARN_ON(!desc);
desc->type = VSWAP_ZERO;
}
@@ -1261,89 +1290,6 @@ bool vswap_can_swapin_thp(swp_entry_t entry, int nr)
(type == VSWAP_ZERO || type == VSWAP_SWAPFILE);
}
-/**
- * swap_move - increment the swap slot by delta, checking the backing state and
- * return 0 if the backing state does not match (i.e wrong backing
- * state type, or wrong offset on the backing stores).
- * @entry: the original virtual swap slot.
- * @delta: the offset to increment the original slot.
- *
- * Note that this function is racy unless we can pin the backing state of these
- * swap slots down with swapcache_prepare().
- *
- * Caller should only rely on this function as a best-effort hint otherwise,
- * and should double-check after ensuring the whole range is pinned down.
- *
- * Return: the incremented virtual swap slot if the backing state matches, or
- * 0 if the backing state does not match.
- */
-swp_entry_t swap_move(swp_entry_t entry, long delta)
-{
- struct vswap_cluster *cluster = NULL;
- struct swp_desc *desc, *next_desc;
- swp_entry_t next_entry;
- struct folio *folio = NULL, *next_folio = NULL;
- enum swap_type type, next_type;
- swp_slot_t slot = {0}, next_slot = {0};
-
- next_entry.val = entry.val + delta;
-
- rcu_read_lock();
-
- /* Look up first descriptor and get its type and backing store */
- desc = vswap_iter(&cluster, entry.val);
- if (!desc) {
- rcu_read_unlock();
- return (swp_entry_t){0};
- }
-
- type = desc->type;
- if (type == VSWAP_ZSWAP) {
- /* zswap not supported for move */
- spin_unlock(&cluster->lock);
- rcu_read_unlock();
- return (swp_entry_t){0};
- }
- if (type == VSWAP_FOLIO)
- folio = desc->swap_cache;
- else if (type == VSWAP_SWAPFILE)
- slot = desc->slot;
-
- /* Look up second descriptor and get its type and backing store */
- next_desc = vswap_iter(&cluster, next_entry.val);
- if (!next_desc) {
- rcu_read_unlock();
- return (swp_entry_t){0};
- }
-
- next_type = next_desc->type;
- if (next_type == VSWAP_FOLIO)
- next_folio = next_desc->swap_cache;
- else if (next_type == VSWAP_SWAPFILE)
- next_slot = next_desc->slot;
-
- if (cluster)
- spin_unlock(&cluster->lock);
-
- rcu_read_unlock();
-
- /* Check if types match */
- if (next_type != type)
- return (swp_entry_t){0};
-
- /* Check backing state consistency */
- if (type == VSWAP_SWAPFILE &&
- (swp_slot_type(next_slot) != swp_slot_type(slot) ||
- swp_slot_offset(next_slot) !=
- swp_slot_offset(slot) + delta))
- return (swp_entry_t){0};
-
- if (type == VSWAP_FOLIO && next_folio != folio)
- return (swp_entry_t){0};
-
- return next_entry;
-}
-
/*
* Return the count of contiguous swap entries that share the same
* VSWAP_ZERO status as the starting entry. If is_zeromap is not NULL,
@@ -1863,11 +1809,10 @@ void zswap_entry_store(swp_entry_t swpentry, struct zswap_entry *entry)
struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
- release_backing(swpentry, 1);
-
rcu_read_lock();
desc = vswap_iter(&cluster, swpentry.val);
VM_WARN_ON(!desc);
+ release_backing(cluster, swpentry, 1);
desc->zswap_entry = entry;
desc->type = VSWAP_ZSWAP;
spin_unlock(&cluster->lock);
@@ -1914,17 +1859,22 @@ bool zswap_empty(swp_entry_t swpentry)
#endif /* CONFIG_ZSWAP */
#ifdef CONFIG_MEMCG
-static unsigned short vswap_cgroup_record(swp_entry_t entry,
- unsigned short memcgid, unsigned int nr_ents)
+/*
+ * __vswap_cgroup_record - record mem_cgroup for a set of swap entries
+ *
+ * Entered with the cluster locked. We will exit the function with the cluster
+ * still locked.
+ */
+static unsigned short __vswap_cgroup_record(struct vswap_cluster *cluster,
+ swp_entry_t entry, unsigned short memcgid,
+ unsigned int nr_ents)
{
- struct vswap_cluster *cluster = NULL;
struct swp_desc *desc;
unsigned short oldid, iter = 0;
int i;
- rcu_read_lock();
for (i = 0; i < nr_ents; i++) {
- desc = vswap_iter(&cluster, entry.val + i);
+ desc = __vswap_iter(cluster, entry.val + i);
VM_WARN_ON(!desc);
oldid = desc->memcgid;
desc->memcgid = memcgid;
@@ -1932,6 +1882,37 @@ static unsigned short vswap_cgroup_record(swp_entry_t entry,
iter = oldid;
VM_WARN_ON(iter != oldid);
}
+
+ return oldid;
+}
+
+/*
+ * Clear swap cgroup for a range of swap entries.
+ * Entered with the cluster locked. Caller must be under rcu_read_lock().
+ */
+static void __vswap_swap_cgroup_clear(struct vswap_cluster *cluster,
+ swp_entry_t entry, unsigned int nr_ents)
+{
+ unsigned short id;
+ struct mem_cgroup *memcg;
+
+ id = __vswap_cgroup_record(cluster, entry, 0, nr_ents);
+ memcg = mem_cgroup_from_id(id);
+ if (memcg)
+ mem_cgroup_id_put_many(memcg, nr_ents);
+}
+
+static unsigned short vswap_cgroup_record(swp_entry_t entry,
+ unsigned short memcgid, unsigned int nr_ents)
+{
+ struct vswap_cluster *cluster = NULL;
+ struct swp_desc *desc;
+ unsigned short oldid;
+
+ rcu_read_lock();
+ desc = vswap_iter(&cluster, entry.val);
+ VM_WARN_ON(!desc);
+ oldid = __vswap_cgroup_record(cluster, entry, memcgid, nr_ents);
spin_unlock(&cluster->lock);
rcu_read_unlock();
@@ -1999,6 +1980,11 @@ unsigned short lookup_swap_cgroup_id(swp_entry_t entry)
rcu_read_unlock();
return ret;
}
+#else /* !CONFIG_MEMCG */
+static void __vswap_swap_cgroup_clear(struct vswap_cluster *cluster,
+ swp_entry_t entry, unsigned int nr_ents)
+{
+}
#endif /* CONFIG_MEMCG */
int vswap_init(void)
--
2.47.3
^ permalink raw reply [flat|nested] 52+ messages in thread
end of thread, other threads:[~2026-02-20 22:38 UTC | newest]
Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-08 21:58 [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-08 21:58 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure Nhat Pham
2026-02-08 22:26 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-10 17:59 ` Kairui Song
2026-02-10 18:52 ` Johannes Weiner
2026-02-10 19:11 ` Nhat Pham
2026-02-10 19:23 ` Nhat Pham
2026-02-12 5:07 ` Chris Li
2026-02-17 23:36 ` Nhat Pham
2026-02-10 21:58 ` Chris Li
2026-02-20 21:05 ` [PATCH] vswap: fix poor batching behavior of vswap free path Nhat Pham
2026-02-08 22:31 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-09 12:20 ` Chris Li
2026-02-10 2:36 ` Johannes Weiner
2026-02-10 21:24 ` Chris Li
2026-02-10 23:01 ` Johannes Weiner
2026-02-10 18:00 ` Nhat Pham
2026-02-10 23:17 ` Chris Li
2026-02-08 22:39 ` Nhat Pham
2026-02-09 2:22 ` [PATCH v3 01/20] mm/swap: decouple swap cache from physical swap infrastructure kernel test robot
2026-02-08 21:58 ` [PATCH v3 02/20] swap: rearrange the swap header file Nhat Pham
2026-02-08 21:58 ` [PATCH v3 03/20] mm: swap: add an abstract API for locking out swapoff Nhat Pham
2026-02-08 21:58 ` [PATCH v3 04/20] zswap: add new helpers for zswap entry operations Nhat Pham
2026-02-08 21:58 ` [PATCH v3 05/20] mm/swap: add a new function to check if a swap entry is in swap cached Nhat Pham
2026-02-08 21:58 ` [PATCH v3 06/20] mm: swap: add a separate type for physical swap slots Nhat Pham
2026-02-08 21:58 ` [PATCH v3 07/20] mm: create scaffolds for the new virtual swap implementation Nhat Pham
2026-02-08 21:58 ` [PATCH v3 08/20] zswap: prepare zswap for swap virtualization Nhat Pham
2026-02-08 21:58 ` [PATCH v3 09/20] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
2026-02-09 17:12 ` kernel test robot
2026-02-11 13:42 ` kernel test robot
2026-02-08 21:58 ` [PATCH v3 10/20] swap: move swap cache to virtual swap descriptor Nhat Pham
2026-02-08 21:58 ` [PATCH v3 11/20] zswap: move zswap entry management to the " Nhat Pham
2026-02-08 21:58 ` [PATCH v3 12/20] swap: implement the swap_cgroup API using virtual swap Nhat Pham
2026-02-08 21:58 ` [PATCH v3 13/20] swap: manage swap entry lifecycle at the virtual swap layer Nhat Pham
2026-02-08 21:58 ` [PATCH v3 14/20] mm: swap: decouple virtual swap slot from backing store Nhat Pham
2026-02-10 6:31 ` Dan Carpenter
2026-02-08 21:58 ` [PATCH v3 15/20] zswap: do not start zswap shrinker if there is no physical swap slots Nhat Pham
2026-02-08 21:58 ` [PATCH v3 16/20] swap: do not unnecesarily pin readahead swap entries Nhat Pham
2026-02-08 21:58 ` [PATCH v3 17/20] swapfile: remove zeromap bitmap Nhat Pham
2026-02-08 21:58 ` [PATCH v3 18/20] memcg: swap: only charge physical swap slots Nhat Pham
2026-02-09 2:01 ` kernel test robot
2026-02-09 2:12 ` kernel test robot
2026-02-08 21:58 ` [PATCH v3 19/20] swap: simplify swapoff using virtual swap Nhat Pham
2026-02-08 21:58 ` [PATCH v3 20/20] swapfile: replace the swap map with bitmaps Nhat Pham
2026-02-08 22:51 ` [PATCH v3 00/20] Virtual Swap Space Nhat Pham
2026-02-12 12:23 ` David Hildenbrand (Arm)
2026-02-12 17:29 ` Nhat Pham
2026-02-12 17:39 ` Nhat Pham
2026-02-12 20:11 ` David Hildenbrand (Arm)
2026-02-12 17:41 ` David Hildenbrand (Arm)
2026-02-12 17:45 ` Nhat Pham
2026-02-10 15:45 ` [syzbot ci] " syzbot ci