* [LSF/MM/BPF TOPIC] Swap status and roadmap discussion
@ 2026-02-21 10:50 Kairui Song
2026-02-23 18:38 ` Nhat Pham
From: Kairui Song @ 2026-02-21 10:50 UTC (permalink / raw)
To: lsf-pc
Cc: Kairui Song, Chris Li, YoungJun Park, Barry Song, Baoquan He,
linux-mm, Nhat Pham, Johannes Weiner
Last year, we successfully cleaned up the swap subsystem using the swap
table design [1], and that's not the end of the story. Combined with the
layered swap table, the ghost swap posted by Chris, YoungJun's swap tiering
[2] [3], and Nhat's idea of a dynamic swap size [4], we can have a
flexible, feature-rich swap subsystem. Importantly, both the CPU and memory
overhead will be minimal for all users in all scenarios, lower than with
the old swap system. Every component is runtime optional, configurable, and
highly compatible with future features (e.g. I just noticed Baoquan's
swapops [5], which should fit well here, as should swap table compaction
based on the full list).
We should be able to achieve a solution that users ranging from sub-GB
devices to TB-level servers will all benefit from.
Based on the swap table P4 RFC [6], we will achieve (see details in that
series):
- 8 bytes per slot of memory usage for plain swap.
  - This can be reduced to 3 or even 1 byte.
- 16 bytes per slot when using ghost / virtual zswap.
- 24 bytes per slot at most for multi-layer.
  - This can be reduced too, by reusing the same infrastructure as above.
- Minimal code review and maintenance burden. All layers use the same
  infrastructure to manage metadata, allocation and synchronization, making
  all APIs and conventions consistent and easy to maintain.
- Every component is minimal, runtime optional and high-performance, so
  existing users of ZRAM or high-performance devices have literally zero
  overhead.
- The ghost / virtual swapfile has a dynamic or infinite size with no
  static data overhead.
- Migration and compaction are also easily supportable, as both reverse
  mapping and reallocation are prepared for.
- High compatibility with YoungJun's swap tiering, because everything is
  just a device [2] [3].
- Solves large-order swapout and minimum swap order requirements.
- Fast swapoff is also supported, by just reading the swap entries into
  the ghost / vswap's swap cache.
Besides these, swap now has the opportunity for even further
optimizations, e.g. PG_drop for anon reclaim, since swap now has a unified
convention, or reducing rmap lock contention as was once suggested by Barry
Song [7]. Growth of a static swap file can also be added later, so plain
swap on top of things like LVM can finally grow without causing memory
pressure.
There are also unsolved design decisions that need discussion, such as:
- Should we use swapon / swapoff on the virtual / ghost device, expose it
  in other ways, or enable it by default? Using the classical swapon /
  swapoff provides huge flexibility; on-by-default is also doable and
  hides complexity.
- Should we expose special devices like /dev/xswap, or just use a dummy
  swap header file?
- How, or should we, report the usage of ghost / virtual swap devices as
  ordinary swap under /proc/swaps? We definitely need some way to report
  it.
- Is 64 bits really needed for the reverse mapping? For context, the
  reverse mapping here is a swap entry, recorded in a lower / physical
  device, pointing back to the ghost / virtual device.
- The swap device size is now just a number. To adjust it we need an
  interface, and what kind of interface is the best choice? Or should we
  just make it dynamic (e.g. increase by 2M for every cluster allocated)?
Link: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/CAMgjq7BA_2-5iCvS-vp9ZEoG=1DwHWYuVZOuH8DWH9wzdoC00g@mail.gmail.com/ [2]
Link: https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/ [3]
Link: https://lore.kernel.org/linux-mm/20260208215839.87595-1-nphamcs@gmail.com/ [4]
Link: https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/ [5]
Link: https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-0-104795d19815@tencent.com/ [6]
Link: https://lore.kernel.org/linux-mm/20250513084620.58231-1-21cnbao@gmail.com/ [7]
* Re: [LSF/MM/BPF TOPIC] Swap status and roadmap discussion
2026-02-21 10:50 [LSF/MM/BPF TOPIC] Swap status and roadmap discussion Kairui Song
@ 2026-02-23 18:38 ` Nhat Pham
2026-02-23 18:55 ` Yosry Ahmed
From: Nhat Pham @ 2026-02-23 18:38 UTC (permalink / raw)
To: Kairui Song
Cc: lsf-pc, Kairui Song, Chris Li, YoungJun Park, Barry Song,
Baoquan He, linux-mm, Johannes Weiner
On Sat, Feb 21, 2026 at 2:50 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Last year, we successfully cleaned up the swap subsystem using the swap
> table design [1], and that's not the end of the story. Combined with
> layered swap table, ghost swap as posted by Chris, YoungJun's swap tiering
> [2] [3], and Nhat's idea of having a dynamic swap size [4], we can have a
> flexible, feature-rich swap. And importantly, the overhead of both CPU and
> memory will be minimal for all users in all scenarios, lower than the old
> swap system. And every component is runtime optional, configurable, and
> highly compatible with future features (e.g. I just noticed Baoquan's
> swapops [5] which should fit well here. Swap table compaction based
> on full list too).
I'd love to chat more about this too :)
>
> We should be able to achieve a solution that users ranging from sub-GB
> devices to TB-level servers will all benefit from.
>
> Based on the swap table P4 RFC [6], we will achieve (see detail in that
> series):
> - 8 bytes per slot memory usage for plain swap.
> - And can be reduced to 3 or only 1 byte.
> - 16 bytes per slot memory usage, when using ghost / virtual zswap.
> - 24 bytes at most for multi-layer.
> - And can be reduced too by simply using the same infrastructure above.
> - Minimal code review or maintenance burden. All layers are using the same
> infrastructure to manage the metadata/allocation/synchronization, making
> all APIs and conventions consistent and easy to maintain.
> - Every component is minimal, runtime optional and high-performance so
> existing users of ZRAM or high performance devices have literally zero
> overhead.
> - The ghost / virtual swapfile has a dynamic or infinite size with no
> static data overhead.
> - Migration and compaction are also easily supportable as both reverse
> mapping and reallocation are prepared.
> - Highly compatible with YoungJun's swap tier, because everything is just a
> device [2] [3].
> - Solves large-order swapout and minimum swap order requirements.
> - The fast swapoff feature is also supported by just reading the swap entry
> into the ghost / vswap's swap cache.
>
> And besides these, swap now has the opportunity for even further
> optimizations, e.g. PG_drop for anon reclaim since swap now has a unified
> convention; Reducing rmap lock contention as was once suggested by Barry
> Song [7]. Growth of the static swap file can also be added later, so plain
> swap on top of things like LVM can finally grow without causing memory
> pressure.
>
> And there are unsolved design decisions that need discussion, such as:
> - Should we use swapon / swapoff on the virtual / ghost device? Or expose
> it in other ways, or make it on by default? Using the classical swapon /
> off provides huge flexibility; on by default is also doable and hides
> complexity.
I don't think we should put a limit on the virtual swap space per se, as
we are not consuming a real, physical, scarce resource.
We should put limits on the physical backends themselves, where
appropriate (see [1]).
> - Should we expose special devices like /dev/xswap, or just use a dummy
> swap header file?
> - How to, or should we report the usage of ghost / virtual swap devices as
> ordinary swap under /proc/swaps? We definitely need some way to report
> that.
Honestly, just a couple of sysfs counters? :)
> - Is 64 bits really needed for reverse mapping? For the context, reverse
> mapping here is a swap entry recorded in a lower / physical device
> pointing to the ghost / virtual device.
I think you can compact this a bit. The swap space itself is not fully 64
bits, right?
Just not sure if the juice is worth the squeeze to save a couple of
bits here and there, especially if the reverse mapping is already
dynamic :)
> - The swap device size is now just a number, to adjust that, we need an
> interface, and what kind of interface is the best choice? Or just
> make it dynamic (e.g. increase by 2M for every cluster allocated)?
This is very type-dependent.
For a physical swapfile, it's consuming a limited physical resource
(disk space), so it should be decided by userspace. It would be nice to
make swapfiles extensible at runtime though :)
For zswap, I think it really should be dynamic. You can read my
arguments in my virtual swap cover letter (see section I of [1]).
[1]: https://lore.kernel.org/linux-mm/20260208222652.328284-1-nphamcs@gmail.com/
* Re: [LSF/MM/BPF TOPIC] Swap status and roadmap discussion
2026-02-23 18:38 ` Nhat Pham
@ 2026-02-23 18:55 ` Yosry Ahmed
From: Yosry Ahmed @ 2026-02-23 18:55 UTC (permalink / raw)
To: Nhat Pham
Cc: Kairui Song, lsf-pc, Kairui Song, Chris Li, YoungJun Park,
Barry Song, Baoquan He, linux-mm, Johannes Weiner
> > - Is 64 bits really needed for reverse mapping? For the context, reverse
> > mapping here is a swap entry recorded in a lower / physical device
> > pointing to the ghost / virtual device.
>
> I think you can compact this a bit. Swap space itself is not fully 64
> bits right?
>
> Just not sure if the juice is worth the squeeze to save a couple of
> bits here and there, especially if the reverse mapping is already
> dynamic :)
I think we should actually revisit the need for a reverse mapping to
begin with. For swapoff, we can probably scan the virtual swap space
looking for entries that belong to the backend being swapped off. Not
as efficient as a reverse map, but still better than the status quo of
scanning page tables. I don't think optimizing for swapoff is worth
the consistent overhead.
The other use cases are probably cluster readahead and swapcache-only
reclaim, and I think both of these can also be revisited.