From: Nhat Pham <nphamcs@gmail.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org,
Kairui Song <kasong@tencent.com>, Chris Li <chrisl@kernel.org>,
YoungJun Park <youngjun.park@lge.com>,
Barry Song <21cnbao@gmail.com>, Baoquan He <bhe@redhat.com>,
linux-mm <linux-mm@kvack.org>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [LSF/MM/BPF TOPIC] Swap status and roadmap discussion
Date: Mon, 23 Feb 2026 10:38:36 -0800
Message-ID: <CAKEwX=O4ishgvhhZ1ssgbDUQewFamkyFT-uCpEWecWfe8SzwGg@mail.gmail.com>
In-Reply-To: <CAMgjq7CCcsh2twG-Ud2b4U0hZrAMV8zmvLaM=NBDvmoaWziPqg@mail.gmail.com>
On Sat, Feb 21, 2026 at 2:50 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Last year, we successfully cleaned up the swap subsystem using the swap
> table design [1], and that's not the end of the story. Combined with
> layered swap table, ghost swap as posted by Chris, YoungJun's swap tiering
> [2] [3], and Nhat's idea of having a dynamic swap size [4], we can have a
> flexible, feature-rich swap. And importantly, the overhead of both CPU and
> memory will be minimal for all users in all scenarios, lower than the old
> swap system. And every component is runtime optional, configurable, and
> highly compatible with future features (e.g. I just noticed Baoquan's
> swapops [5] which should fit well here. Swap table compaction based
> on full list too).
I'd love to chat more about this too :)
>
> We should be able to achieve a solution that users ranging from sub-GB
> devices to TB-level servers will all benefit from.
>
> Based on the swap table P4 RFC [6], we will achieve (see detail in that
> series):
> - 8 bytes per slot memory usage for plain swap.
> - And can be reduced to 3 or only 1 byte.
> - 16 bytes per slot memory usage, when using ghost / virtual zswap.
> - 24 bytes at most for multi-layer.
> - And can be reduced too by simply using the same infrastructure above.
> - Minimal code review or maintenance burden. All layers are using the same
> infrastructure to manage the metadata/allocation/synchronization, making
> all APIs and conventions consistent and easy to maintain.
> - Every component is minimal, runtime optional and high-performance so
> existing users of ZRAM or high performance devices have literally zero
> overhead.
> - The ghost / virtual swapfile has a dynamic or infinite size with no
> static data overhead.
> - Migration and compaction are also easily supportable as both reverse
> mapping and reallocation are prepared.
> - Highly compatible with YoungJun's swap tier, because everything is just a
> device [2] [3].
> - Solves large-order swapout and minimum swap order requirements.
> - The fast swapoff feature is also supported by just reading the swap entry
> into the ghost / vswap's swap cache.
>
> And besides these, swap now has the opportunity for even further
> optimizations, e.g. PG_drop for anon reclaim since swap now has a unified
> convention; Reducing rmap lock contention as was once suggested by Barry
> Song [7]. Growth of the static swap file can also be added later, so plain
> swap on top of things like LVM can finally grow without causing memory
> pressure.
>
> And there are unsolved design decisions that need discussion, such as:
> - Should we use swapon / swapoff on the virtual / ghost device? Or expose
> it in other ways, or make it on by default? Using the classical swapon /
> off provides huge flexibility; on by default is also doable and hides
> complexity.
I don't think we should put a limit on the virtual swap space per se,
as we are not consuming a real, physical, scarce resource.
We should put limits on the physical backends themselves, where
appropriate (see [1]).
> - Should we expose special devices like /dev/xswap, or just use a dummy
> swap header file?
> - How to, or should we report the usage of ghost / virtual swap devices as
> ordinary swap under /proc/swaps? We definitely need some way to report
> that.
Honestly, just a couple of sysfs counters? :)
> - Is 64 bits really needed for reverse mapping? For the context, reverse
> mapping here is a swap entry recorded in a lower / physical device
> pointing to the ghost / virtual device.
I think you can compact this a bit. Swap space itself is not fully 64
bits, right?
Just not sure if the juice is worth the squeeze to save a couple of
bits here and there, especially if the reverse mapping is already
dynamic :)
> - The swap device size is now just a number, to adjust that, we need an
> interface, and what kind of interface is the best choice? Or just
> make it dynamic (e.g. increase by 2M for every cluster allocated)?
This is very type-dependent.
For a physical swapfile, we are consuming a limited physical resource
(disk space), so the size should be decided by userspace. It would be
nice to make swapfiles extensible at runtime, though :)
For zswap, I think it really should be dynamic. You can read my
arguments in the virtual swap cover letter (see section I of [1]).
[1]: https://lore.kernel.org/linux-mm/20260208222652.328284-1-nphamcs@gmail.com/