From: Kairui Song <ryncsn@gmail.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev,
mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
len.brown@intel.com, chengming.zhou@linux.dev,
chrisl@kernel.org, huang.ying.caritas@gmail.com,
ryan.roberts@arm.com, shikemeng@huaweicloud.com,
viro@zeniv.linux.org.uk, baohua@kernel.org, bhe@redhat.com,
osalvador@suse.de, christophe.leroy@csgroup.eu,
pavel@kernel.org, kernel-team@meta.com,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-pm@vger.kernel.org, peterx@redhat.com, riel@surriel.com,
joshua.hahnjy@gmail.com, npache@redhat.com, gourry@gourry.net,
axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, rafael@kernel.org, jannh@google.com,
pfalcato@suse.de, zhengqi.arch@bytedance.com
Subject: Re: [PATCH v3 00/20] Virtual Swap Space
Date: Wed, 11 Feb 2026 01:59:34 +0800 [thread overview]
Message-ID: <CAMgjq7AQNGK-a=AOgvn4-V+zGO21QMbMTVbrYSW_R2oDSLoC+A@mail.gmail.com> (raw)
In-Reply-To: <20260208222652.328284-1-nphamcs@gmail.com>
On Mon, Feb 9, 2026 at 7:57 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Anyway, resending this (in-reply-to patch 1 of the series):
Hi Nhat,
> Changelog:
> * RFC v2 -> v3:
> * Implement a cluster-based allocation algorithm for virtual swap
> slots, inspired by Kairui Song and Chris Li's implementation, as
> well as Johannes Weiner's suggestions. This eliminates the lock
> contention issues on the virtual swap layer.
> * Re-use swap table for the reverse mapping.
> * Remove CONFIG_VIRTUAL_SWAP.
I really do think we'd better make this optional, not a replacement or
mandatory. There are many hard-to-evaluate effects, as this
fundamentally changes the swap workflow with a lot of behavior changes
at once. E.g. it seems the folio will be reactivated instead of
split if the physical swap device is fragmented; the slot is allocated
at IO time rather than at unmap time; and there may be many others. Just
like zswap is optional. Some common workloads would see an obvious
performance or memory usage regression following this design, see below.
> * Reducing the size of the swap descriptor from 48 bytes to 24
> bytes, i.e another 50% reduction in memory overhead from v2.
Honestly, if you keep reducing it, you might just end up
reimplementing the swap table format :)
> This patch series is based on 6.19. There are a couple more
> swap-related changes in the mm-stable branch that I would need to
> coordinate with, but I would like to send this out as an update, to show
> that the lock contention issues that plagued earlier versions have been
> resolved and performance on the kernel build benchmark is now on-par with
> baseline. Furthermore, memory overhead has been substantially reduced
> compared to the last RFC version.
Thanks for the effort!
> * Operationally, static provisioning of the swapfile for zswap poses
> significant challenges, because the sysadmin has to prescribe how
> much swap is needed a priori, for each combination of
> (memory size x disk space x workload usage). It is even more
> complicated when we take into account the variance of memory
> compression, which changes the reclaim dynamics (and as a result,
> swap space size requirement). The problem is further exacerbated for
> users who rely on swap utilization (and exhaustion) as an OOM signal.
So I thought about it again, and this one seems not to be an issue. In
most cases, having a 1:1 virtual swap setup is enough, and very soon
the static overhead will be really trivial. There won't be any
fragmentation issue either, since if the physical memory size is
identical to the swap space, you can always find a matching part. And
besides, dynamic growth of swap files is actually very doable and
useful; that would make physical swap files adjustable at runtime, so
users won't need to waste a swap type id to extend physical swap
space.
> * Another motivation is to simplify swapoff, which is both complicated
> and expensive in the current design, precisely because we are storing
> an encoding of the backend positional information in the page table,
> and thus requires a full page table walk to remove these references.
The swapoff here is not really a clean swapoff: minor faults will
still be triggered afterwards, and the metadata is not released. So this
new swapoff cannot really guarantee the same performance as the old
swapoff. And on the other hand, we could already just read everything
into the swap cache and skip the page table walk with the old
design too; that's just not a clean swapoff.
> struct swp_desc {
> union {
> swp_slot_t slot; /* 0 8 */
> struct zswap_entry * zswap_entry; /* 0 8 */
> }; /* 0 8 */
> union {
> struct folio * swap_cache; /* 8 8 */
> void * shadow; /* 8 8 */
> }; /* 8 8 */
> unsigned int swap_count; /* 16 4 */
> unsigned short memcgid:16; /* 20: 0 2 */
> bool in_swapcache:1; /* 22: 0 1 */
A standalone bit for the swap cache looks like the old SWAP_HAS_CACHE,
which caused many issues (see the old encoding sketched below the struct)...
>
> /* Bitfield combined with previous fields */
>
> enum swap_type type:2; /* 20:17 4 */
>
> /* size: 24, cachelines: 1, members: 6 */
> /* bit_padding: 13 bits */
> /* last cacheline: 24 bytes */
> };
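For reference, the old design packs the cache state as a flag bit into
the per-slot swap_map count byte, roughly like this (values as in
include/linux/swap.h, comments mine):

#define SWAP_HAS_CACHE	0x40	/* slot also has a swap cache page */
#define COUNT_CONTINUED	0x80	/* swap count continued in extension map */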
Having a struct larger than 8 bytes means you can't load it
atomically, which limits your lock design. About a year ago Chris
shared with me an idea to use CAS on swap entries once they are small
and unified; that's why the swap table uses atomic_long_t and has
helpers like __swap_table_xchg (we are not making good use of them yet,
though). Meanwhile, we have already consolidated the lock scope to the
folio in many places; holding the folio lock and then doing the CAS,
without touching the cluster lock at all, might soon be feasible for
many swap operations.
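To illustrate the idea, something like this could be built on top of
the atomic_long_t entries (just a sketch, not the actual swap table
API; __swap_table_try_cmpxchg is a made-up name here):

#include <linux/atomic.h>

/*
 * Hypothetical sketch: update an 8-byte swap table entry only if it
 * still holds the expected value. The whole entry fits in one
 * atomic_long_t, so the update itself needs no cluster lock; the
 * folio lock would still serialize the callers.
 */
static inline bool __swap_table_try_cmpxchg(atomic_long_t *ent,
					    unsigned long old,
					    unsigned long new)
{
	long expected = old;

	return atomic_long_try_cmpxchg(ent, &expected, new);
}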
E.g. we already have a cluster-lockless version of the swap check in swap table p3:
https://lore.kernel.org/linux-mm/20260128-swap-table-p3-v2-11-fe0b67ef0215@tencent.com/
That might also greatly simplify the locking for IO and improve
migration performance between swap devices.
> Doing the same math for the disk swap, which is the worst case for
> virtual swap in terms of swap backends:
Actually this worst case is a very common case... see below.
> 0% usage, or 0 entries: 0.00 MB
> * Old design total overhead: 25.00 MB
> * Vswap total overhead: 2.00 MB
>
> 25% usage, or 2,097,152 entries:
> * Old design total overhead: 41.00 MB
> * Vswap total overhead: 66.25 MB
>
> 50% usage, or 4,194,304 entries:
> * Old design total overhead: 57.00 MB
> * Vswap total overhead: 130.50 MB
>
> 75% usage, or 6,291,456 entries:
> * Old design total overhead: 73.00 MB
> * Vswap total overhead: 194.75 MB
>
> 100% usage, or 8,388,608 entries:
> * Old design total overhead: 89.00 MB
> * Vswap total overhead: 259.00 MB
>
> The added overhead is 170MB, which is 0.5% of the total swapfile size,
> again in the worst case when we have a sizing oracle.
Hmm... with the swap table we will have a stable 8 bytes per slot in
all cases. In current mm-stable we use 11 bytes (8 bytes dynamic and 3
bytes static), and in the posted p3 we already get 10 bytes (8 bytes
dynamic and 2 bytes static). P4 or a follow-up was already demonstrated
last year with working code, and it makes everything dynamic
(8 bytes, fully dynamic; I'll rebase and send that once p3 is merged).
So with mm-stable and follow-ups, for a 32G swap device:

0% usage, or 0/8,388,608 entries:
* mm-stable total overhead: 25.50 MB (which is swap table p2)
* swap-table p3 overhead: 17.50 MB
* swap-table p4 overhead: 0.50 MB
* Vswap total overhead: 2.00 MB
100% usage, or 8,388,608/8,388,608 entries:
* mm-stable total overhead: 89.5 MB (which is swap table p2)
* swap-table p3 overhead: 81.5 MB
* swap-table p4 overhead: 64.5 MB
* Vswap total overhead: 259.00 MB
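(To sanity-check these numbers: 8,388,608 slots x 8 bytes = 64 MB, which
is exactly the 0% -> 100% growth of all three swap table variants above.
For vswap, (259.00 - 2.00) MB spread over 8,388,608 entries comes to
roughly 32 bytes per entry, i.e. the 24-byte descriptor plus an 8-byte
reverse-mapping swap table entry.)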
That's 3-4 times more memory usage, quite a trade-off. With a
128G device, which is not something rare, it would be 1G of memory;
swap table p3 / p4 would be about 320M / 256M. And we do have a way to
cut that down to close to <1 byte or 3 bytes per page with swap table
compaction, which was discussed at LSFMM last year, or even 1 bit,
which was once suggested by Baolin; that would make it much smaller,
down to <24MB (this is just an idea for now, but the compaction is
very doable, as we already have "LRU"s for swap clusters in the swap
allocator).
I don't think this looks good as a mandatory overhead. We have a huge
user base of swap over many different kinds of devices; not long ago,
two new kernel Bugzilla issues or bug reports about swap over disk were
sent to the mailing list, and I'm still trying to investigate one of
them, which actually seems to be a page LRU issue and not a swap
problem... OK, a little off topic. Anyway, I'm not saying that we don't
want more features; as I mentioned above, it would be better if this
could be optional and minimal. See more test info below.
> We actually see a slight improvement in systime (by 1.5%) :) This is
> likely because we no longer have to perform swap charging for zswap
> entries, and virtual swap allocator is simpler than that of physical
> swap.
Congrats! Yeah, I guess that's because vswap has a smaller lock scope
than zswap, with a reduced callpath?
>
> Using SSD swap as the backend:
>
> Baseline:
> real: mean: 200.3s, stdev: 2.33s
> sys: mean: 489.88s, stdev: 9.62s
>
> Vswap:
> real: mean: 201.47s, stdev: 2.98s
> sys: mean: 487.36s, stdev: 5.53s
>
> The performance is neck-to-neck.
Thanks for the bench, but please also test with global pressure.
One mistake I made when working on the prototype of swap tables was
focusing only on cgroup memory pressure, which is really not how
everyone uses Linux; that's why I spent a long time reworking it to
tweak the RCU allocation / freeing of swap table pages, so there won't
be any regression even for low-end machines and global pressure. That's
kind of critical for devices like Android.
I did an overnight bench on this with global pressure, comparing to
mainline 6.19 and swap table p3 (I include such a test for each swap
table series; p2 / p3 are close, so I just rebased the latest p3 on top
of your base commit, to be fair, and that's easier for me too), and it
doesn't look that good.
Test machine setup for vm-scalability:
# lscpu | grep "Model name"
Model name: AMD EPYC 7K62 48-Core Processor
# free -m
               total        used        free      shared  buff/cache   available
Mem:           31582         909       26388           8        4284       29989
Swap:          40959          41       40918
The swap setup follows the recommendation from Huang
(https://lore.kernel.org/linux-mm/87ed474kvx.fsf@yhuang6-desk2.ccr.corp.intel.com/).
Test (average of 18 test runs):
vm-scalability/usemem --init-time -O -y -x -n 1 56G
6.19:
Throughput: 618.49 MB/s (stdev 31.3)
Free latency: 5754780.50us (stdev 69542.7)
swap-table-p3 (throughput 3.8% better, free latency 0.5% better):
Throughput: 642.02 MB/s (stdev 25.1)
Free latency: 5728544.16us (stdev 48592.51)
vswap (throughput 3.2% worse, free latency 244% of baseline):
Throughput: 598.67 MB/s (stdev 25.1)
Free latency: 13987175.66us (stdev 125148.57)
That's a huge regression in freeing. I have a vm-scalability test
matrix; not every setup shows such a significant >200% regression, but
on average the freeing time is at least 15-50% slower (for example,
with /data/vm-scalability/usemem --init-time -O -y -x -n 32 1536M
the regression is about 2583221.62us vs 2153735.59us). Throughput is
lower across the board too.
Freeing is important, as it was causing many problems before; it's the
reason why we had a swap slot freeing cache years ago (we later removed
it, since the freeing cache caused more problems and the swap allocator
already improved on it beyond what the cache offered). People even
tried to optimize that:
https://lore.kernel.org/linux-mm/20250909065349.574894-1-liulei.rjpt@vivo.com/
(this seems to be an already-fixed downstream issue, solved by the swap
allocator or swap table). Some workloads might amplify the free latency
greatly and cause serious lag, as shown above.
Another thing I personally care about is how swap works on my daily
laptop :). Building the kernel in a 2G test VM using NVMe as swap is
a very practical workload I run every day, and the result is also not
good (average of 8 test runs, make -j12):
# free -m
               total        used        free      shared  buff/cache   available
Mem:            1465         216        1026           0         300        1248
Swap:           4095          36        4059
6.19 systime:
109.6s
swap-table p3 systime:
108.9s
vswap systime:
118.7s
On a build server, it's also slower (make -j48 in a 4G memory VM with
NVMe swap, average of 10 test runs):
# free -m
               total        used        free      shared  buff/cache   available
Mem:            3877        1444        2019         737        1376        2432
Swap:          32767        1886       30881
# lscpu | grep "Model name"
Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
6.19 systime:
435.601s
swap-table p3 systime:
432.793s
vswap systime:
455.652s
In conclusion, it's about 4.3-8.3% slower for common workloads under
global pressure, and there is an up-to-200% regression on freeing. ZRAM
shows an even larger workload regression, but I'll skip that part since
your series is focusing on zswap now. Redis is also ~20% slower
compared to mm-stable (327515.00 RPS vs 405827.81 RPS); that's mostly
due to swap-table-p2 in mm-stable, so I didn't do further comparisons.

So if that's not a bug in this series, I think the double free, or the
decoupling of swap entries from the underlying slots, might be the
cause of the freeing regression shown above. That's really a serious
issue, and the global pressure result might be a critical issue too, as
the metadata is much larger and is already causing regressions for very
common workloads. Low-end users could hit the min watermark easily and
could see serious jitter or allocation failures.
That's part of the issues I've found, so I really do think we need a
flexible way to implement this rather than a mandatory layer. After
swap table p4 we should be able to figure out a way to fit all needs,
with a cleanly defined set of swap APIs, metadata and layers, as was
discussed at LSFMM last year.