linux-mm.kvack.org archive mirror
From: Nhat Pham <nphamcs@gmail.com>
To: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Chris Li <chrisl@kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Kairui Song <kasong@tencent.com>,
	 Kemeng Shi <shikemeng@huaweicloud.com>,
	Baoquan He <bhe@redhat.com>,  Barry Song <baohua@kernel.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	linux-mm@kvack.org,  Rik van Riel <riel@surriel.com>,
	linux-kernel@vger.kernel.org, pratmal@google.com,
	 sweettea@google.com, gthelen@google.com, weixugc@google.com
Subject: Re: [PATCH RFC] mm: ghost swapfile support for zswap
Date: Mon, 24 Nov 2025 12:24:29 -0800
Message-ID: <CAKEwX=Pq=9nLb+SrTXkBWH2yyoYzzOSJqdeASweFh+EpEokKzg@mail.gmail.com>
In-Reply-To: <2a8fd7bd35939b9aa4a7267c93e1fda995137966@linux.dev>

On Mon, Nov 24, 2025 at 11:32 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Mon, Nov 24, 2025 at 12:27:17PM -0500, Johannes Weiner wrote:
> > On Fri, Nov 21, 2025 at 05:52:09PM -0800, Chris Li wrote:
> > > On Fri, Nov 21, 2025 at 3:40 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Fri, Nov 21, 2025 at 01:31:43AM -0800, Chris Li wrote:
> > > > > > The current zswap requires a backing swapfile. The swap slot used
> > > > > > by zswap is not able to be used by the swapfile. That wastes swapfile
> > > > > > space.
> > > > > >
> > > > > > The ghost swapfile is a swapfile that only contains the swapfile header
> > > > > > for zswap. The swapfile header indicates the size of the swapfile. There
> > > > > > is no swap data section in the ghost swapfile, therefore, no waste of
> > > > > > swapfile space.  As such, any write to a ghost swapfile will fail. To
> > > > > > prevent accidental read or write of ghost swapfile, bdev of
> > > > > > swap_info_struct is set to NULL. Ghost swapfile will also set the SSD
> > > > > > flag because there is no rotational disk access when using zswap.
> > > >
> > > > Zswap is primarily a compressed cache for real swap on secondary
> > > > storage. It's indeed quite important that entries currently in zswap
> > > > don't occupy disk slots; but for a solution to this to be acceptable,
> > > > it has to work with the primary usecase and support disk writeback.
> > >
> > > Well, my plan is to support the writeback via swap.tiers.
> >
> > Do you have a link to that proposal?
> >
> > My understanding of swap tiers was about grouping different swapfiles
> > and assigning them to cgroups. The issue with writeback is relocating
> > the data that a swp_entry_t page table refers to - without having to
> > find and update all the possible page tables. I'm not sure how
> > swap.tiers solves this problem.
> >
> > > > This direction is a dead-end. Please take a look at Nhat's swap
> > > > virtualization patches. They decouple zswap from disk geometry, while
> > > > still supporting writeback to an actual backend file.
> > >
> > > Yes, there are many ways to decouple zswap from disk geometry, my swap
> > > table + swap.tiers design can do that as well. I have concerns about
> > > swap virtualization adding another layer of memory overhead per swap
> > > entry and the CPU overhead of an extra xarray lookup. I believe my
> > > approach is technically superior: both faster and cleaner. Basically
> > > swap.tiers + VFS-like swap read/write page ops. I will let Nhat
> > > clarify the performance and memory overhead side of the swap
> > > virtualization.
> >
> > I'm happy to discuss it.
> >
> > But keep in mind that the swap virtualization idea is a collaborative
> > product of quite a few people with an extensive combined upstream
> > record. Quite a bit of thought has gone into balancing static vs
> > runtime costs of that proposal. So you'll forgive me if I'm a bit
> > skeptical of the somewhat grandiose claims of one person that is new
> > to upstream development.
> >
> > As to your specific points - we use xarray lookups in the page cache
> > fast path. It's a bold claim to say this would be too much overhead
> > during swapins.
> >
> > Two, it's not clear to me how you want to make writeback efficient
> > *without* any sort of swap entry redirection. Walking all relevant
> > page tables is expensive; and you have to be able to find them first.
> >
> > If you're talking about a redirection array as opposed to a tree -
> > static sizing of the compressed space is also a no-go. Zswap
> > utilization varies *widely* between workloads and different workload
> > combinations. Further, zswap consumes the same fungible resource as
> > uncompressed memory - there is really no excuse to burden users with
> > static sizing questions about this pool.
>
> I think Chris's idea (and Chris, correct me if I am wrong) is that
> we use ghost swapfiles (that are not backed by disk space) for
> zswap. So zswap has its own swapfiles, separate from disk swapfiles.
>
> swap.tiers establishes the ordering between swapfiles, so you put
> "ghost" -> "real" to get today's zswap writeback behavior. When you
> write back, you keep page tables pointing at the swap entry in the ghost
> swapfile. What you do is:
> - Allocate a new swap entry in the "real" swapfile.
> - Update the swap table of the "ghost" swapfile to point at the swap
>   entry in the "real" swapfile, reusing the pointer used for the
>   swapcache.
>
> Then, on swapin, you read the swap table of the "ghost" swapfile, find
> the redirection, follow it to the swap table of the "real" swapfile, then
> read the page from disk into the swap cache. The redirection in the
> "ghost" swapfile will keep existing, wasting that slot, until all
> references to it are dropped.
>
> I think this might work for this specific use case, with less overhead
> than the xarray. BUT there are a few scenarios that are not covered
> AFAICT:

Thanks for explaining these issues better than I could :)

>
> - You still need to statically size the ghost swapfiles and their
>   overheads.

Yes.

>
> - Wasting a slot in the ghost swapfile for the redirection. This
>   complicates static provisioning a bit, because you have to account for
>   entries that will be in zswap as well as written back. Furthermore,
>   IIUC swap.tiers is intended to be generic and cover other use cases
>   beyond zswap like SSD -> HDD. For that, I think wasting a slot in the
>   SSD when we write back to the HDD is a much bigger problem.

Yep. We are trying to get away from static provisioning as much as we
can - this design digs us deeper into the hole. Who the hell knows what
the zswap:disk swap split is going to be? It's going to depend on
access patterns and compressibility.

>
> - We still cannot do swapoff efficiently as we need to walk the page
>   tables (and some swap tables) to find and swapin all entries in a
>   swapfile. Not as important as other things, but worth mentioning.

Yeah, I left swapoff out of it because it is just another use case.
But yes, we can't easily do swapoff efficiently either.

And in general, it's going to be a very rigid design for more
complicated backend changes (prefetching from one tier to another, or
compaction).
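
For anyone trying to picture the ghost -> real redirection described in
the quoted text, here is a rough userspace sketch in C. It is a toy
model under my own assumptions, not kernel code and not the format of
any posted patch: struct slot, struct swapfile, ghost_writeback() and
ghost_resolve() are all hypothetical names, and the real proposal
would reuse the swapcache pointer in the swap table rather than a
separate tagged field. It only shows that the ghost slot stays
occupied after writeback, which is the slot waste discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot states for a toy swap table (not the kernel layout). */
enum slot_kind { SLOT_FREE = 0, SLOT_ZSWAP, SLOT_REDIRECT, SLOT_DISK };

struct slot {
	enum slot_kind kind;
	uint32_t value; /* zswap handle, disk block, or index into the real swapfile */
};

struct swapfile {
	struct slot *table;
	size_t nr_slots; /* statically sized, as in the quoted design */
};

/* Linear-scan allocator; returns the slot index, or -1 if the file is full. */
static long swapfile_alloc(struct swapfile *sf, enum slot_kind kind, uint32_t value)
{
	for (size_t i = 0; i < sf->nr_slots; i++) {
		if (sf->table[i].kind == SLOT_FREE) {
			sf->table[i].kind = kind;
			sf->table[i].value = value;
			return (long)i;
		}
	}
	return -1;
}

/*
 * Writeback: page tables keep pointing at ghost_idx. We allocate a slot in
 * the real swapfile and turn the ghost slot into a redirection to it.
 */
static int ghost_writeback(struct swapfile *ghost, size_t ghost_idx,
			   struct swapfile *real, uint32_t disk_block)
{
	long real_idx;

	if (ghost_idx >= ghost->nr_slots ||
	    ghost->table[ghost_idx].kind != SLOT_ZSWAP)
		return -1;

	real_idx = swapfile_alloc(real, SLOT_DISK, disk_block);
	if (real_idx < 0)
		return -1;

	ghost->table[ghost_idx].kind = SLOT_REDIRECT;
	ghost->table[ghost_idx].value = (uint32_t)real_idx;
	return 0;
}

/* Swapin lookup: follow at most one redirection from the ghost table. */
static const struct slot *ghost_resolve(const struct swapfile *ghost, size_t ghost_idx,
					const struct swapfile *real)
{
	const struct slot *s;

	if (ghost_idx >= ghost->nr_slots)
		return NULL;
	s = &ghost->table[ghost_idx];
	if (s->kind == SLOT_REDIRECT)
		return s->value < real->nr_slots ? &real->table[s->value] : NULL;
	return s->kind == SLOT_ZSWAP ? s : NULL;
}

/* Count occupied slots, redirections included. */
static size_t swapfile_used(const struct swapfile *sf)
{
	size_t used = 0;

	for (size_t i = 0; i < sf->nr_slots; i++)
		used += sf->table[i].kind != SLOT_FREE;
	return used;
}
```

In this model, after one writeback both swapfile_used(&ghost) and
swapfile_used(&real) report an occupied slot for the same page, so the
ghost file has to be sized for zswap-resident plus written-back
entries.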

