From: Yosry Ahmed <yosryahmed@google.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, akpm@linux-foundation.org,
hannes@cmpxchg.org, ryncsn@gmail.com, chengming.zhou@linux.dev,
chrisl@kernel.org, linux-mm@kvack.org, kernel-team@meta.com,
linux-kernel@vger.kernel.org, shakeel.butt@linux.dev,
hch@infradead.org, hughd@google.com, 21cnbao@gmail.com,
usamaarif642@gmail.com
Subject: Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
Date: Fri, 17 Jan 2025 08:51:41 -0800
Message-ID: <CAJD7tkY+prnDUW3GRAgrVOPp27rCUtuEz8jCnr=cEkXndvqKCw@mail.gmail.com>
In-Reply-To: <CAKEwX=MEunn2zhPi1Rmof2V1ShWud03X2-vFAgFTRHLRw2R8PQ@mail.gmail.com>
On Thu, Jan 16, 2025 at 6:47 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > My apologies if I missed any interested party in the cc list -
> > > hopefully the mailing list ccs suffice :)
> > >
> > > I'd like to (re-)propose the topic of swap abstraction layer for the
> > > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > > (see [1], [2], [3]).
> > >
> > > (AFAICT, the same idea has been floated by Rik van Riel since at
> > > least 2011 - see [8]).
> > >
> > > I have a working(-ish) prototype, which hopefully will be
> > > submission-ready soon. For now, I'd like to give the motivation/context
> > > for the topic, as well as some high-level design:
> >
> > I would obviously be interested in attending this, albeit virtually if
> > possible. Just sharing some random thoughts below from my cold cache.
>
> Your inputs are always appreciated :)
>
> >
> > >
> > > I. Motivation
> > >
> > > Currently, when an anon page is swapped out, a slot in a backing swap
> > > device is allocated and stored in the page table entries that refer to
> > > the original page. This slot is also used as the "key" to find the
> > > swapped-out content, as well as the index into swap data structures,
> > > such as the swap cache or the swap cgroup mapping. Tying a swap entry
> > > to its backing slot in this way is performant and efficient when swap
> > > is purely disk space, and swapoff is rare.
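(Side note for folks newer to this area: today the swap entry *is* the
backing location. A simplified sketch of the existing encoding, modeled
on include/linux/swapops.h; the exact bit layout is arch- and
config-dependent, the shift below is illustrative:)

#define SWP_TYPE_SHIFT  57      /* illustrative; the real value varies */

/* Pack the swap device index and the on-disk slot offset into one word. */
static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
{
        return (swp_entry_t) { (type << SWP_TYPE_SHIFT) | offset };
}

static inline unsigned long swp_type(swp_entry_t entry)
{
        return entry.val >> SWP_TYPE_SHIFT;
}

static inline pgoff_t swp_offset(swp_entry_t entry)
{
        return entry.val & ((1UL << SWP_TYPE_SHIFT) - 1);
}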
> > >
> > > However, the advent of many swap optimizations has exposed major
> > > drawbacks of this design. The first problem is that we occupy a physical
> > > slot in the swap space, even for pages that are NEVER expected to hit
> > > the disk: pages compressed and stored in the zswap pool, zero-filled
> > > pages, or pages rejected by both of these optimizations when zswap
> > > writeback is disabled. This is arguably the central shortcoming of
> > > zswap:
> > > * In deployments where no disk space can be afforded for swap (such as
> > > mobile and embedded devices), users cannot adopt zswap, and are forced
> > > to use zram. This is confusing for users, and creates an extra burden
> > > for developers, who have to develop and maintain similar features for
> > > two separate swap backends (writeback, cgroup charging, THP support,
> > > etc.). For instance, see the discussion in [4].
> > > * Resource-wise, it is hugely wasteful in terms of disk usage, and
> > > limits the memory-saving potential of these optimizations to the
> > > static size of the swapfile, especially on high-memory systems that
> > > can have up to terabytes worth of memory. It also creates significant
> > > challenges for users who rely on swap utilization as an early OOM
> > > signal.
> > >
> > > Another motivation for a swap redesign is to simplify swapoff, which
> > > is complicated and expensive in the current design. The tight coupling
> > > between a swap entry and its backing storage means that swapoff
> > > requires a full page table walk to update all the page table entries
> > > that refer to this swap entry, as well as updating all the associated
> > > swap data structures (swap cache, etc.).
> > >
> > >
> > > II. High Level Design Overview
> > >
> > > To fix the aforementioned issues, we need an abstraction that separates
> > > a swap entry from its physical backing storage. IOW, we need to
> > > “virtualize” the swap space: swap clients will work with a virtual
> > > swap slot (dynamically allocated on demand), store it in page table
> > > entries, and use it to index into various swap-related data
> > > structures.
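One thing worth discussing at the session is how cheap this resolution
step can be made. A rough sketch of what I picture, assuming the
per-entry descriptors (swp_desc below) live in an XArray keyed by the
virtual slot value; all names here are illustrative, not from the
prototype:

static DEFINE_XARRAY(vswap_descs);      /* hypothetical: vswap.val -> swp_desc */

static struct swp_desc *vswap_to_desc(swp_entry_t vswap)
{
        struct swp_desc *desc;

        rcu_read_lock();
        desc = xa_load(&vswap_descs, vswap.val);
        /* Descriptors are RCU-freed; only hand out a live reference. */
        if (desc && !kref_get_unless_zero(&desc->refcnt))
                desc = NULL;
        rcu_read_unlock();
        return desc;
}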
> > >
> > > The backing storage is decoupled from this slot, and the newly
> > > introduced layer will “resolve” the ID to the actual storage and
> > > cooperate with the swap cache to handle all the required
> > > synchronization. This layer also manages other metadata of the swap
> > > entry, such as its lifetime information (swap count), via a dynamically
> > > allocated per-entry swap descriptor:
> >
> > Do you plan to allocate one per-folio or per-page? I suppose it's
> > per-page based on the design, but I am wondering if you explored
> > having it per-folio. To make it work, we'd need to support splitting a
> > swp_desc, and to figure out which slot or zswap_entry corresponds to a
> > given page in a folio.
>
> Per-page, for now. Per-folio requires allocating these swp_descs on
> huge page splitting etc., which is more complex.
We'd also need to allocate them during swapin. If a folio is swapped
out as a 16K chunk with a single swp_desc, and we then try to swap in
one 4K page in the middle, we may need to split the swp_desc in two.
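Roughly (purely illustrative, assuming a hypothetical per-desc "order"
field covering a power-of-two run of backing slots, buddy-style;
locking, refcount handling, and the zswap_entry fixups are all elided):

static struct swp_desc *swp_desc_split(struct swp_desc *desc)
{
        struct swp_desc *tail = kmalloc(sizeof(*tail), GFP_KERNEL);

        if (!tail)
                return NULL;
        *tail = *desc;
        desc->order--;
        tail->order = desc->order;
        /* The second half starts halfway into the original slot run. */
        tail->slot.val = desc->slot.val + (1UL << desc->order);
        return tail;
}

Repeat until the page being swapped in sits in an order-0 desc.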
>
> And yeah, we need to chain these zswap_entry's somehow. Not impossible
> certainly, but more overhead and more complexity :)
>
> >
> > >
> > > struct swp_desc {
> > >         swp_entry_t vswap;
> > >         union {
> > >                 swp_slot_t slot;
> > >                 struct folio *folio;
> > >                 struct zswap_entry *zswap_entry;
> > >         };
> > >         struct rcu_head rcu;
> > >
> > >         rwlock_t lock;
> > >         enum swap_type type;
> > >
> > > #ifdef CONFIG_MEMCG
> > >         atomic_t memcgid;
> > > #endif
> > >
> > >         atomic_t in_swapcache;
> > >         struct kref refcnt;
> > >         atomic_t swap_count;
> > > };
> >
> > That seems a bit large. I am assuming this is for the purpose of the
> > prototype and we can reduce its size eventually, right?
>
> Yup. I copied and pasted this from the prototype. Originally I
> squeezed all the state (in_swapcache and the swap type) into an
> integer-typed "flags" field plus a separate swap count field, and
> protected them all with a single rwlock. That got really
> ugly/confusing, so for the sake of the prototype I separated them all
> out into their own fields, and played with atomicity to see if it's
> possible to do things locklessly. So far so good (i.e., no crashes
> yet), but the final form is TBD :) Maybe we can discuss in closer
> detail once I send out this prototype as an RFC?
Yeah, I just had some passing comments.
>
> (I will say, though, that it looks cleaner when all these fields are
> separated. So it's going to be a tradeoff in that sense too.)
It's a tradeoff, but I think we should be able to hide a lot of the
complexity behind neat helpers. It's not pretty, but I think the memory
overhead is an important factor here.
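Back-of-envelope, in case it helps the discussion: if the struct as
written lands around 64 bytes after padding, one descriptor per 4K
swapped-out page is 64/4096 ~= 1.6% overhead, i.e. on the order of
16GB of descriptors for 1TB of swapped-out memory. (Rough numbers;
rwlock_t and rcu_head sizes vary by config.)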
>
> >
> > Particularly, I remember looking into merging the swap_count and
> > refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> > can't we use a bit from swap_count?).
>
> Yup. That's a single bit - it's a (partial) replacement for
> SWAP_HAS_CACHE state in the existing swap map.
>
> No particular reason why we can't squeeze it into the swap count other
> than clarity :) It's going to be a bit annoying working with swap
> count values (a swap count increment becomes += 2 instead of ++, etc.).
Nothing a nice helper cannot hide :)
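Something like this (a rough sketch, stealing the low bit of swap_count
for the cache flag; all names are illustrative):

#define VSWAP_HAS_CACHE         1
#define VSWAP_COUNT_SHIFT       1

static inline bool swp_desc_in_swapcache(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) & VSWAP_HAS_CACHE;
}

static inline int swp_desc_swap_count(struct swp_desc *desc)
{
        return atomic_read(&desc->swap_count) >> VSWAP_COUNT_SHIFT;
}

static inline void swp_desc_swap_count_inc(struct swp_desc *desc)
{
        /* The logical count lives above the flag bit, so step by 2. */
        atomic_add(1 << VSWAP_COUNT_SHIFT, &desc->swap_count);
}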