linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Virtual Swap Space
@ 2025-01-16  9:22 Nhat Pham
  2025-01-16  9:29 ` Nhat Pham
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Nhat Pham @ 2025-01-16  9:22 UTC (permalink / raw)
  To: lsf-pc, akpm, hannes
  Cc: ryncsn, chengming.zhou, yosryahmed, chrisl, linux-mm,
	kernel-team, linux-kernel, shakeel.butt, hch, hughd, 21cnbao,
	usamaarif642

My apologies if I missed any interested party in the cc list -
hopefully the mailing list cc's suffice :)

I'd like to (re-)propose the topic of a swap abstraction layer for the
conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
(see [1], [2], [3]).

(AFAICT, the same idea has been floated by Rik van Riel since at
least 2011 - see [8]).

I have a working(-ish) prototype, which hopefully will be
submission-ready soon. For now, I'd like to give the motivation/context
for the topic, as well as some high level design:

I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index into swap data structures, such
as the swap cache or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is purely
disk space and swapoff is rare.
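
To make the coupling concrete, here is a tiny toy model of the status
quo (self-contained userspace C; this is not the kernel's actual
swp_entry_t encoding, and the bit layout and names are made up):

#include <stdint.h>
#include <stdio.h>

/* Toy model: a swapped-out PTE holds (swap device type, slot offset). */
typedef struct { uint64_t val; } toy_swp_entry_t;

#define TOY_TYPE_SHIFT 58        /* arbitrary split, for illustration only */

static toy_swp_entry_t toy_swp_entry(unsigned int type, uint64_t offset)
{
        toy_swp_entry_t e = { ((uint64_t)type << TOY_TYPE_SHIFT) | offset };
        return e;
}

static unsigned int toy_swp_type(toy_swp_entry_t e)
{
        return e.val >> TOY_TYPE_SHIFT;
}

static uint64_t toy_swp_offset(toy_swp_entry_t e)
{
        return e.val & ((1ULL << TOY_TYPE_SHIFT) - 1);
}

int main(void)
{
        /*
         * The same value is stored in every PTE that maps the page and is
         * used as the key into the swap cache and swap cgroup structures,
         * so the backing slot cannot change without rewriting all of them.
         */
        toy_swp_entry_t e = toy_swp_entry(0, 1234);

        printf("type=%u offset=%llu\n", toy_swp_type(e),
               (unsigned long long)toy_swp_offset(e));
        return 0;
}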

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
  mobile and embedded devices), users cannot adopt zswap and are forced
  to use zram. This is confusing for users, and creates extra burdens
  for developers, who have to develop and maintain similar features for
  two separate swap backends (writeback, cgroup charging, THP support,
  etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage, and
  limits the memory-saving potential of these optimizations to the
  static size of the swapfile, especially in high-memory systems that
  can have up to terabytes worth of memory. It also creates significant
  challenges for users who rely on swap utilization as an early OOM
  signal.

Another motivation for a swap redesign is to simplify swapoff, which
is complicated and expensive in the current design. Tight coupling
between a swap entry and its backing storage means that swapoff requires
a full page table walk to update all the page table entries that refer to
this swap entry, as well as updating all the associated swap data
structures (swap cache, etc.).


II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a virtual swap
slot (that is dynamically allocated on-demand), storing it in page
table entries, and using it to index into various swap-related data
structures.

The backing storage is decoupled from this slot, and the newly
introduced layer will “resolve” the ID to the actual storage, as well
as cooperate with the swap cache to handle all the required
synchronization. This layer also manages other metadata of the swap
entry, such as its lifetime information (swap count), via a dynamically
allocated per-entry swap descriptor:

struct swp_desc {
	swp_entry_t vswap;
	union {
		swp_slot_t slot;
		struct folio *folio;
		struct zswap_entry *zswap_entry;
	};
	struct rcu_head rcu;

	rwlock_t lock;
	enum swap_type type;

#ifdef CONFIG_MEMCG
	atomic_t memcgid;
#endif

	atomic_t in_swapcache;
	struct kref refcnt;
	atomic_t swap_count;
};
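
To illustrate, a rough self-contained sketch of how a lookup through
this layer could go (toy userspace C; the table, helper names and the
missing locking are simplifications, not the prototype's actual API):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy mirror of the backends in the union above. */
enum toy_swap_type { TOY_SWAPFILE, TOY_ZSWAP, TOY_ZERO, TOY_FOLIO };

struct toy_swp_desc {
        enum toy_swap_type type;
        union {
                uint64_t slot;          /* slot on a physical swapfile */
                void *zswap_entry;      /* compressed copy in the zswap pool */
                void *folio;            /* page (back) in memory */
        };
};

/* Toy virtual swap space: the vswap id is simply an index here; the real
 * layer would use something like an xarray keyed by the virtual slot. */
static struct toy_swp_desc *toy_vswap_table[1024];

static struct toy_swp_desc *toy_vswap_resolve(uint32_t vswap)
{
        return toy_vswap_table[vswap];
}

int main(void)
{
        struct toy_swp_desc *d = malloc(sizeof(*d));

        /* Page swapped out to zswap: no swapfile slot is consumed at all. */
        d->type = TOY_ZSWAP;
        d->zswap_entry = NULL;  /* would point into the zswap pool */
        toy_vswap_table[42] = d;

        /* The PTEs only ever store "42"; the backend is resolved here. */
        printf("vswap 42 is backed by type %d\n", toy_vswap_resolve(42)->type);

        free(d);
        return 0;
}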


This design allows us to:
* Decouple zswap (and zero-mapped swap entries) from the backing
  swapfile: simply associate the swap ID with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or a page in memory.
* Simplify and optimize swapoff: we only have to fault the page in and
  have the swap ID point to the page instead of the on-disk swap slot.
  No need to perform any page table walking :) (see the sketch right
  after this list).
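
A minimal sketch of that last point (again toy, self-contained C; in
reality this would be a descriptor update under the appropriate
synchronization, coordinated with the swap cache):

#include <stdio.h>

enum toy_backing { TOY_SWAPFILE_SLOT, TOY_IN_MEMORY_FOLIO };

struct toy_desc {
        enum toy_backing backing;
        unsigned long value;    /* stands in for the slot/folio union */
};

/*
 * Old design: swapoff walks every page table to rewrite the PTEs that
 * point at the on-disk slot. New design: the PTEs keep the same virtual
 * slot, and only the descriptor behind it is retargeted.
 */
static void toy_swapoff_one(struct toy_desc *d, unsigned long folio)
{
        d->backing = TOY_IN_MEMORY_FOLIO;
        d->value = folio;
}

int main(void)
{
        struct toy_desc d = { TOY_SWAPFILE_SLOT, 1234 };

        toy_swapoff_one(&d, 0xabcd);    /* pretend folio address */
        printf("backing=%d value=%lu\n", d.backing, d.value);
        return 0;
}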

III. Future Use Cases

Other than decoupling swap backends and optimizing swapoff, this new
design allows us to implement the following more easily and
efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent transfer
  (promotion/demotion) of pages across tiers (see [8] and [9]). As with
  swapoff, the old design would require an expensive page table walk
  for this.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).
* Mixed-backing THP swapin (see [7]): once the backing stores of a THP
  have been pinned down, each range of subpages can be dispatched to
  the appropriate page-in handler.

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/



* Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
  2025-01-16  9:22 [LSF/MM/BPF TOPIC] Virtual Swap Space Nhat Pham
@ 2025-01-16  9:29 ` Nhat Pham
  2025-01-16 18:47 ` Yosry Ahmed
  2025-03-28 14:41 ` Nhat Pham
  2 siblings, 0 replies; 7+ messages in thread
From: Nhat Pham @ 2025-01-16  9:29 UTC (permalink / raw)
  To: lsf-pc, akpm, hannes
  Cc: ryncsn, chengming.zhou, yosryahmed, chrisl, linux-mm,
	kernel-team, linux-kernel, shakeel.butt, hch, hughd, 21cnbao,
	usamaarif642

On Thu, Jan 16, 2025 at 4:22 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>
> The backing storage is decoupled from this slot, and the newly
> introduced layer will “resolve” the ID to the actual storage, as well

and to be perfectly clear, "ID" in this proposal means the virtual swap
slot (rather than the physical swap slot). I'll stick to the virtual
swap slot terminology in future communications - apologies for any
confusion :)



* Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
  2025-01-16  9:22 [LSF/MM/BPF TOPIC] Virtual Swap Space Nhat Pham
  2025-01-16  9:29 ` Nhat Pham
@ 2025-01-16 18:47 ` Yosry Ahmed
  2025-01-17  2:47   ` Nhat Pham
  2025-03-28 14:41 ` Nhat Pham
  2 siblings, 1 reply; 7+ messages in thread
From: Yosry Ahmed @ 2025-01-16 18:47 UTC (permalink / raw)
  To: Nhat Pham
  Cc: lsf-pc, akpm, hannes, ryncsn, chengming.zhou, chrisl, linux-mm,
	kernel-team, linux-kernel, shakeel.butt, hch, hughd, 21cnbao,
	usamaarif642

On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> My apologies if I missed any interested party in the cc list -
> hopefully the mailing lists cc's suffice :)
>
> I'd like to (re-)propose the topic of swap abstraction layer for the
> conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> (see [1], [2], [3]).
>
> (AFAICT, the same idea has been floated by Rik van Riel since at
> least 2011 - see [8]).
>
> I have a working(-ish) prototype, which hopefully will be
> submission-ready soon. For now, I'd like to give the motivation/context
> for the topic, as well as some high level design:

I would obviously be interested in attending this, albeit virtually if
possible. Just sharing some random thoughts below from my cold cache.

>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is the arguably central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory saving potentials of these optimizations by the
>   static size of the swapfile, especially in high memory systems that
>   can have up to terabytes worth of memory. It also creates significant
>   challenges for users who rely on swap utilization as an early OOM
>   signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that it requires a
> whole page table walk to update all the page table entries that refer to
> this swap entry, as well as updating all the associated swap data
> structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a virtual swap
> slot (that is dynamically allocated on-demand), storing it in page
> table entries, and using it to index into various swap-related data
> structures.
>
> The backing storage is decoupled from this slot, and the newly
> introduced layer will “resolve” the ID to the actual storage, as well
> as cooperating with the swap cache to handle all the required
> synchronization. This layer also manages other metadata of the swap
> entry, such as its lifetime information (swap count), via a dynamically
> allocated per-entry swap descriptor:

Do you plan to allocate one per-folio or per-page? I suppose it's
per-page based on the design, but I am wondering if you explored
having it per-folio. To make it work we'd need to support splitting a
swp_desc, and figuring out which slot or zswap_entry corresponds to a
certain page in a folio.

>
> struct swp_desc {
>         swp_entry_t vswap;
>         union {
>                 swp_slot_t slot;
>                 struct folio *folio;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;
>
> #ifdef CONFIG_MEMCG
>         atomic_t memcgid;
> #endif
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;
> };

That seems a bit large. I am assuming this is for the purpose of the
prototype and we can reduce its size eventually, right?

Particularly, I remember looking into merging the swap_count and
refcnt, and I am not sure what in_swapcache is (is this a bit? Why
can't we use a bit from swap_count?).

I also think we can shove the swap_type in the low bits of the
pointers (with some finesse for swp_slot_t), and the locking could be
made less granular (I remember exploring going completely lockless,
but I don't remember how that turned out).
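
Roughly what I have in mind, purely as an illustration (swp_slot_t is
not a pointer, so it would need the finesse mentioned above; the names
and tag values here are made up):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Slab allocations are at least word-aligned, so the bottom two bits of
 * the pointer are free to encode one of the four backend types. */
enum toy_tag { TAG_SWAPFILE = 0, TAG_ZSWAP = 1, TAG_ZERO = 2, TAG_FOLIO = 3 };

static uintptr_t toy_tag_ptr(void *p, enum toy_tag t)
{
        assert(((uintptr_t)p & 3) == 0);  /* relies on >= 4-byte alignment */
        return (uintptr_t)p | t;
}

static enum toy_tag toy_ptr_tag(uintptr_t v)
{
        return (enum toy_tag)(v & 3);
}

static void *toy_untag_ptr(uintptr_t v)
{
        return (void *)(v & ~(uintptr_t)3);
}

int main(void)
{
        static long dummy_zswap_entry;  /* stand-in for a real zswap_entry */
        uintptr_t v = toy_tag_ptr(&dummy_zswap_entry, TAG_ZSWAP);

        printf("tag=%d ptr=%p\n", (int)toy_ptr_tag(v), toy_untag_ptr(v));
        return 0;
}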

>
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>   simply associate the swap ID with one of the supported backends: a
>   zswap entry, a zero-filled swap page, a slot on the swapfile, or a
>   page in memory .
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the swap ID points to the page instead of the on-disk swap slot.
>   No need to perform any page table walking :)

It also allows us to delete the complex swap count continuation code.

>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>   backing store of THPs, then you can dispatch each range of subpages
>   to appropriate pagein handler.
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/



* Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
  2025-01-16 18:47 ` Yosry Ahmed
@ 2025-01-17  2:47   ` Nhat Pham
  2025-01-17  3:09     ` Nhat Pham
  2025-01-17 16:51     ` Yosry Ahmed
  0 siblings, 2 replies; 7+ messages in thread
From: Nhat Pham @ 2025-01-17  2:47 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, akpm, hannes, ryncsn, chengming.zhou, chrisl, linux-mm,
	kernel-team, linux-kernel, shakeel.butt, hch, hughd, 21cnbao,
	usamaarif642

On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > My apologies if I missed any interested party in the cc list -
> > hopefully the mailing lists cc's suffice :)
> >
> > I'd like to (re-)propose the topic of swap abstraction layer for the
> > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > (see [1], [2], [3]).
> >
> > (AFAICT, the same idea has been floated by Rik van Riel since at
> > least 2011 - see [8]).
> >
> > I have a working(-ish) prototype, which hopefully will be
> > submission-ready soon. For now, I'd like to give the motivation/context
> > for the topic, as well as some high level design:
>
> I would obviously be interested in attending this, albeit virtually if
> possible. Just sharing some random thoughts below from my cold cache.

Your inputs are always appreciated :)

>
> >
> > I. Motivation
> >
> > Currently, when an anon page is swapped out, a slot in a backing swap
> > device is allocated and stored in the page table entries that refer to
> > the original page. This slot is also used as the "key" to find the
> > swapped out content, as well as the index to swap data structures, such
> > as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> > backing slot in this way is performant and efficient when swap is purely
> > just disk space, and swapoff is rare.
> >
> > However, the advent of many swap optimizations has exposed major
> > drawbacks of this design. The first problem is that we occupy a physical
> > slot in the swap space, even for pages that are NEVER expected to hit
> > the disk: pages compressed and stored in the zswap pool, zero-filled
> > pages, or pages rejected by both of these optimizations when zswap
> > writeback is disabled. This is the arguably central shortcoming of
> > zswap:
> > * In deployments when no disk space can be afforded for swap (such as
> >   mobile and embedded devices), users cannot adopt zswap, and are forced
> >   to use zram. This is confusing for users, and creates extra burdens
> >   for developers, having to develop and maintain similar features for
> >   two separate swap backends (writeback, cgroup charging, THP support,
> >   etc.). For instance, see the discussion in [4].
> > * Resource-wise, it is hugely wasteful in terms of disk usage, and
> >   limits the memory saving potentials of these optimizations by the
> >   static size of the swapfile, especially in high memory systems that
> >   can have up to terabytes worth of memory. It also creates significant
> >   challenges for users who rely on swap utilization as an early OOM
> >   signal.
> >
> > Another motivation for a swap redesign is to simplify swapoff, which
> > is complicated and expensive in the current design. Tight coupling
> > between a swap entry and its backing storage means that it requires a
> > whole page table walk to update all the page table entries that refer to
> > this swap entry, as well as updating all the associated swap data
> > structures (swap cache, etc.).
> >
> >
> > II. High Level Design Overview
> >
> > To fix the aforementioned issues, we need an abstraction that separates
> > a swap entry from its physical backing storage. IOW, we need to
> > “virtualize” the swap space: swap clients will work with a virtual swap
> > slot (that is dynamically allocated on-demand), storing it in page
> > table entries, and using it to index into various swap-related data
> > structures.
> >
> > The backing storage is decoupled from this slot, and the newly
> > introduced layer will “resolve” the ID to the actual storage, as well
> > as cooperating with the swap cache to handle all the required
> > synchronization. This layer also manages other metadata of the swap
> > entry, such as its lifetime information (swap count), via a dynamically
> > allocated per-entry swap descriptor:
>
> Do you plan to allocate one per-folio or per-page? I suppose it's
> per-page based on the design, but I am wondering if you explored
> having it per-folio. To make it work we'd need to support splitting a
> swp_desc, and figuring out which slot or zswap_entry corresponds to a
> certain page in a folio

Per-page, for now. Per-folio requires allocating these swp_descs on
huge page splitting etc., which is more complex.

And yeah, we need to chain these zswap_entry's somehow. Not impossible
certainly, but more overhead and more complexity :)

>
> >
> > struct swp_desc {
> >         swp_entry_t vswap;
> >         union {
> >                 swp_slot_t slot;
> >                 struct folio *folio;
> >                 struct zswap_entry *zswap_entry;
> >         };
> >         struct rcu_head rcu;
> >
> >         rwlock_t lock;
> >         enum swap_type type;
> >
> > #ifdef CONFIG_MEMCG
> >         atomic_t memcgid;
> > #endif
> >
> >         atomic_t in_swapcache;
> >         struct kref refcnt;
> >         atomic_t swap_count;
> > };
>
> That seems a bit large. I am assuming this is for the purpose of the
> prototype and we can reduce its size eventually, right?

Yup. I copied and pasted this from the prototype. Originally I
squeezed all the state (in_swapcache and the swap type) into an
integer-type "flag" field + 1 separate swap count field, and protected
them all with a single rw lock. That got really ugly/confusing, so
for the sake of the prototype I just separated them all out into their
own fields, and played with atomicity to see if it's possible to do
things locklessly. So far so good (i.e. no crashes yet), but the final
form is TBD :) Maybe we can discuss in closer detail once I send out
this prototype as an RFC?

(I will say though it looks cleaner when all these fields are
separated. So it's going to be a tradeoff in that sense too).

>
> Particularly, I remember looking into merging the swap_count and
> refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> can't we use a bit from swap_count?).

Yup. That's a single bit - it's a (partial) replacement for
SWAP_HAS_CACHE state in the existing swap map.

No particular reason why we can't squeeze it into swap counts other
than clarity :) It's going to be a bit annoying working with swap
count values (swap count increment is now * 2 instead of ++ etc.).

>
> I also think we can shove the swap_type in the low bits of the
> pointers (with some finesse for swp_slot_t), and the locking could be
> made less granular (I remember exploring going completely lockless,
> but I don't remember how that turned out).

Ah nice, I did not think about that. There are 4 types, so we need at
least 2 bits for typing. Should be doable, but we need to double-check
the size of the physical (i.e. on-swapfile) swap slot handle though.

>
> >
> >
> > This design allows us to:
> > * Decouple zswap (and zeromapped swap entry) from backing swapfile:
> >   simply associate the swap ID with one of the supported backends: a
> >   zswap entry, a zero-filled swap page, a slot on the swapfile, or a
> >   page in memory .
> > * Simplify and optimize swapoff: we only have to fault the page in and
> >   have the swap ID points to the page instead of the on-disk swap slot.
> >   No need to perform any page table walking :)
>
> It also allows us to delete the complex swap count continuation code.

Yep. FWIW, in the swap continuation case the complexity at least buys a
space optimization, whereas in the swapoff case we're limited by the
architecture and cannot really do better complexity- or
efficiency-wise, so I decided to highlight the swapoff simplification
first. But you're right, we can choose not to keep the swap
continuation in the new design (it's what I'm doing in the prototype
at least).



* Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
  2025-01-17  2:47   ` Nhat Pham
@ 2025-01-17  3:09     ` Nhat Pham
  2025-01-17 16:51     ` Yosry Ahmed
  1 sibling, 0 replies; 7+ messages in thread
From: Nhat Pham @ 2025-01-17  3:09 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, akpm, hannes, ryncsn, chengming.zhou, chrisl, linux-mm,
	kernel-team, linux-kernel, shakeel.butt, hch, hughd, 21cnbao,
	usamaarif642

On Fri, Jan 17, 2025 at 9:47 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> No particular reason why we can't squeeze it into swap counts other
> than clarity :) It's going to be a bit annoying working with swap
> count values (swap count increment is now * 2 instead of ++ etc.).

errr.. += 2 I suppose (since the lowest bit is now the swap cache
bit). But point stands I suppose lol.

Getting the exact swap count will be a read, then a divide by 2, etc.
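
In toy form, something like this (helper names made up, just to
illustrate the packing):

#include <stdio.h>

/* Bit 0 = "in swap cache" (the SWAP_HAS_CACHE replacement), the
 * remaining bits hold the swap count. */
#define TOY_CACHE_BIT   1u

static unsigned int toy_swap_count(unsigned int packed)
{
        return packed >> 1;
}

static unsigned int toy_swap_count_inc(unsigned int packed)
{
        return packed + 2;      /* the "+= 2 instead of ++" above */
}

int main(void)
{
        unsigned int packed = 0;

        packed |= TOY_CACHE_BIT;                /* in swap cache, count 0 */
        packed = toy_swap_count_inc(packed);    /* count 1 */
        packed = toy_swap_count_inc(packed);    /* count 2 */

        printf("count=%u in_cache=%u\n", toy_swap_count(packed),
               packed & TOY_CACHE_BIT);
        return 0;
}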



* Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
  2025-01-17  2:47   ` Nhat Pham
  2025-01-17  3:09     ` Nhat Pham
@ 2025-01-17 16:51     ` Yosry Ahmed
  1 sibling, 0 replies; 7+ messages in thread
From: Yosry Ahmed @ 2025-01-17 16:51 UTC (permalink / raw)
  To: Nhat Pham
  Cc: lsf-pc, akpm, hannes, ryncsn, chengming.zhou, chrisl, linux-mm,
	kernel-team, linux-kernel, shakeel.butt, hch, hughd, 21cnbao,
	usamaarif642

On Thu, Jan 16, 2025 at 6:47 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Jan 17, 2025 at 1:48 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Jan 16, 2025 at 1:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > My apologies if I missed any interested party in the cc list -
> > > hopefully the mailing lists cc's suffice :)
> > >
> > > I'd like to (re-)propose the topic of swap abstraction layer for the
> > > conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> > > (see [1], [2], [3]).
> > >
> > > (AFAICT, the same idea has been floated by Rik van Riel since at
> > > least 2011 - see [8]).
> > >
> > > I have a working(-ish) prototype, which hopefully will be
> > > submission-ready soon. For now, I'd like to give the motivation/context
> > > for the topic, as well as some high level design:
> >
> > I would obviously be interested in attending this, albeit virtually if
> > possible. Just sharing some random thoughts below from my cold cache.
>
> Your inputs are always appreciated :)
>
> >
> > >
> > > I. Motivation
> > >
> > > Currently, when an anon page is swapped out, a slot in a backing swap
> > > device is allocated and stored in the page table entries that refer to
> > > the original page. This slot is also used as the "key" to find the
> > > swapped out content, as well as the index to swap data structures, such
> > > as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> > > backing slot in this way is performant and efficient when swap is purely
> > > just disk space, and swapoff is rare.
> > >
> > > However, the advent of many swap optimizations has exposed major
> > > drawbacks of this design. The first problem is that we occupy a physical
> > > slot in the swap space, even for pages that are NEVER expected to hit
> > > the disk: pages compressed and stored in the zswap pool, zero-filled
> > > pages, or pages rejected by both of these optimizations when zswap
> > > writeback is disabled. This is the arguably central shortcoming of
> > > zswap:
> > > * In deployments when no disk space can be afforded for swap (such as
> > >   mobile and embedded devices), users cannot adopt zswap, and are forced
> > >   to use zram. This is confusing for users, and creates extra burdens
> > >   for developers, having to develop and maintain similar features for
> > >   two separate swap backends (writeback, cgroup charging, THP support,
> > >   etc.). For instance, see the discussion in [4].
> > > * Resource-wise, it is hugely wasteful in terms of disk usage, and
> > >   limits the memory saving potentials of these optimizations by the
> > >   static size of the swapfile, especially in high memory systems that
> > >   can have up to terabytes worth of memory. It also creates significant
> > >   challenges for users who rely on swap utilization as an early OOM
> > >   signal.
> > >
> > > Another motivation for a swap redesign is to simplify swapoff, which
> > > is complicated and expensive in the current design. Tight coupling
> > > between a swap entry and its backing storage means that it requires a
> > > whole page table walk to update all the page table entries that refer to
> > > this swap entry, as well as updating all the associated swap data
> > > structures (swap cache, etc.).
> > >
> > >
> > > II. High Level Design Overview
> > >
> > > To fix the aforementioned issues, we need an abstraction that separates
> > > a swap entry from its physical backing storage. IOW, we need to
> > > “virtualize” the swap space: swap clients will work with a virtual swap
> > > slot (that is dynamically allocated on-demand), storing it in page
> > > table entries, and using it to index into various swap-related data
> > > structures.
> > >
> > > The backing storage is decoupled from this slot, and the newly
> > > introduced layer will “resolve” the ID to the actual storage, as well
> > > as cooperating with the swap cache to handle all the required
> > > synchronization. This layer also manages other metadata of the swap
> > > entry, such as its lifetime information (swap count), via a dynamically
> > > allocated per-entry swap descriptor:
> >
> > Do you plan to allocate one per-folio or per-page? I suppose it's
> > per-page based on the design, but I am wondering if you explored
> > having it per-folio. To make it work we'd need to support splitting a
> > swp_desc, and figuring out which slot or zswap_entry corresponds to a
> > certain page in a folio
>
> Per-page, for now. Per-folio requires allocating these swp_descs on
> huge page splitting etc., which is more complex.

We'd also need to allocate them during swapin. If a folio is swapped
out as a 16K chunk with a single swp_desc, and we then try to swap in
one 4K page in the middle, we may need to split the swp_desc into two.

>
> And yeah, we need to chain these zswap_entry's somehow. Not impossible
> certainly, but more overhead and more complexity :)
>
> >
> > >
> > > struct swp_desc {
> > >         swp_entry_t vswap;
> > >         union {
> > >                 swp_slot_t slot;
> > >                 struct folio *folio;
> > >                 struct zswap_entry *zswap_entry;
> > >         };
> > >         struct rcu_head rcu;
> > >
> > >         rwlock_t lock;
> > >         enum swap_type type;
> > >
> > > #ifdef CONFIG_MEMCG
> > >         atomic_t memcgid;
> > > #endif
> > >
> > >         atomic_t in_swapcache;
> > >         struct kref refcnt;
> > >         atomic_t swap_count;
> > > };
> >
> > That seems a bit large. I am assuming this is for the purpose of the
> > prototype and we can reduce its size eventually, right?
>
> Yup. I copied and pasted this from the prototype. Originally I
> squeezed all the state (in_swapcache and the swap type) in an
> integer-type "flag" field + 1 separate swap count field, and protected
> them all with a single rw lock. That gets really ugly/confusing, so
> for the sake of the prototype I just separate them all out in their
> own fields, and play with atomicity to see if it's possible to do
> things locklessly. So far so good (i.e no crashes yet), but the final
> form is TBD :) Maybe we can discuss in closer details once I send out
> this prototype as an RFC?

Yeah, I just had some passing comments.

>
> (I will say though it looks cleaner when all these fields are
> separated. So it's going to be a tradeoff in that sense too).

It's a tradeoff but I think we should be able to hide a lot of the
complexity behind neat helpers. It's not pretty but I think the memory
overhead is an important factor here.

>
> >
> > Particularly, I remember looking into merging the swap_count and
> > refcnt, and I am not sure what in_swapcache is (is this a bit? Why
> > can't we use a bit from swap_count?).
>
> Yup. That's a single bit - it's a (partial) replacement for
> SWAP_HAS_CACHE state in the existing swap map.
>
> No particular reason why we can't squeeze it into swap counts other
> than clarity :) It's going to be a bit annoying working with swap
> count values (swap count increment is now * 2 instead of ++ etc.).

Nothing a nice helper cannot hide :)



* Re: [LSF/MM/BPF TOPIC] Virtual Swap Space
  2025-01-16  9:22 [LSF/MM/BPF TOPIC] Virtual Swap Space Nhat Pham
  2025-01-16  9:29 ` Nhat Pham
  2025-01-16 18:47 ` Yosry Ahmed
@ 2025-03-28 14:41 ` Nhat Pham
  2 siblings, 0 replies; 7+ messages in thread
From: Nhat Pham @ 2025-03-28 14:41 UTC (permalink / raw)
  To: lsf-pc, akpm, hannes
  Cc: ryncsn, chengming.zhou, yosryahmed, chrisl, linux-mm,
	kernel-team, linux-kernel, shakeel.butt, hch, hughd, 21cnbao,
	usamaarif642

On Thu, Jan 16, 2025 at 4:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> My apologies if I missed any interested party in the cc list -
> hopefully the mailing lists cc's suffice :)
>
> I'd like to (re-)propose the topic of swap abstraction layer for the
> conference, as a continuation of Yosry's proposals at LSFMMBPF 2023
> (see [1], [2], [3]).
>
> (AFAICT, the same idea has been floated by Rik van Riel since at
> least 2011 - see [8]).
>
> I have a working(-ish) prototype, which hopefully will be
> submission-ready soon. For now, I'd like to give the motivation/context
> for the topic, as well as some high level design:
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is the arguably central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory saving potentials of these optimizations by the
>   static size of the swapfile, especially in high memory systems that
>   can have up to terabytes worth of memory. It also creates significant
>   challenges for users who rely on swap utilization as an early OOM
>   signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that it requires a
> whole page table walk to update all the page table entries that refer to
> this swap entry, as well as updating all the associated swap data
> structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a virtual swap
> slot (that is dynamically allocated on-demand), storing it in page
> table entries, and using it to index into various swap-related data
> structures.
>
> The backing storage is decoupled from this slot, and the newly
> introduced layer will “resolve” the ID to the actual storage, as well
> as cooperating with the swap cache to handle all the required
> synchronization. This layer also manages other metadata of the swap
> entry, such as its lifetime information (swap count), via a dynamically
> allocated per-entry swap descriptor:
>
> struct swp_desc {
>         swp_entry_t vswap;
>         union {
>                 swp_slot_t slot;
>                 struct folio *folio;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;
>
> #ifdef CONFIG_MEMCG
>         atomic_t memcgid;
> #endif
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;
> };
>
>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>   simply associate the swap ID with one of the supported backends: a
>   zswap entry, a zero-filled swap page, a slot on the swapfile, or a
>   page in memory .
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the swap ID points to the page instead of the on-disk swap slot.
>   No need to perform any page table walking :)
>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>   backing store of THPs, then you can dispatch each range of subpages
>   to appropriate pagein handler.
>
> [1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
> [2]: https://lwn.net/Articles/932077/
> [3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
> [4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
> [5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
> [6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
> [7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
> [8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> [9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/

Link to my slides:

https://drive.google.com/file/d/1mn2kSczvEzwq7j55iKhVB3SP67Qy4KU2/view?usp=sharing

Thank you for your interest!


