linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
@ 2025-02-04 11:44 Kairui Song
  2025-02-04 16:24 ` Johannes Weiner
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Kairui Song @ 2025-02-04 11:44 UTC (permalink / raw)
  To: lsf-pc, linux-mm
  Cc: Andrew Morton, Chris Li, Johannes Weiner, Chengming Zhou,
	Yosry Ahmed, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

Hi all, sorry for the late submission.

Following previous work and topics on the SWAP allocator
[1][2][3][4], this topic proposes a way to redesign and integrate
the multiple pieces of swap metadata into the swap allocator. The
result should be a future-proof design with the following benefits:
- Even lower memory usage than the current design
- Higher performance (Remove HAS_CACHE pin trampoline)
- Dynamic allocation and growth support, further reducing idle memory usage
- Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
- More extensible, providing a clean bedrock for implementing things
like discontinuous swapout, readahead-based mTHP swapin, and more.

People have been complaining about the SWAP management subsystem [5].
Many incremental workarounds and optimizations have been added, but
they cause other problems, e.g. [6][7][8][9], and make new features
more difficult to implement. One reason is that the current design is
already close to minimal memory usage (a 1-byte swap map) with
acceptable performance, so it's hard to beat with incremental changes.
But as more code and features are added, there are already lots of
duplicated parts. So I'm proposing this idea to overhaul the whole
SWAP slot management from a different angle, as follow-up work on the
SWAP allocator [2].

Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
unifying swap data, and we worked together on the short-term solution
first: the swap allocator was the bottleneck for performance and
fragmentation issues. The new cluster allocator solved these issues
and turned the cluster into the basic swap management unit. It also
removed the slot cache freeing path, and I'll post another series soon
to remove the slot cache allocation path, so folios will always
interact with the SWAP allocator directly, preparing for this long
term goal:

A brief intro of the new design
===============================

It will first be a drop-in replacement for the swap cache, using a
per-cluster table to handle everything required for SWAP management.
Compared to the previous attempt to unify the swap cache [11], this
will have lower overhead and make more features achievable:

struct swap_cluster_info {
	spinlock_t lock;
	u16 count;
	u8 flags;
	u8 order;
+	void *table; /* 512 entries */
	struct list_head list;
};

The table itself can have variant formats, but for basic usage,
each void * can be one of the following types:

/*
 * a NULL:    | -----------    0    ------------| - Empty slot
 * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
 * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
 * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
 * SWAP_COUNT is still 8 bits.
 */

Clearly it can hold both the cache and the swap count. The shadow
still has enough bits for distance (using 16M buckets for a 52-bit VA)
or generation counting. For COUNT_CONTINUED, we can simply allocate
another 512 atomics for the cluster.
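To make the tagged entry format above concrete, here is a minimal
userspace sketch of how such 8-byte entries could be encoded and
decoded. All macro and function names, and the exact bit positions,
are illustrative assumptions, not from any posted patch; the only
constraints taken from the diagram are the low type bits and the
8-bit SWAP_COUNT:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 3 bits = type tag, top byte = SWAP_COUNT. */
#define ENTRY_TYPE_SHADOW 0x1UL  /* ...XX1: swapped out, shadow + count */
#define ENTRY_TYPE_PFN    0x2UL  /* ...X10: cached, PFN + count */
#define ENTRY_TYPE_PTR    0x4UL  /* ...100: reserved pointer type */

#define ENTRY_COUNT_SHIFT 56     /* SWAP_COUNT stays 8 bits, in the top byte */
#define ENTRY_DATA_MASK   ((1UL << ENTRY_COUNT_SHIFT) - 1)

/* Build a "cached" entry from a PFN and a swap count. */
static inline uint64_t entry_make_pfn(uint64_t pfn, uint8_t count)
{
	return ((uint64_t)count << ENTRY_COUNT_SHIFT) |
	       ((pfn << 3) & ENTRY_DATA_MASK) | ENTRY_TYPE_PFN;
}

static inline int entry_is_cached(uint64_t ent)
{
	/* X10 pattern: bit 1 set, bit 0 clear */
	return (ent & 0x3UL) == ENTRY_TYPE_PFN;
}

static inline uint8_t entry_swap_count(uint64_t ent)
{
	return ent >> ENTRY_COUNT_SHIFT;
}

static inline uint64_t entry_pfn(uint64_t ent)
{
	/* Strip count and type bits to recover the PFN. */
	return (ent & ENTRY_DATA_MASK) >> 3;
}
```

A NULL (all-zero) entry naturally falls through every type check,
matching the "empty slot" case in the diagram.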

The table is protected by ci->lock, which has little to no contention.
It also gets rid of the "HAS_CACHE bit setting vs cache insert" and
"HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO,
and removes the "multiple smaller files in one big swapfile" design.

It will further remove the swap cgroup map: a cached folio (stored as
a PFN) or the shadow can provide that info. Some careful audit and
workflow redesign might be needed.

Each entry will be 8 bytes, smaller than the current (8-byte cache) +
(2-byte cgroup map) + (1-byte SWAP map) = 11 bytes.
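As a sanity check on the arithmetic above, a rough sketch (constants
and helper names are illustrative): at 512 entries per cluster, a
full table is 512 * 8 = 4096 bytes, i.e. exactly one 4K page per
cluster, versus 11 bytes per entry spread across three structures
today.

```c
#include <assert.h>
#include <stddef.h>

#define ENTRIES_PER_CLUSTER 512  /* one cluster covers 512 swap slots */

/* Today: 8B swap cache xarray slot + 2B cgroup map + 1B swap map. */
static size_t current_per_entry(void)
{
	return 8 + 2 + 1;
}

/* Proposed: one tagged 8-byte table entry carries all three. */
static size_t table_per_entry(void)
{
	return sizeof(void *);  /* 8 on 64-bit */
}

/* Whole cluster table: 512 * 8 = 4096 bytes, one 4K page. */
static size_t cluster_table_size(void)
{
	return ENTRIES_PER_CLUSTER * table_per_entry();
}
```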

Shadow reclaim and high-order storage are still doable too, by
introducing dense cluster table formats. We can even optimize
specially for shmem to use 1 bit per entry. And empty clusters can
have their table freed. This part might be optional.

And it can have more types to support things like entry migration or a
virtual swapfile. The example format above shows four types. The last
three (or more) bits can be used as a type indicator, since HAS_CACHE
and COUNT_CONTINUED will be gone.

Issues
======
There are unresolved problems or issues that may be worth addressing:
- Is workingset node reclaim really worth doing? We didn't do that
until 5649d113ffce in 2023, especially considering slab fragmentation
and the limited amount of SWAP compared to file cache.
- Userspace API change? This new design will allow dynamic growth of
the swap size (especially for non-physical devices like ZRAM or a
virtual/ghost swapfile). It may be worth thinking about how this
should be exposed and used.
- Advanced usage and extensions for issues like "Swap Min Order" and
"Discontinuous swapout". For example, the "Swap Min Order" issue might
be solvable by allocating only a specific order using the new cluster
allocator, then having an abstract / virtual file as a batching layer.
This layer may use some "redirection entries" in its table, with very
low overhead, and be optional in real-world usage. Details are yet to
be decided.
- Note that this will allow all swapin to stop bypassing the swap
cache (just like a previous series) with better performance. This may
provide an opportunity to implement a tunable readahead-based large
folio swapin. [12]

[1] https://lwn.net/Articles/974587/
[2] https://lpc.events/event/18/contributions/1769/
[3] https://lwn.net/Articles/984090/
[4] https://lwn.net/Articles/1005081/
[5] https://lwn.net/Articles/932077/
[6] https://lore.kernel.org/linux-mm/20240206182559.32264-1-ryncsn@gmail.com/
[7] https://lore.kernel.org/lkml/20240324210447.956973-1-hannes@cmpxchg.org/
[8] https://lore.kernel.org/lkml/20240926211936.75373-1-21cnbao@gmail.com/
[9] https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
[10] https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/
[11] https://lwn.net/Articles/966845/
[12] https://lore.kernel.org/lkml/874j7zfqkk.fsf@yhuang6-desk2.ccr.corp.intel.com/


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 11:44 [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator Kairui Song
@ 2025-02-04 16:24 ` Johannes Weiner
  2025-02-04 16:46   ` Kairui Song
  2025-02-04 16:44 ` Yosry Ahmed
  2025-03-26  3:23 ` Kairui Song
  2 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2025-02-04 16:24 UTC (permalink / raw)
  To: Kairui Song
  Cc: lsf-pc, linux-mm, Andrew Morton, Chris Li, Chengming Zhou,
	Yosry Ahmed, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

Hi Kairui,

On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> Hi all, sorry for the late submission.
> 
> Following previous work and topics with the SWAP allocator
> [1][2][3][4], this topic would propose a way to redesign and integrate
> multiple swap data into the swap allocator, which should be a
> future-proof design, achieving following benefits:
> - Even lower memory usage than the current design
> - Higher performance (Remove HAS_CACHE pin trampoline)
> - Dynamic allocation and growth support, further reducing idle memory usage
> - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> - More extensible, provide a clean bedrock for implementing things
> like discontinuous swapout, readahead based mTHP swapin and more.
> 
> People have been complaining about the SWAP management subsystem [5].
> Many incremental workarounds and optimizations are added, but causes
> many other problems eg. [6][7][8][9] and making implementing new
> features more difficult. One reason is the current design almost has
> the minimal memory usage (1 byte swap map) with acceptable
> performance, so it's hard to beat with incremental changes. But
> actually as more code and features are added, there are already lots
> of duplicated parts. So I'm proposing this idea to overhaul whole SWAP
> slot management from a different aspect, as the following work on the
> SWAP allocator [2].
> 
> Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> unifying swap data, we worked together to implement the short term
> solution first: The swap allocator was the bottleneck for performance
> and fragmentation issues. The new cluster allocator solved these
> issues, and turned the cluster into a basic swap management unit.
> It also removed slot cache freeing path, and I'll post another series
> soon to remove the slot cache allocation path, so folios will always
> interact with the SWAP allocator directly, preparing for this long
> term goal:
> 
> A brief intro of the new design
> ===============================
> 
> It will first be a drop-in replacement for swap cache, using a per
> cluster table to handle all things required for SWAP management.
> Compared to the previous attempt to unify swap cache [11], this will
> have lower overhead with more features achievable:
> 
> struct swap_cluster_info {
> spinlock_t lock;
> u16 count;
> u8 flags;
> u8 order;
> + void *table; /* 512 entries */
> struct list_head list;
> };
> 
> The table itself can have variants of format, but for basic usage,
> each void* could be in one of the following type:
> 
> /*
>  * a NULL:    | -----------    0    ------------| - Empty slot
>  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swaped out
>  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
>  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> * SWAP_COUNT is still 8 bits.
>  */
> 
> Clearly it can hold both cache and swap count. The shadow still has
> enough for distance (using 16M as buckets for 52 bit VA) or gen
> counting. For COUNT_CONTINUED, it can simply allocate another 512
> atomics for one cluster.
> 
> The table is protected by ci->lock, which has little to none contention.
> It also gets rid of the "HAS_CACHE bit setting vs Cache Insert",
> "HAS_CACHE pin as trampoline" issue, deprecating SWP_SYNCHRONOUS_IO.
> And remove the "multiple smaller file in one bit swapfile" design.
> 
> It will further remove the swap cgroup map. Cached folio (stored as
> PFN) or shadow can provide such info. Some careful audit and workflow
> redesign might be needed.
> 
> Each entry will be 8 bytes, smaller than current (8 bytes cache) + (2
> bytes cgroup map) + (1 bytes SWAP Map) = 11 bytes.
> 
> Shadow reclaim and high order storing are still doable too, by
> introducing dense cluster tables formats. We can even optimize it
> specially for shmem to have 1 bit per entry. And empty clusters can
> have their table freed. This part might be optional.
> 
> And it can have more types for supporting things like entry migrations
> or virtual swapfile. The example formats above showed four types. Last
> three or more bits can be used as a type indicator, as HAS_CACHE and
> COUNT_CONTINUED will be gone.

My understanding is that this would still tie the swap space to
configured swapfiles. That aspect of the current design has more and
more turned into a problem, because we now have several categories of
swap entries that either permanently or for extended periods of time
live in memory. Such entries should not occupy actual disk space.

The oldest one is probably partially refaulted entries (where one out
of N swapped page tables faults back in). We currently have to spend
full pages of both memory AND disk space for these.

The newest ones are zero-filled entries which are stored in a bitmap.

Then there is zswap. You mention ghost swapfiles - I know some setups
do this to use zswap purely for compression. But zswap is a writeback
cache for real swapfiles primarily, and it is used as such. That means
entries need to be able to move from the compressed pool to disk at
some point, but might not for a long time. Tying the compressed pool
size to disk space is hugely wasteful and an operational headache.

So I think any future proof design for the swap allocator needs to
decouple the virtual memory layer (page table count, swapcache, memcg
linkage, shadow info) from the physical layer (swapfile slot).

Can you touch on that concern?



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 11:44 [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator Kairui Song
  2025-02-04 16:24 ` Johannes Weiner
@ 2025-02-04 16:44 ` Yosry Ahmed
  2025-02-04 16:56   ` Kairui Song
  2025-03-26  3:23 ` Kairui Song
  2 siblings, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2025-02-04 16:44 UTC (permalink / raw)
  To: Kairui Song
  Cc: lsf-pc, linux-mm, Andrew Morton, Chris Li, Johannes Weiner,
	Chengming Zhou, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> Hi all, sorry for the late submission.
> 
> Following previous work and topics with the SWAP allocator
> [1][2][3][4], this topic would propose a way to redesign and integrate
> multiple swap data into the swap allocator, which should be a
> future-proof design, achieving following benefits:
> - Even lower memory usage than the current design
> - Higher performance (Remove HAS_CACHE pin trampoline)
> - Dynamic allocation and growth support, further reducing idle memory usage
> - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> - More extensible, provide a clean bedrock for implementing things
> like discontinuous swapout, readahead based mTHP swapin and more.
> 
> People have been complaining about the SWAP management subsystem [5].
> Many incremental workarounds and optimizations are added, but causes
> many other problems eg. [6][7][8][9] and making implementing new
> features more difficult. One reason is the current design almost has
> the minimal memory usage (1 byte swap map) with acceptable
> performance, so it's hard to beat with incremental changes. But
> actually as more code and features are added, there are already lots
> of duplicated parts. So I'm proposing this idea to overhaul whole SWAP
> slot management from a different aspect, as the following work on the
> SWAP allocator [2].
> 
> Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> unifying swap data, we worked together to implement the short term
> solution first: The swap allocator was the bottleneck for performance
> and fragmentation issues. The new cluster allocator solved these
> issues, and turned the cluster into a basic swap management unit.
> It also removed slot cache freeing path, and I'll post another series
> soon to remove the slot cache allocation path, so folios will always
> interact with the SWAP allocator directly, preparing for this long
> term goal:


I believe this was first raised in some form at LSFMM 2023 [1] :)

The approach described here is different, as it's cluster-based, which
is interesting. I am interested to know how this helps separate the swap
core layer from the underlying backing, as Johannes asked.

In all cases, Nhat is also working on something similar, so we need some
coordination here to avoid duplicated/conflicting work.

Thanks!

[1] https://lwn.net/Articles/932077/

> 
> A brief intro of the new design
> ===============================
> 
> It will first be a drop-in replacement for swap cache, using a per
> cluster table to handle all things required for SWAP management.
> Compared to the previous attempt to unify swap cache [11], this will
> have lower overhead with more features achievable:
> 
> struct swap_cluster_info {
> spinlock_t lock;
> u16 count;
> u8 flags;
> u8 order;
> + void *table; /* 512 entries */
> struct list_head list;
> };
> 
> The table itself can have variants of format, but for basic usage,
> each void* could be in one of the following type:
> 
> /*
>  * a NULL:    | -----------    0    ------------| - Empty slot
>  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swaped out
>  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
>  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> * SWAP_COUNT is still 8 bits.
>  */
> 
> Clearly it can hold both cache and swap count. The shadow still has
> enough for distance (using 16M as buckets for 52 bit VA) or gen
> counting. For COUNT_CONTINUED, it can simply allocate another 512
> atomics for one cluster.
> 
> The table is protected by ci->lock, which has little to none contention.
> It also gets rid of the "HAS_CACHE bit setting vs Cache Insert",
> "HAS_CACHE pin as trampoline" issue, deprecating SWP_SYNCHRONOUS_IO.
> And remove the "multiple smaller file in one bit swapfile" design.
> 
> It will further remove the swap cgroup map. Cached folio (stored as
> PFN) or shadow can provide such info. Some careful audit and workflow
> redesign might be needed.
> 
> Each entry will be 8 bytes, smaller than current (8 bytes cache) + (2
> bytes cgroup map) + (1 bytes SWAP Map) = 11 bytes.
> 
> Shadow reclaim and high order storing are still doable too, by
> introducing dense cluster tables formats. We can even optimize it
> specially for shmem to have 1 bit per entry. And empty clusters can
> have their table freed. This part might be optional.
> 
> And it can have more types for supporting things like entry migrations
> or virtual swapfile. The example formats above showed four types. Last
> three or more bits can be used as a type indicator, as HAS_CACHE and
> COUNT_CONTINUED will be gone.
> 
> Issues
> ======
> There are unresolved problems or issues that may be worth getting some
> addressing:
> - Is workingset node reclaim really worth doing? We didn't do that
> until 5649d113ffce in 2023. Especially considering fragmentation of
> slab and the limited amount of SWAP compared to file cache.
> - Userspace API change? This new design will allow dynamic growth of
> swap size (especially for non physical devices like ZRAM or a
> virtual/ghost swapfile). This may be worth thinking about how to be
> used.
> - Advanced usage and extensions for issues like "Swap Min Order",
> "Discontinuous swapout". For example the "Swap Min Order" issue might
> be solvable by allocating only specific order using the new cluster
> allocator, then having an abstract / virtual file as a batch layer.
> This layer may use some "redirection entries" in its table, with a
> very low overhead and be optional in real world usage. Details are yet
> to be decided.
> - Noticed that this will allow all swapin to no longer bypass swap
> cache (just like previous series) with better performance. This may
> provide an opportunity to implement a tunable readahead based large
> folio swapin. [12]
> 
> [1] https://lwn.net/Articles/974587/
> [2] https://lpc.events/event/18/contributions/1769/
> [3] https://lwn.net/Articles/984090/
> [4] https://lwn.net/Articles/1005081/
> [5] https://lwn.net/Articles/932077/
> [6] https://lore.kernel.org/linux-mm/20240206182559.32264-1-ryncsn@gmail.com/
> [7] https://lore.kernel.org/lkml/20240324210447.956973-1-hannes@cmpxchg.org/
> [8] https://lore.kernel.org/lkml/20240926211936.75373-1-21cnbao@gmail.com/
> [9] https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> [10] https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/
> [11] https://lwn.net/Articles/966845/
> [12] https://lore.kernel.org/lkml/874j7zfqkk.fsf@yhuang6-desk2.ccr.corp.intel.com/
> 



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 16:24 ` Johannes Weiner
@ 2025-02-04 16:46   ` Kairui Song
  2025-02-04 18:11     ` Yosry Ahmed
  0 siblings, 1 reply; 10+ messages in thread
From: Kairui Song @ 2025-02-04 16:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: lsf-pc, linux-mm, Andrew Morton, Chris Li, Chengming Zhou,
	Yosry Ahmed, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Kairui,
>
> On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > Hi all, sorry for the late submission.
> >
> > Following previous work and topics with the SWAP allocator
> > [1][2][3][4], this topic would propose a way to redesign and integrate
> > multiple swap data into the swap allocator, which should be a
> > future-proof design, achieving following benefits:
> > - Even lower memory usage than the current design
> > - Higher performance (Remove HAS_CACHE pin trampoline)
> > - Dynamic allocation and growth support, further reducing idle memory usage
> > - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> > - More extensible, provide a clean bedrock for implementing things
> > like discontinuous swapout, readahead based mTHP swapin and more.
> >
> > People have been complaining about the SWAP management subsystem [5].
> > Many incremental workarounds and optimizations are added, but causes
> > many other problems eg. [6][7][8][9] and making implementing new
> > features more difficult. One reason is the current design almost has
> > the minimal memory usage (1 byte swap map) with acceptable
> > performance, so it's hard to beat with incremental changes. But
> > actually as more code and features are added, there are already lots
> > of duplicated parts. So I'm proposing this idea to overhaul whole SWAP
> > slot management from a different aspect, as the following work on the
> > SWAP allocator [2].
> >
> > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > unifying swap data, we worked together to implement the short term
> > solution first: The swap allocator was the bottleneck for performance
> > and fragmentation issues. The new cluster allocator solved these
> > issues, and turned the cluster into a basic swap management unit.
> > It also removed slot cache freeing path, and I'll post another series
> > soon to remove the slot cache allocation path, so folios will always
> > interact with the SWAP allocator directly, preparing for this long
> > term goal:
> >
> > A brief intro of the new design
> > ===============================
> >
> > It will first be a drop-in replacement for swap cache, using a per
> > cluster table to handle all things required for SWAP management.
> > Compared to the previous attempt to unify swap cache [11], this will
> > have lower overhead with more features achievable:
> >
> > struct swap_cluster_info {
> > spinlock_t lock;
> > u16 count;
> > u8 flags;
> > u8 order;
> > + void *table; /* 512 entries */
> > struct list_head list;
> > };
> >
> > The table itself can have variants of format, but for basic usage,
> > each void* could be in one of the following type:
> >
> > /*
> >  * a NULL:    | -----------    0    ------------| - Empty slot
> >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swaped out
> >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > * SWAP_COUNT is still 8 bits.
> >  */
> >
> > Clearly it can hold both cache and swap count. The shadow still has
> > enough for distance (using 16M as buckets for 52 bit VA) or gen
> > counting. For COUNT_CONTINUED, it can simply allocate another 512
> > atomics for one cluster.
> >
> > The table is protected by ci->lock, which has little to none contention.
> > It also gets rid of the "HAS_CACHE bit setting vs Cache Insert",
> > "HAS_CACHE pin as trampoline" issue, deprecating SWP_SYNCHRONOUS_IO.
> > And remove the "multiple smaller file in one bit swapfile" design.
> >
> > It will further remove the swap cgroup map. Cached folio (stored as
> > PFN) or shadow can provide such info. Some careful audit and workflow
> > redesign might be needed.
> >
> > Each entry will be 8 bytes, smaller than current (8 bytes cache) + (2
> > bytes cgroup map) + (1 bytes SWAP Map) = 11 bytes.
> >
> > Shadow reclaim and high order storing are still doable too, by
> > introducing dense cluster tables formats. We can even optimize it
> > specially for shmem to have 1 bit per entry. And empty clusters can
> > have their table freed. This part might be optional.
> >
> > And it can have more types for supporting things like entry migrations
> > or virtual swapfile. The example formats above showed four types. Last
> > three or more bits can be used as a type indicator, as HAS_CACHE and
> > COUNT_CONTINUED will be gone.
>

Hi Johannes

> My understanding is that this would still tie the swap space to
> configured swapfiles. That aspect of the current design has more and
> more turned into a problem, because we now have several categories of
> swap entries that either permanently or for extended periods of time
> live in memory. Such entries should not occupy actual disk space.
>
> The oldest one is probably partially refaulted entries (where one out
> of N swapped page tables faults back in). We currently have to spend
> full pages of both memory AND disk space for these.
>
> The newest ones are zero-filled entries which are stored in a bitmap.
>
> Then there is zswap. You mention ghost swapfiles - I know some setups
> do this to use zswap purely for compression. But zswap is a writeback
> cache for real swapfiles primarily, and it is used as such. That means
> entries need to be able to move from the compressed pool to disk at
> some point, but might not for a long time. Tying the compressed pool
> size to disk space is hugely wasteful and an operational headache.
>
> So I think any future proof design for the swap allocator needs to
> decouple the virtual memory layer (page table count, swapcache, memcg
> linkage, shadow info) from the physical layer (swapfile slot).
>
> Can you touch on that concern?

Yes, I fully understand your concern. The purpose of this swap table
design is to provide a base for building other parts, including
decoupling the virtual layer from the physical layer.

The table entry can have different types, so a virtual file/space can
leverage this too. For example, the virtual layer can have something
like a "redirection entry" pointing to the physical device layer, or
just a pointer to anything that could possibly be used (one of the
four example types I provided is a pointer). A swap space will need
something to index its data anyway.
We have already deployed a very similar solution internally for
multi-layer swapout, and it's working well; we expect to implement it
upstream and deprecate the downstream solution.

Using an optional layer for this still consumes very little memory
(16 bytes per entry for two layers, and it might be doable with just
a single layer). And setups that don't need an extra layer can ignore
that part and have only 8 bytes per entry, keeping overhead very low.
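For illustration, a "redirection entry" in the virtual layer could be
the pointer-type entry mentioned above, carrying the physical backing
(device, slot offset) behind it. This is only a hedged sketch; the
struct, names, and tag values are hypothetical, not from the deployed
solution:

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_TYPE_MASK 0x7UL
#define ENTRY_TYPE_PTR  0x4UL   /* ...100: pointer-type entry */

/* Hypothetical descriptor of a slot on a physical swap device. */
struct phys_slot {
	uint64_t dev;           /* backing swap device */
	uint64_t offset;        /* slot offset on that device */
};

/* Wrap a physical-slot descriptor as a pointer-type table entry;
 * 8-byte-aligned allocations leave the low 3 bits free for the tag. */
static inline uint64_t make_redirect(struct phys_slot *slot)
{
	return (uint64_t)(uintptr_t)slot | ENTRY_TYPE_PTR;
}

/* Recover the descriptor, or NULL if the entry is not a redirection. */
static inline struct phys_slot *get_redirect(uint64_t ent)
{
	if ((ent & ENTRY_TYPE_MASK) != ENTRY_TYPE_PTR)
		return 0;
	return (struct phys_slot *)(uintptr_t)(ent & ~ENTRY_TYPE_MASK);
}
```

The virtual layer would resolve such an entry to the physical device
on writeback or swapin, keeping the virtual slot stable while the
physical backing moves (e.g. from a compressed pool to disk).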



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 16:44 ` Yosry Ahmed
@ 2025-02-04 16:56   ` Kairui Song
  0 siblings, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-02-04 16:56 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, linux-mm, Andrew Morton, Chris Li, Johannes Weiner,
	Chengming Zhou, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Wed, Feb 5, 2025 at 12:44 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > Hi all, sorry for the late submission.
> >
> > Following previous work and topics with the SWAP allocator
> > [1][2][3][4], this topic would propose a way to redesign and integrate
> > multiple swap data into the swap allocator, which should be a
> > future-proof design, achieving following benefits:
> > - Even lower memory usage than the current design
> > - Higher performance (Remove HAS_CACHE pin trampoline)
> > - Dynamic allocation and growth support, further reducing idle memory usage
> > - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> > - More extensible, provide a clean bedrock for implementing things
> > like discontinuous swapout, readahead based mTHP swapin and more.
> >
> > People have been complaining about the SWAP management subsystem [5].
> > Many incremental workarounds and optimizations are added, but causes
> > many other problems eg. [6][7][8][9] and making implementing new
> > features more difficult. One reason is the current design almost has
> > the minimal memory usage (1 byte swap map) with acceptable
> > performance, so it's hard to beat with incremental changes. But
> > actually as more code and features are added, there are already lots
> > of duplicated parts. So I'm proposing this idea to overhaul whole SWAP
> > slot management from a different aspect, as the following work on the
> > SWAP allocator [2].
> >
> > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > unifying swap data, we worked together to implement the short term
> > solution first: The swap allocator was the bottleneck for performance
> > and fragmentation issues. The new cluster allocator solved these
> > issues, and turned the cluster into a basic swap management unit.
> > It also removed slot cache freeing path, and I'll post another series
> > soon to remove the slot cache allocation path, so folios will always
> > interact with the SWAP allocator directly, preparing for this long
> > term goal:

Hi Yosry,

> I believe this was first raised in some form in LSFMM 2023 [1] :)

Oh, sorry, my bad. I didn't check the history of this well enough...
Thanks for the info!

>
> The approach described here is different, as it's cluster-based, which
> is interesting. I am interested to know how this helps separate the swap
> core layer from the underlying backing, as Johannes asked.
>
> In all cases, Nhat is also working on something similar, so we need some
> coordination here to avoid duplicated/conflicting work.

Right, I saw that. I think there is no fundamental conflict, as so far
the cluster-based table approach is mostly focused on simplifying the
index and allocation/freeing (also swapin/swapout) paths. The ideas
mostly emerged while I was working on the swap allocator last year,
and to address the discontinuous swapout and min swapout order issues,
Chris shared some very helpful insights that will fit well with this
approach.

I fully agree that we can discuss and figure out a way to arrange the
development and implement things with the optimal approach :)

>
> Thanks!
>
> [1]https://lwn.net/Articles/932077/
>



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 16:46   ` Kairui Song
@ 2025-02-04 18:11     ` Yosry Ahmed
  2025-02-04 18:38       ` Kairui Song
  0 siblings, 1 reply; 10+ messages in thread
From: Yosry Ahmed @ 2025-02-04 18:11 UTC (permalink / raw)
  To: Kairui Song
  Cc: Johannes Weiner, lsf-pc, linux-mm, Andrew Morton, Chris Li,
	Chengming Zhou, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Wed, Feb 05, 2025 at 12:46:26AM +0800, Kairui Song wrote:
> On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hi Kairui,
> >
> > On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > > Hi all, sorry for the late submission.
> > >
> > > Following previous work and topics with the SWAP allocator
> > > [1][2][3][4], this topic would propose a way to redesign and integrate
> > > multiple swap data into the swap allocator, which should be a
> > > future-proof design, achieving following benefits:
> > > - Even lower memory usage than the current design
> > > - Higher performance (Remove HAS_CACHE pin trampoline)
> > > - Dynamic allocation and growth support, further reducing idle memory usage
> > > - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> > > - More extensible, provide a clean bedrock for implementing things
> > > like discontinuous swapout, readahead based mTHP swapin and more.
> > >
> > > People have been complaining about the SWAP management subsystem [5].
> > > Many incremental workarounds and optimizations are added, but causes
> > > many other problems eg. [6][7][8][9] and making implementing new
> > > features more difficult. One reason is the current design almost has
> > > the minimal memory usage (1 byte swap map) with acceptable
> > > performance, so it's hard to beat with incremental changes. But
> > > actually as more code and features are added, there are already lots
> > > of duplicated parts. So I'm proposing this idea to overhaul whole SWAP
> > > slot management from a different aspect, as the following work on the
> > > SWAP allocator [2].
> > >
> > > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > > unifying swap data, we worked together to implement the short term
> > > solution first: The swap allocator was the bottleneck for performance
> > > and fragmentation issues. The new cluster allocator solved these
> > > issues, and turned the cluster into a basic swap management unit.
> > > It also removed slot cache freeing path, and I'll post another series
> > > soon to remove the slot cache allocation path, so folios will always
> > > interact with the SWAP allocator directly, preparing for this long
> > > term goal:
> > >
> > > A brief intro of the new design
> > > ===============================
> > >
> > > It will first be a drop-in replacement for swap cache, using a per
> > > cluster table to handle all things required for SWAP management.
> > > Compared to the previous attempt to unify swap cache [11], this will
> > > have lower overhead with more features achievable:
> > >
> > > struct swap_cluster_info {
> > > spinlock_t lock;
> > > u16 count;
> > > u8 flags;
> > > u8 order;
> > > + void *table; /* 512 entries */
> > > struct list_head list;
> > > };
> > >
> > > The table itself can have variants of format, but for basic usage,
> > > each void* could be in one of the following type:
> > >
> > > /*
> > >  * a NULL:    | -----------    0    ------------| - Empty slot
> > >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swaped out
> > >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> > >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > > * SWAP_COUNT is still 8 bits.
> > >  */
> > >
> > > Clearly it can hold both cache and swap count. The shadow still has
> > > enough for distance (using 16M as buckets for 52 bit VA) or gen
> > > counting. For COUNT_CONTINUED, it can simply allocate another 512
> > > atomics for one cluster.
> > >
> > > The table is protected by ci->lock, which has little to none contention.
> > > It also gets rid of the "HAS_CACHE bit setting vs Cache Insert",
> > > "HAS_CACHE pin as trampoline" issue, deprecating SWP_SYNCHRONOUS_IO.
> > > And remove the "multiple smaller file in one bit swapfile" design.
> > >
> > > It will further remove the swap cgroup map. Cached folio (stored as
> > > PFN) or shadow can provide such info. Some careful audit and workflow
> > > redesign might be needed.
> > >
> > > Each entry will be 8 bytes, smaller than current (8 bytes cache) + (2
> > > bytes cgroup map) + (1 bytes SWAP Map) = 11 bytes.
> > >
> > > Shadow reclaim and high order storing are still doable too, by
> > > introducing dense cluster tables formats. We can even optimize it
> > > specially for shmem to have 1 bit per entry. And empty clusters can
> > > have their table freed. This part might be optional.
> > >
> > > And it can have more types for supporting things like entry migrations
> > > or virtual swapfile. The example formats above showed four types. Last
> > > three or more bits can be used as a type indicator, as HAS_CACHE and
> > > COUNT_CONTINUED will be gone.
> >
> 
> Hi Johannes
> 
> > My understanding is that this would still tie the swap space to
> > configured swapfiles. That aspect of the current design has more and
> > more turned into a problem, because we now have several categories of
> > swap entries that either permanently or for extended periods of time
> > live in memory. Such entries should not occupy actual disk space.
> >
> > The oldest one is probably partially refaulted entries (where one out
> > of N swapped page tables faults back in). We currently have to spend
> > full pages of both memory AND disk space for these.
> >
> > The newest ones are zero-filled entries which are stored in a bitmap.
> >
> > Then there is zswap. You mention ghost swapfiles - I know some setups
> > do this to use zswap purely for compression. But zswap is a writeback
> > cache for real swapfiles primarily, and it is used as such. That means
> > entries need to be able to move from the compressed pool to disk at
> > some point, but might not for a long time. Tying the compressed pool
> > size to disk space is hugely wasteful and an operational headache.
> >
> > So I think any future proof design for the swap allocator needs to
> > decouple the virtual memory layer (page table count, swapcache, memcg
> > linkage, shadow info) from the physical layer (swapfile slot).
> >
> > Can you touch on that concern?
> 
> Yes, I fully understand your concern. The purpose of this swap table
> design is to provide a base for building other parts, including
> decoupling the virtual layer from the physical layer.
> 
> The table entry can have different types, so virtual file/space can
> leverage this too. For example the virtual layer can have something
> like a "redirection entry" pointing to a physical device layer. Or
> just a pointer to anything that could possibly be used (In the four
> examples I provided one type is a pointer). A swap space will need
> something to index its data.
> We have already internally deployed a very similar solution for
> multi-layer swapout, and it's working well, we are expecting to
> upstreamly implement it and deprecate the downstream solution.
> 
> Using an optional layer for doing so still consumes very little memory
> (16 bytes per entry for two layers, and this might be doable just with
> single layer), And there are setups that doesn't need a extra layer,
> such setups can ignore that part and have only 8 bytes per entry,
> having a very low overhead.

IIUC with this design we still have a fixed-size swap space, but it's
not directly tied to the physical swap layer (i.e. it can be backed with
a swap slot on disk, zswap, zero-filled pages, etc). Did I get this
right?

In this case, using clusters to manage this should be an implementation
detail that is not visible to userspace. Ideally the kernel would
allocate more clusters dynamically as needed, and when a swap entry is
being allocated in that cluster the kernel chooses the backing for that
swap entry based on the available options.

I see the benefit of managing things on the cluster level to reduce
memory overhead (e.g. one lock per cluster vs. per entry), and to
leverage existing code where it makes sense.

However, what we should *not* do is have these clusters be tied to the
disk swap space with the ability to redirect some entries to use
something like zswap. This does not fix the problem Johannes is
describing.
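
As a purely illustrative sketch (all names and the layout are
hypothetical, not actual kernel code), a backing-agnostic virtual swap
entry could tag the chosen backing in its low bits, with the backing
decided only when the entry is allocated:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: a virtual swap entry whose low bits tag the
 * backing chosen at allocation time (disk slot, zswap object,
 * zero-filled page). Nothing here is actual kernel code.
 */
enum vswap_backing {
	VSWAP_EMPTY = 0,	/* no backing yet */
	VSWAP_DISK  = 1,	/* backed by a slot on a real swapfile */
	VSWAP_ZSWAP = 2,	/* backed by a compressed zswap object */
	VSWAP_ZERO  = 3,	/* zero-filled, no storage at all */
};

#define VSWAP_TYPE_BITS  2
#define VSWAP_TYPE_MASK  ((1ul << VSWAP_TYPE_BITS) - 1)

/* Pack a backing type and an opaque payload (slot index, object id, ...). */
static inline uintptr_t vswap_entry(enum vswap_backing type, uintptr_t payload)
{
	return (payload << VSWAP_TYPE_BITS) | type;
}

static inline enum vswap_backing vswap_type(uintptr_t entry)
{
	return (enum vswap_backing)(entry & VSWAP_TYPE_MASK);
}

static inline uintptr_t vswap_payload(uintptr_t entry)
{
	return entry >> VSWAP_TYPE_BITS;
}
```

The point is only that the virtual entry itself never implies a disk
slot; the tag is resolved per entry.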



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 18:11     ` Yosry Ahmed
@ 2025-02-04 18:38       ` Kairui Song
  2025-02-04 19:09         ` Johannes Weiner
  0 siblings, 1 reply; 10+ messages in thread
From: Kairui Song @ 2025-02-04 18:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, lsf-pc, linux-mm, Andrew Morton, Chris Li,
	Chengming Zhou, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Wed, Feb 5, 2025 at 2:11 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Wed, Feb 05, 2025 at 12:46:26AM +0800, Kairui Song wrote:
> > On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Hi Kairui,
> > >
> > > On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > > > Hi all, sorry for the late submission.
> > > >
> > > > Following previous work and topics with the SWAP allocator
> > > > [1][2][3][4], this topic would propose a way to redesign and integrate
> > > > multiple swap data into the swap allocator, which should be a
> > > > future-proof design, achieving following benefits:
> > > > - Even lower memory usage than the current design
> > > > - Higher performance (Remove HAS_CACHE pin trampoline)
> > > > - Dynamic allocation and growth support, further reducing idle memory usage
> > > > - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> > > > - More extensible, provide a clean bedrock for implementing things
> > > > like discontinuous swapout, readahead based mTHP swapin and more.
> > > >
> > > > People have been complaining about the SWAP management subsystem [5].
> > > > Many incremental workarounds and optimizations are added, but causes
> > > > many other problems eg. [6][7][8][9] and making implementing new
> > > > features more difficult. One reason is the current design almost has
> > > > the minimal memory usage (1 byte swap map) with acceptable
> > > > performance, so it's hard to beat with incremental changes. But
> > > > actually as more code and features are added, there are already lots
> > > > of duplicated parts. So I'm proposing this idea to overhaul whole SWAP
> > > > slot management from a different aspect, as the following work on the
> > > > SWAP allocator [2].
> > > >
> > > > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > > > unifying swap data, we worked together to implement the short term
> > > > solution first: The swap allocator was the bottleneck for performance
> > > > and fragmentation issues. The new cluster allocator solved these
> > > > issues, and turned the cluster into a basic swap management unit.
> > > > It also removed slot cache freeing path, and I'll post another series
> > > > soon to remove the slot cache allocation path, so folios will always
> > > > interact with the SWAP allocator directly, preparing for this long
> > > > term goal:
> > > >
> > > > A brief intro of the new design
> > > > ===============================
> > > >
> > > > It will first be a drop-in replacement for swap cache, using a per
> > > > cluster table to handle all things required for SWAP management.
> > > > Compared to the previous attempt to unify swap cache [11], this will
> > > > have lower overhead with more features achievable:
> > > >
> > > > struct swap_cluster_info {
> > > > spinlock_t lock;
> > > > u16 count;
> > > > u8 flags;
> > > > u8 order;
> > > > + void *table; /* 512 entries */
> > > > struct list_head list;
> > > > };
> > > >
> > > > The table itself can have variants of format, but for basic usage,
> > > > each void* could be in one of the following type:
> > > >
> > > > /*
> > > >  * a NULL:    | -----------    0    ------------| - Empty slot
> > > >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swaped out
> > > >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> > > >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > > > * SWAP_COUNT is still 8 bits.
> > > >  */
> > > >
> > > > Clearly it can hold both cache and swap count. The shadow still has
> > > > enough for distance (using 16M as buckets for 52 bit VA) or gen
> > > > counting. For COUNT_CONTINUED, it can simply allocate another 512
> > > > atomics for one cluster.
> > > >
> > > > The table is protected by ci->lock, which has little to none contention.
> > > > It also gets rid of the "HAS_CACHE bit setting vs Cache Insert",
> > > > "HAS_CACHE pin as trampoline" issue, deprecating SWP_SYNCHRONOUS_IO.
> > > > And remove the "multiple smaller file in one bit swapfile" design.
> > > >
> > > > It will further remove the swap cgroup map. Cached folio (stored as
> > > > PFN) or shadow can provide such info. Some careful audit and workflow
> > > > redesign might be needed.
> > > >
> > > > Each entry will be 8 bytes, smaller than current (8 bytes cache) + (2
> > > > bytes cgroup map) + (1 bytes SWAP Map) = 11 bytes.
> > > >
> > > > Shadow reclaim and high order storing are still doable too, by
> > > > introducing dense cluster tables formats. We can even optimize it
> > > > specially for shmem to have 1 bit per entry. And empty clusters can
> > > > have their table freed. This part might be optional.
> > > >
> > > > And it can have more types for supporting things like entry migrations
> > > > or virtual swapfile. The example formats above showed four types. Last
> > > > three or more bits can be used as a type indicator, as HAS_CACHE and
> > > > COUNT_CONTINUED will be gone.
> > >
> >
> > Hi Johannes
> >
> > > My understanding is that this would still tie the swap space to
> > > configured swapfiles. That aspect of the current design has more and
> > > more turned into a problem, because we now have several categories of
> > > swap entries that either permanently or for extended periods of time
> > > live in memory. Such entries should not occupy actual disk space.
> > >
> > > The oldest one is probably partially refaulted entries (where one out
> > > of N swapped page tables faults back in). We currently have to spend
> > > full pages of both memory AND disk space for these.
> > >
> > > The newest ones are zero-filled entries which are stored in a bitmap.
> > >
> > > Then there is zswap. You mention ghost swapfiles - I know some setups
> > > do this to use zswap purely for compression. But zswap is a writeback
> > > cache for real swapfiles primarily, and it is used as such. That means
> > > entries need to be able to move from the compressed pool to disk at
> > > some point, but might not for a long time. Tying the compressed pool
> > > size to disk space is hugely wasteful and an operational headache.
> > >
> > > So I think any future proof design for the swap allocator needs to
> > > decouple the virtual memory layer (page table count, swapcache, memcg
> > > linkage, shadow info) from the physical layer (swapfile slot).
> > >
> > > Can you touch on that concern?
> >
> > Yes, I fully understand your concern. The purpose of this swap table
> > design is to provide a base for building other parts, including
> > decoupling the virtual layer from the physical layer.
> >
> > The table entry can have different types, so virtual file/space can
> > leverage this too. For example the virtual layer can have something
> > like a "redirection entry" pointing to a physical device layer. Or
> > just a pointer to anything that could possibly be used (In the four
> > examples I provided one type is a pointer). A swap space will need
> > something to index its data.
> > We have already internally deployed a very similar solution for
> > multi-layer swapout, and it's working well, we are expecting to
> > upstreamly implement it and deprecate the downstream solution.
> >
> > Using an optional layer for doing so still consumes very little memory
> > (16 bytes per entry for two layers, and this might be doable just with
> > single layer), And there are setups that doesn't need a extra layer,
> > such setups can ignore that part and have only 8 bytes per entry,
> > having a very low overhead.
>
> IIUC with this design we still have a fixed-size swap space, but it's
> not directly tied to the physical swap layer (i.e. it can be backed with
> a swap slot on disk, zswap, zero-filled pages, etc). Did I get this
> right?
>
> In this case, using clusters to manage this should be an implementation
> detail that is not visible to userspace. Ideally the kernel would
> allocate more clusters dynamically as needed, and when a swap entry is
> being allocated in that cluster the kernel chooses the backing for that
> swap entry based on the available options.
>
> I see the benefit of managing things on the cluster level to reduce
> memory overhead (e.g. one lock per cluster vs. per entry), and to
> leverage existing code where it makes sense.

Yes, agreed. A cluster-based map means we can have many empty clusters
without consuming any pre-reserved map memory. And extending the
cluster array should be doable too.
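
A rough sketch of that lazy per-cluster table idea (hypothetical
helpers, locking omitted, not actual kernel code):

```c
#include <assert.h>
#include <stdlib.h>

#define SWAP_CLUSTER_ENTRIES 512

/* Simplified stand-in for struct swap_cluster_info; hypothetical only. */
struct cluster {
	unsigned int count;	/* number of used entries */
	void **table;		/* NULL while the cluster is empty */
};

/* Allocate the table only when the first entry of a cluster is used. */
static int cluster_set(struct cluster *ci, unsigned int off, void *val)
{
	if (!ci->table) {
		ci->table = calloc(SWAP_CLUSTER_ENTRIES, sizeof(void *));
		if (!ci->table)
			return -1;
	}
	if (!ci->table[off])
		ci->count++;
	ci->table[off] = val;
	return 0;
}

/* Free the table again once the cluster drains, so idle swap costs nothing. */
static void cluster_clear(struct cluster *ci, unsigned int off)
{
	if (!ci->table || !ci->table[off])
		return;
	ci->table[off] = NULL;
	if (--ci->count == 0) {
		free(ci->table);
		ci->table = NULL;
	}
}
```

Empty clusters carry only the small cluster struct; the 4KB table
exists only while the cluster has live entries.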

>
> However, what we should *not* do is have these clusters be tied to the
> disk swap space with the ability to redirect some entries to use
> something like zswap. This does not fix the problem Johannes is
> describing.

Yes, a virtual swap file can have its own swap space, which is indexed
by the cache / table, and can reuse all the existing logic. As long as
we don't dramatically change the kernel swapout path, adding a folio
to the swapcache seems a very reasonable way to avoid redundant IO and
to synchronize things upon swapin/swapout, reusing a lot of
infrastructure, even if that's a virtual file. For example, a current
busy-loop issue can be fixed just by leveraging the folio lock:
https://lore.kernel.org/lkml/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/

The virtual file/space can be decoupled from the lower device, but the
virtual file/space's table entry can still point to an underlying
physical SWAP device or some meta struct.
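
A minimal sketch of such a table entry (hypothetical names; the
pointer tagging is only for illustration, not a proposed kernel
layout): the virtual layer's entry either holds something directly or
redirects to a (device, offset) pair in the physical layer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor of a slot on an underlying physical device. */
struct phys_slot {
	int dev;		/* index of the physical swap device */
	unsigned long off;	/* slot offset within that device */
};

#define REDIRECT_TAG 1ul

/* Tag a pointer to a phys_slot so it is distinguishable in the table. */
static void *make_redirect(struct phys_slot *slot)
{
	return (void *)((uintptr_t)slot | REDIRECT_TAG);
}

static int entry_is_redirect(void *entry)
{
	return ((uintptr_t)entry & REDIRECT_TAG) != 0;
}

static struct phys_slot *entry_slot(void *entry)
{
	return (struct phys_slot *)((uintptr_t)entry & ~REDIRECT_TAG);
}
```

The low tag bit is free because heap and slab allocations are at least
word-aligned; untagged non-NULL values could keep meaning a cached
folio or shadow as in the original proposal.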



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 18:38       ` Kairui Song
@ 2025-02-04 19:09         ` Johannes Weiner
  2025-02-04 19:25           ` Kairui Song
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2025-02-04 19:09 UTC (permalink / raw)
  To: Kairui Song
  Cc: Yosry Ahmed, lsf-pc, linux-mm, Andrew Morton, Chris Li,
	Chengming Zhou, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Wed, Feb 05, 2025 at 02:38:39AM +0800, Kairui Song wrote:
> On Wed, Feb 5, 2025 at 2:11 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> > However, what we should *not* do is have these clusters be tied to the
> > disk swap space with the ability to redirect some entries to use
> > something like zswap. This does not fix the problem Johannes is
> > describing.
> 
> Yes, a virtual swap file can have its own swap space, which is indexed
> by the cache / table, and reuse all the logic. As long as we don't
> dramatically change the kernel swapout path, adding a folio to
> swapcache seems a very reasonable way to avoid redundant IO, and
> synchronize it upon swapin/swapout, and reusing a lot of
> infrastructure, even if that's a virtual file. For example a current
> busy loop issue can be just fixed by leveraging the folio lock:
> https://lore.kernel.org/lkml/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/
> 
> The virtual file/space can be decoupled from the lower device. But the
> virtual file/space's table entry can point to an underlying physical
> SWAP device or some meta struct.

It's a bit unclear to me still which level will use the struct
swap_cluster_info in the layered scenario.

Would it be the virtual address space, where ->table has tagged
pointers to resolve to swapcache/zeromap/zswap/swapfile?

Or would it be the swapfile space, where ->table resolves to disk
slots?

Or are you proposing to use the same struct on both levels, with
->table catering to different needs?

Keep in mind, in the virtualized case, it's the top layer that would
have to keep track of the page table count, the swapcache pointer and
likely the memcg linkage. That also means the physical layer could
likely be reduced to a single bit per entry - used or free.

I suppose void *table could also point to such a bitmap? But not sure
about the other members that would become redundant/unused.
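
As a rough illustration of the one-bit-per-slot idea (hypothetical,
simplified far beyond anything the kernel would actually use), the
physical layer's per-cluster state could shrink to a used/free bitmap:

```c
#include <assert.h>
#include <limits.h>

#define SWAP_CLUSTER_ENTRIES 512
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per physical slot: used or free. Hypothetical sketch only. */
struct slot_bitmap {
	unsigned long bits[SWAP_CLUSTER_ENTRIES / BITS_PER_WORD];
};

static void slot_set_used(struct slot_bitmap *map, unsigned int slot)
{
	map->bits[slot / BITS_PER_WORD] |= 1ul << (slot % BITS_PER_WORD);
}

static void slot_set_free(struct slot_bitmap *map, unsigned int slot)
{
	map->bits[slot / BITS_PER_WORD] &= ~(1ul << (slot % BITS_PER_WORD));
}

static int slot_is_used(const struct slot_bitmap *map, unsigned int slot)
{
	return !!(map->bits[slot / BITS_PER_WORD] &
		  (1ul << (slot % BITS_PER_WORD)));
}
```

That is 64 bytes per 512-entry cluster, versus 4KB for a full pointer
table, if the top layer keeps all the per-entry metadata.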



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 19:09         ` Johannes Weiner
@ 2025-02-04 19:25           ` Kairui Song
  0 siblings, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-02-04 19:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, lsf-pc, linux-mm, Andrew Morton, Chris Li,
	Chengming Zhou, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Wed, Feb 5, 2025 at 3:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Feb 05, 2025 at 02:38:39AM +0800, Kairui Song wrote:
> > On Wed, Feb 5, 2025 at 2:11 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> > > However, what we should *not* do is have these clusters be tied to the
> > > disk swap space with the ability to redirect some entries to use
> > > someting like zswap. This does not fix the problem Johannes is
> > > describing.
> >
> > Yes, a virtual swap file can have its own swap space, which is indexed
> > by the cache / table, and reuse all the logic. As long as we don't
> > dramatically change the kernel swapout path, adding a folio to
> > swapcache seems a very reasonable way to avoid redundant IO, and
> > synchronize it upon swapin/swapout, and reusing a lot of
> > infrastructure, even if that's a virtual file. For example a current
> > busy loop issue can be just fixed by leveraging the folio lock:
> > https://lore.kernel.org/lkml/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/
> >
> > The virtual file/space can be decoupled from the lower device. But the
> > virtual file/space's table entry can point to an underlying physical
> > SWAP device or some meta struct.
>
> It's a bit unclear to me still which level will use the struct
> swap_cluster_info in the layered scenario.
>
> Would it be the virtual address space, where ->table has tagged
> pointers to resolve to swapcache/zeromap/zswap/swapfile?
>
> Or would it be the swapfile space, where ->table resolves to disk
> slots?
>
> Or are you proposing to use the same struct on both levels, with
> ->table catering to different needs?

I was thinking about the first case: in the virtual address space,
->table[n] will resolve to an offset in a lower (physical) layer or
some other meta structure. But we can still reuse the same struct for
both layers; the table could be in dense mode for used clusters on the
lower layer (3 bytes per entry (memcg + count), or even 1 bit per
entry, depending on how we want to store info like memcg_id).

This also brings a nice side effect (feature): we can have multiple
swap files/devices, and if the upper one (virtual or not) is full, it
can fall back to using the lower one just fine.
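
A toy sketch of that fallback behavior (hypothetical structures, no
locking, not actual kernel code): try each layer in priority order and
move to the next one when a layer is full.

```c
#include <assert.h>

/* Hypothetical per-layer usage accounting. */
struct swap_layer {
	unsigned long used;
	unsigned long capacity;
};

/*
 * Return the index of the first layer with a free slot, consuming one
 * slot from it, or -1 if every layer is full.
 */
static int alloc_from_layers(struct swap_layer *layers, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (layers[i].used < layers[i].capacity) {
			layers[i].used++;
			return i;
		}
	}
	return -1;
}
```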

>
> Keep in mind, in the virtualized case, it's the top layer that would
> have to keep track of the page table count, the swapcache pointer and
> likely the memcg linkage. That also means the physical layer could
> likely be reduced to a single bit per entry - used or free.
>
> I suppose void *table could also point to such a bitmap? But not sure
> about the other members that would become redundant/unused.

That's very doable. I also wanted shmem to have a dense table, which
may also reduce each entry to a single bit.



* Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
  2025-02-04 11:44 [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator Kairui Song
  2025-02-04 16:24 ` Johannes Weiner
  2025-02-04 16:44 ` Yosry Ahmed
@ 2025-03-26  3:23 ` Kairui Song
  2 siblings, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-03-26  3:23 UTC (permalink / raw)
  To: lsf-pc, linux-mm
  Cc: Andrew Morton, Chris Li, Johannes Weiner, Chengming Zhou,
	Yosry Ahmed, Shakeel Butt, Hugh Dickins, Matthew Wilcox,
	Barry Song, Nhat Pham, Usama Arif, Ryan Roberts, Huang, Ying

On Tue, Feb 4, 2025 at 6:44 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi all, sorry for the late submission.
>
> Following previous work and topics with the SWAP allocator
> [1][2][3][4], this topic would propose a way to redesign and integrate
> multiple swap data into the swap allocator, which should be a
> future-proof design, achieving following benefits:
> - Even lower memory usage than the current design
> - Higher performance (Remove HAS_CACHE pin trampoline)
> - Dynamic allocation and growth support, further reducing idle memory usage
> - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> - More extensible, provide a clean bedrock for implementing things
> like discontinuous swapout, readahead based mTHP swapin and more.
>
> People have been complaining about the SWAP management subsystem [5].
> Many incremental workarounds and optimizations are added, but causes
> many other problems eg. [6][7][8][9] and making implementing new
> features more difficult. One reason is the current design almost has
> the minimal memory usage (1 byte swap map) with acceptable
> performance, so it's hard to beat with incremental changes. But
> actually as more code and features are added, there are already lots
> of duplicated parts. So I'm proposing this idea to overhaul whole SWAP
> slot management from a different aspect, as the following work on the
> SWAP allocator [2].
>
> Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> unifying swap data, we worked together to implement the short term
> solution first: The swap allocator was the bottleneck for performance
> and fragmentation issues. The new cluster allocator solved these
> issues, and turned the cluster into a basic swap management unit.
> It also removed slot cache freeing path, and I'll post another series
> soon to remove the slot cache allocation path, so folios will always
> interact with the SWAP allocator directly, preparing for this long
> term goal:
>
> A brief intro of the new design
> ===============================
>
> It will first be a drop-in replacement for swap cache, using a per
> cluster table to handle all things required for SWAP management.
> Compared to the previous attempt to unify swap cache [11], this will
> have lower overhead with more features achievable:
>
> struct swap_cluster_info {
> spinlock_t lock;
> u16 count;
> u8 flags;
> u8 order;
> + void *table; /* 512 entries */
> struct list_head list;
> };
>
> The table itself can have variants of format, but for basic usage,
> each void* could be in one of the following type:
>
> /*
>  * a NULL:    | -----------    0    ------------| - Empty slot
>  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swaped out
>  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
>  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> * SWAP_COUNT is still 8 bits.
>  */
>
> Clearly it can hold both cache and swap count. The shadow still has
> enough for distance (using 16M as buckets for 52 bit VA) or gen
> counting. For COUNT_CONTINUED, it can simply allocate another 512
> atomics for one cluster.
>
> The table is protected by ci->lock, which has little to none contention.
> It also gets rid of the "HAS_CACHE bit setting vs Cache Insert",
> "HAS_CACHE pin as trampoline" issue, deprecating SWP_SYNCHRONOUS_IO.
> And remove the "multiple smaller file in one bit swapfile" design.
>
> It will further remove the swap cgroup map. Cached folio (stored as
> PFN) or shadow can provide such info. Some careful audit and workflow
> redesign might be needed.
>
> Each entry will be 8 bytes, smaller than current (8 bytes cache) + (2
> bytes cgroup map) + (1 bytes SWAP Map) = 11 bytes.
>
> Shadow reclaim and high order storing are still doable too, by
> introducing dense cluster tables formats. We can even optimize it
> specially for shmem to have 1 bit per entry. And empty clusters can
> have their table freed. This part might be optional.
>
> And it can have more types for supporting things like entry migrations
> or virtual swapfile. The example formats above showed four types. Last
> three or more bits can be used as a type indicator, as HAS_CACHE and
> COUNT_CONTINUED will be gone.
>
> Issues
> ======
> There are unresolved problems or issues that may be worth getting some
> addressing:
> - Is workingset node reclaim really worth doing? We didn't do that
> until 5649d113ffce in 2023. Especially considering fragmentation of
> slab and the limited amount of SWAP compared to file cache.
> - Userspace API change? This new design will allow dynamic growth of
> swap size (especially for non physical devices like ZRAM or a
> virtual/ghost swapfile). This may be worth thinking about how to be
> used.
> - Advanced usage and extensions for issues like "Swap Min Order",
> "Discontinuous swapout". For example the "Swap Min Order" issue might
> be solvable by allocating only specific order using the new cluster
> allocator, then having an abstract / virtual file as a batch layer.
> This layer may use some "redirection entries" in its table, with a
> very low overhead and be optional in real world usage. Details are yet
> to be decided.
> - Noticed that this will allow all swapin to no longer bypass swap
> cache (just like previous series) with better performance. This may
> provide an opportunity to implement a tunable readahead based large
> folio swapin. [12]
>
> [1] https://lwn.net/Articles/974587/
> [2] https://lpc.events/event/18/contributions/1769/
> [3] https://lwn.net/Articles/984090/
> [4] https://lwn.net/Articles/1005081/
> [5] https://lwn.net/Articles/932077/
> [6] https://lore.kernel.org/linux-mm/20240206182559.32264-1-ryncsn@gmail.com/
> [7] https://lore.kernel.org/lkml/20240324210447.956973-1-hannes@cmpxchg.org/
> [8] https://lore.kernel.org/lkml/20240926211936.75373-1-21cnbao@gmail.com/
> [9] https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> [10] https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/
> [11] https://lwn.net/Articles/966845/
> [12] https://lore.kernel.org/lkml/874j7zfqkk.fsf@yhuang6-desk2.ccr.corp.intel.com/

Hi all,

Here is the slides presented today:
https://drive.google.com/file/d/1_QKlXErUkQ-TXmJJy79fJoLPui9TGK1S/view?usp=sharing



end of thread, other threads:[~2025-03-26  3:24 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-04 11:44 [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator Kairui Song
2025-02-04 16:24 ` Johannes Weiner
2025-02-04 16:46   ` Kairui Song
2025-02-04 18:11     ` Yosry Ahmed
2025-02-04 18:38       ` Kairui Song
2025-02-04 19:09         ` Johannes Weiner
2025-02-04 19:25           ` Kairui Song
2025-02-04 16:44 ` Yosry Ahmed
2025-02-04 16:56   ` Kairui Song
2025-03-26  3:23 ` Kairui Song
