linux-mm.kvack.org archive mirror
From: Kairui Song <ryncsn@gmail.com>
To: lsf-pc@lists.linux-foundation.org, linux-mm <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Chris Li <chrisl@kernel.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	 Yosry Ahmed <yosryahmed@google.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	 Hugh Dickins <hughd@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Barry Song <21cnbao@gmail.com>,  Nhat Pham <nphamcs@gmail.com>,
	Usama Arif <usamaarif642@gmail.com>,
	 Ryan Roberts <ryan.roberts@arm.com>,
	"Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [LSF/MM/BPF TOPIC] Integrate Swap Cache, Swap Maps with Swap Allocator
Date: Tue, 25 Mar 2025 23:23:51 -0400	[thread overview]
Message-ID: <CAMgjq7BaT8PBc-5m=zRCuYotxU5gE01JSazL7+=Pe+y_qnM-+w@mail.gmail.com> (raw)
In-Reply-To: <CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com>

On Tue, Feb 4, 2025 at 6:44 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi all, sorry for the late submission.
>
> Following previous work and topics on the SWAP allocator
> [1][2][3][4], this topic proposes a way to redesign and integrate
> multiple pieces of swap metadata into the swap allocator. The new
> design should be future-proof and achieve the following benefits:
> - Even lower memory usage than the current design
> - Higher performance (Remove HAS_CACHE pin trampoline)
> - Dynamic allocation and growth support, further reducing idle memory usage
> - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> - More extensible, provide a clean bedrock for implementing things
> like discontinuous swapout, readahead based mTHP swapin and more.
>
> People have been complaining about the SWAP management subsystem [5].
> Many incremental workarounds and optimizations have been added, but
> they cause other problems, e.g. [6][7][8][9], and make implementing
> new features more difficult. One reason is that the current design
> already has close to minimal memory usage (a 1-byte swap map) with
> acceptable performance, so it is hard to beat with incremental
> changes. But as more code and features are added, there are already
> lots of duplicated parts. So I'm proposing this idea to overhaul the
> whole SWAP slot management from a different angle, as the follow-up
> work on the SWAP allocator [2].
>
> Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> unifying swap data, and we worked together to implement the short-term
> solution first: the swap allocator was the bottleneck for performance
> and fragmentation issues. The new cluster allocator solved these
> issues and turned the cluster into a basic swap management unit.
> It also removed the slot cache freeing path, and I'll post another
> series soon to remove the slot cache allocation path, so folios will
> always interact with the SWAP allocator directly, preparing for this
> long-term goal:
>
> A brief intro of the new design
> ===============================
>
> It will first be a drop-in replacement for the swap cache, using a
> per-cluster table to handle everything required for SWAP management.
> Compared to the previous attempt to unify the swap cache [11], this
> will have lower overhead with more features achievable:
>
> struct swap_cluster_info {
> 	spinlock_t lock;
> 	u16 count;
> 	u8 flags;
> 	u8 order;
> +	void *table; /* 512 entries */
> 	struct list_head list;
> };
>
> The table itself can have variant formats, but for basic usage,
> each void * could be one of the following types:
>
> /*
>  * a NULL:    | -----------    0    ------------| - Empty slot
>  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
>  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
>  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
>  *
>  * SWAP_COUNT is still 8 bits.
>  */
>
> Clearly it can hold both the cache and the swap count. The shadow
> still has enough bits for distance (using 16M as buckets for 52-bit
> VA) or generation counting. For COUNT_CONTINUED, it can simply
> allocate another 512 atomics for the cluster.
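The entry layout quoted above can be sketched as plain C tag-bit helpers. To be clear, everything below is illustrative: the proposal only fixes the low bits as a type tag and the top 8 bits as SWAP_COUNT, so the exact shifts, masks, and helper names here are my own assumptions, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoding of one 8-byte cluster table entry. */
enum slot_type { SLOT_EMPTY, SLOT_SHADOW, SLOT_PFN, SLOT_POINTER };

#define SLOT_COUNT_SHIFT	56	/* top 8 bits: SWAP_COUNT */
#define SLOT_TAG_PFN		0x2ULL	/* low bits ...X10: cached */
#define SLOT_PAYLOAD_MASK	((1ULL << 53) - 1)

static enum slot_type slot_type_of(uint64_t e)
{
	if (!e)
		return SLOT_EMPTY;	/* NULL: empty slot */
	if (e & 0x1)
		return SLOT_SHADOW;	/* ...XX1: swapped out */
	if (e & 0x2)
		return SLOT_PFN;	/* ...X10: cached */
	return SLOT_POINTER;		/* ...100: reserved */
}

/* Swap count lives in the top 8 bits for shadow and PFN entries. */
static unsigned int slot_swap_count(uint64_t e)
{
	return e >> SLOT_COUNT_SHIFT;
}

/* Pack a PFN entry: count on top, PFN in the middle, tag at the bottom. */
static uint64_t make_pfn_slot(uint64_t pfn, unsigned int count)
{
	return ((uint64_t)count << SLOT_COUNT_SHIFT) | (pfn << 3) | SLOT_TAG_PFN;
}

static uint64_t slot_pfn(uint64_t e)
{
	return (e >> 3) & SLOT_PAYLOAD_MASK;
}
```

Checking a single tag bit first (bit 0, then bit 1) is what lets three-plus types coexist in the low bits once HAS_CACHE and COUNT_CONTINUED no longer need dedicated flags.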
>
> The table is protected by ci->lock, which has little to no contention.
> It also gets rid of the "HAS_CACHE bit setting vs cache insert" and
> "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO,
> and removes the "multiple smaller files in one big swapfile" design.
>
> It will further remove the swap cgroup map: a cached folio (stored as
> a PFN) or the shadow can provide that info. Some careful audit and
> workflow redesign might be needed.
>
> Each entry will be 8 bytes, smaller than the current (8-byte cache) +
> (2-byte cgroup map) + (1-byte swap map) = 11 bytes.
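Per 512-entry cluster, the comparison above works out to a 1.5 KiB saving. A quick check of the arithmetic (the per-entry costs are from the quoted text; the 512-entry cluster size is from the new allocator, and the 8-byte pointer assumes a 64-bit kernel):

```c
#include <assert.h>
#include <stddef.h>

/* A swap cluster manages 512 slots. */
#define CLUSTER_ENTRIES 512

/* Current design, per entry: 8-byte swap cache slot + 2-byte swap
 * cgroup map + 1-byte swap map = 11 bytes. */
static const size_t old_per_entry = 8 + 2 + 1;

/* Proposed design, per entry: one table pointer (8 bytes on 64-bit). */
static const size_t new_per_entry = 8;

static size_t cluster_bytes(size_t per_entry)
{
	return per_entry * CLUSTER_ENTRIES;
}
```

So a full cluster goes from 5632 to 4096 bytes of metadata, before counting the dense formats or freed tables of empty clusters mentioned below.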
>
> Shadow reclaim and high-order storing are still doable too, by
> introducing dense cluster table formats. We can even optimize it
> specially for shmem to use 1 bit per entry. And empty clusters can
> have their tables freed. This part might be optional.
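To illustrate the storage side of the 1-bit-per-entry shmem idea above: a cluster's table would shrink from 4096 bytes of pointers to a 64-byte bitmap. The structure and helper names below are my own sketch, not part of the proposal:

```c
#include <assert.h>
#include <string.h>

#define CLUSTER_ENTRIES 512

/* Hypothetical dense table format: 1 bit per entry, so one cluster's
 * table fits in 512 / 8 = 64 bytes instead of 512 * 8 = 4096. */
struct dense_table {
	unsigned char bits[CLUSTER_ENTRIES / 8];
};

static void dense_set(struct dense_table *t, unsigned int i)
{
	t->bits[i / 8] |= 1u << (i % 8);
}

static int dense_test(const struct dense_table *t, unsigned int i)
{
	return (t->bits[i / 8] >> (i % 8)) & 1;
}
```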
>
> And it can have more types for supporting things like entry migration
> or a virtual swapfile. The example formats above showed four types.
> The last three or more bits can be used as a type indicator, since
> HAS_CACHE and COUNT_CONTINUED will be gone.
>
> Issues
> ======
> There are unresolved problems or issues that may be worth
> addressing:
> - Is workingset node reclaim really worth doing? We didn't do that
> until 5649d113ffce in 2023. Especially considering fragmentation of
> slab and the limited amount of SWAP compared to file cache.
> - Userspace API changes? This new design will allow dynamic growth of
> swap size (especially for non-physical devices like ZRAM or a
> virtual/ghost swapfile). It may be worth thinking about how this
> should be exposed and used.
> - Advanced usage and extensions for issues like "Swap Min Order" and
> "Discontinuous swapout". For example, the "Swap Min Order" issue might
> be solvable by allocating only a specific order using the new cluster
> allocator, then having an abstract / virtual file as a batch layer.
> This layer may use some "redirection entries" in its table, with
> very low overhead, and be optional in real-world usage. Details are
> yet to be decided.
> - Note that this will allow all swapins to no longer bypass the swap
> cache (just like the previous series) with better performance. This
> may provide an opportunity to implement a tunable readahead-based
> large folio swapin. [12]
>
> [1] https://lwn.net/Articles/974587/
> [2] https://lpc.events/event/18/contributions/1769/
> [3] https://lwn.net/Articles/984090/
> [4] https://lwn.net/Articles/1005081/
> [5] https://lwn.net/Articles/932077/
> [6] https://lore.kernel.org/linux-mm/20240206182559.32264-1-ryncsn@gmail.com/
> [7] https://lore.kernel.org/lkml/20240324210447.956973-1-hannes@cmpxchg.org/
> [8] https://lore.kernel.org/lkml/20240926211936.75373-1-21cnbao@gmail.com/
> [9] https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
> [10] https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/
> [11] https://lwn.net/Articles/966845/
> [12] https://lore.kernel.org/lkml/874j7zfqkk.fsf@yhuang6-desk2.ccr.corp.intel.com/

Hi all,

Here are the slides presented today:
https://drive.google.com/file/d/1_QKlXErUkQ-TXmJJy79fJoLPui9TGK1S/view?usp=sharing



Thread overview: 10+ messages
2025-02-04 11:44 Kairui Song
2025-02-04 16:24 ` Johannes Weiner
2025-02-04 16:46   ` Kairui Song
2025-02-04 18:11     ` Yosry Ahmed
2025-02-04 18:38       ` Kairui Song
2025-02-04 19:09         ` Johannes Weiner
2025-02-04 19:25           ` Kairui Song
2025-02-04 16:44 ` Yosry Ahmed
2025-02-04 16:56   ` Kairui Song
2025-03-26  3:23 ` Kairui Song [this message]
