From: Jan Kara <jack@suse.cz>
To: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>, linux-mm <linux-mm@kvack.org>,
lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
21cnbao@gmail.com, david@redhat.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
Date: Thu, 7 Mar 2024 15:03:44 +0100 [thread overview]
Message-ID: <20240307140344.4wlumk6zxustylh6@quack3> (raw)
In-Reply-To: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com>
On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
>
> On 2024/3/1 17:24, Chris Li wrote:
> > In last year's LSF/MM I talked about a VFS-like swap system. That is
> > the pony that was chosen.
> > However, I did not have much chance to go into details.
> >
> > This year, I would like to discuss what it takes to re-architect the
> > whole swap back end from scratch?
> >
> > Let’s start from the requirements for the swap back end.
> >
> > 1) support the existing swap usage (not the implementation).
> >
> > Some other design goals:
> >
> > 2) low per swap entry memory usage.
> >
> > 3) low io latency.
> >
> > What are the functions the swap system needs to support?
> >
> > At the device level, the swap system needs to support a list of swap
> > devices with a priority order. Devices of the same priority are
> > written to round robin. Swap device types include zswap, zram, SSD,
> > spinning hard disk, and a swap file in a file system.
> >
> > At the swap entry level, here is the list of existing swap entry usage:
> >
> > * Swap entry allocation and freeing. Each swap entry needs to be
> > associated with a location in the swapfile's disk space (the offset
> > of the swap entry).
> > * Each swap entry needs to track its map count. (swap_map)
> > * Each swap entry needs to be able to find its associated memory
> > cgroup. (swap_cgroup_ctrl->map)
> > * Swap cache: look up the folio/shadow from a swap entry.
> > * Swap page writes through a swapfile in a file system rather than a
> > raw block device. (swap_extent)
> > * Shadow entries. (stored in the swap cache)
> >
> > Any new swap back end might have a different internal implementation,
> > but it needs to support all of the above usage. For example, using an
> > existing file system as the swap backend, with a per-VMA or
> > per-swap-entry mapping to a file, would require additional data
> > structures to track swap_cgroup_ctrl, on top of the size of the file
> > inode. It would be challenging to meet design goals 2) and 3) using
> > another file system as-is.
> >
> > I am considering grouping the different per-swap-entry data into one
> > struct and allocating it dynamically, so there is no upfront
> > allocation of swap_map.
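To make the grouped-struct idea concrete, here is a toy userspace sketch of what such a dynamically allocated per-entry descriptor might look like; the struct, field, and function names are all hypothetical illustrations, not existing kernel code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical descriptor grouping the per-entry data that today lives
 * in separate structures (swap_map, swap_cgroup_ctrl->map, swap cache). */
struct swap_desc {
	unsigned char  map_count;  /* stands in for the swap_map[] byte */
	unsigned short cgroup_id;  /* stands in for the swap_cgroup map slot */
	void          *shadow;     /* swap-cache shadow entry or folio */
};

/* Allocated only when an entry actually goes into use, so there is no
 * upfront maxpages-sized array allocated at swapon time. */
static struct swap_desc *swap_desc_alloc(unsigned short cgroup_id)
{
	struct swap_desc *d = calloc(1, sizeof(*d));

	if (d) {
		d->map_count = 1;
		d->cgroup_id = cgroup_id;
	}
	return d;
}
```

The trade-off is an allocation on the swap-out path in exchange for memory usage proportional to entries in use rather than to device size.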
> >
> > For swap entry allocation: the current kernel supports swapping out
> > 0-order or PMD-order pages.
> >
> > There are some discussions and patches that add swap out for folio
> > sizes in between (mTHP):
> >
> > https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> >
> > and swap in for mTHP:
> >
> > https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
> >
> > The introduction of swapping out different orders of pages will
> > further complicate the swap entry fragmentation issue. The swap back
> > end has no way to predict the life cycle of the swap entries.
> > Repeatedly allocating and freeing swap entries of different sizes
> > will fragment the swap entry array. If we can’t allocate contiguous
> > swap entries for an mTHP, we will have to split the mTHP to a smaller
> > size to perform the swap out and in.
> >
> > Current swap only supports 4K pages or PMD-size pages. Adding the
> > other in-between sizes greatly increases the chance of fragmenting
> > the swap entry space. When there are no more contiguous swap entries
> > for an mTHP, it forces the mTHP to split into 4K pages. If we don’t
> > solve the fragmentation issue, it will be a constant source of mTHP
> > splits.
> >
> > Another limitation I would like to address is that swap_writepage can
> > only write out IO in one contiguous chunk; it cannot perform
> > non-contiguous IO. When the swapfile is close to full, the unused
> > entries are likely spread across different locations. It would be
> > nice to be able to read and write large folios using discontiguous
> > disk IO locations.
> >
> > Some possible ideas for the fragmentation issue:
> >
> > a) A buddy allocator for swap entries, similar to the buddy allocator
> > for memory. A buddy allocator for swap entries would keep low-order
> > allocations from fragmenting too much of the high-order swap entry
> > space. It should greatly reduce the fragmentation caused by
> > allocating and freeing swap entries of different sizes. However, the
> > buddy allocator has its own limits as well. Unlike system memory,
> > which we can move and compact, there is no rmap for swap entries, so
> > it is much harder to move a swap entry to another disk location. So a
> > buddy allocator for swap will help, but will not solve all the
> > fragmentation issues.
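As a toy illustration of option a), here is a minimal userspace binary-buddy allocator over a 64-slot swap area. The names and sizes are made up for the sketch; a real implementation would need locking, per-device state, and much larger bitmaps:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ORDER 6	/* toy swap area of 2^6 = 64 slots */

/* free_mask[o] bit i set => a free block of 2^o slots starts at slot i<<o */
static uint64_t free_mask[MAX_ORDER + 1];

static void buddy_init(void)
{
	memset(free_mask, 0, sizeof(free_mask));
	free_mask[MAX_ORDER] = 1;	/* one free block covering everything */
}

/* Allocate 2^order contiguous slots; returns slot offset, or -1 if none. */
static int buddy_alloc(int order)
{
	int o = order, idx;

	while (o <= MAX_ORDER && free_mask[o] == 0)
		o++;			/* find smallest free block that fits */
	if (o > MAX_ORDER)
		return -1;
	idx = __builtin_ctzll(free_mask[o]);
	free_mask[o] &= ~(1ull << idx);
	while (o > order) {		/* split, keeping the lower half */
		o--;
		idx <<= 1;
		free_mask[o] |= 1ull << (idx + 1);
	}
	return idx << order;
}

/* Free 2^order slots at offset, merging with free buddies on the way up. */
static void buddy_free(int offset, int order)
{
	int idx = offset >> order;

	while (order < MAX_ORDER && (free_mask[order] & (1ull << (idx ^ 1)))) {
		free_mask[order] &= ~(1ull << (idx ^ 1));
		idx >>= 1;
		order++;
	}
	free_mask[order] |= 1ull << idx;
}
```

Freeing two 4-slot buddies merges them back into one 8-slot block, which is exactly the property that limits (but does not eliminate) fragmentation across mixed-order allocations.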
> I have an idea here😁
>
> Each swap device is divided into multiple chunks, and each chunk is
> dedicated to one allocation order (the order of the swapped-out
> folio; each chunk serves only that one order).
> This solves the fragmentation problem, is much simpler than a buddy
> allocator, and is easier to implement, while still supporting
> multiple sizes, similar to a small slab allocator.
>
> 1) Add structure members
> In the swap_info_struct structure, we only need to add an offset array
> recording the next search offset for each order.
> e.g.:
>
> #define MTHP_NR_ORDER 9
>
> struct swap_info_struct {
> 	...
> 	long order_off[MTHP_NR_ORDER];
> 	...
> };
>
> Note: order_off == -1 indicates that the order is not supported.
>
> 2) Initialize
> Set the proportion of the swap device occupied by each order.
> For the sake of simplicity, assume there are 8 orders.
> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
> (maxpages is the maximum number of available slots in the current
> swap device).
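The quoted per-order chunk scheme might be sketched in userspace like this (hypothetical names; a real implementation would keep a free list per chunk rather than the bump pointer used here, which never reclaims freed slots):

```c
#include <assert.h>

/* Toy model: carve a 64-slot swap area into equal chunks, one chunk
 * per supported order. Illustrative only, not swap_info_struct fields. */
#define NR_ORDERS 4
#define MAXPAGES  64
#define CHUNK     (MAXPAGES / NR_ORDERS)	/* slots per order's chunk */

/* Next free slot in each order's chunk; -1 would mean unsupported. */
static long order_off[NR_ORDERS];

static void chunks_init(void)
{
	for (int o = 0; o < NR_ORDERS; o++)
		order_off[o] = (long)o * CHUNK;	/* chunk o: [o*CHUNK, (o+1)*CHUNK) */
}

/* Bump-allocate 2^order slots from that order's chunk. Returns the slot
 * offset, or -1 when the chunk is exhausted, even if other chunks still
 * have free space. */
static long chunk_alloc(int order)
{
	long need = 1L << order;
	long end  = ((long)order + 1) * CHUNK;
	long off;

	if (order < 0 || order >= NR_ORDERS || order_off[order] + need > end)
		return -1;
	off = order_off[order];
	order_off[order] += need;
	return off;
}
```

The sketch makes the cost of static partitioning visible: allocations of one order can fail while plenty of space remains in the other orders' chunks.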
Well, but then if you fill up the space of a particular order and need to
swap out a page of that order, what do you do? Return ENOSPC prematurely?
Frankly, as I'm reading the discussions here, it seems to me you are trying
to reinvent a lot of things from the filesystem space :) Like block
allocation with reasonably efficient fragmentation prevention, transparent
data compression (zswap), hierarchical storage management (i.e., moving
data between different backing stores), and an efficient way to get from
VMA+offset to the place on disk where the content is stored. Sure, you
still don't need a lot of things modern filesystems do, like permissions,
directory structure (or even more complex namespacing stuff), all the
machinery for achieving fs consistency after a crash, etc. But still, what
you need is a notable portion of what filesystems do.
So maybe it is time to implement swap as a proper filesystem? Or even
better, we could think about factoring these bits out of some existing
filesystem to share code?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR