Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

From: Chris Li <chrisl@kernel.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Karim Manaouil <kmanaouil.dev@gmail.com>, Jan Kara <jack@suse.cz>,
	 Chuanhua Han <hanchuanhua@oppo.com>,
	linux-mm <linux-mm@kvack.org>,
	 lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
	21cnbao@gmail.com,  david@redhat.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
Date: Tue, 28 May 2024 23:50:47 -0700
Message-ID: <CAF8kJuN7QfHM+fdqzmRHrYaLBeeBh1QNQPGZvzX1XFOXhV-6pw@mail.gmail.com>
In-Reply-To: <ZlanrUntADvnJWUY@casper.infradead.org>

On Tue, May 28, 2024 at 8:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 21, 2024 at 01:40:56PM -0700, Chris Li wrote:
> > > Filesystems already implemented a lot of solutions for fragmentation
> > > avoidance that are more appropriate for slow storage media.
> >
> > Swap and file systems have very different requirements and usage
> > patterns and IO patterns.
>
> Should they, though?  Filesystems noticed that handling pages in LRU
> order was inefficient and so they stopped doing that (see the removal
> of aops->writepage in favour of ->writepages, along with where each is
> called from).  Maybe it's time for swap to start doing writes in the order
> of virtual addresses within a VMA, instead of LRU order.

Well, swap has one fundamental difference from a file system: dirty
file system cache must eventually be written to its backing file at
least once, otherwise you lose the data when the machine reboots.

In the anonymous memory case, a dirty page does not have to be written
to swap. It is optional, so which page you choose to swap out is
critical: you want to swap out the coldest page, the page that is
least likely to be swapped back in. Therefore, the LRU makes sense.

With VMA-based swap out, the question is: which VMA do you choose
first? To make things more complicated, the same page can be mapped
into different processes, in more than one VMA.

> Indeed, if we're open to radical ideas, the LRU sucks.  A physical scan
> is 40x faster:
> https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@casper.infradead.org/

That simulation assumes the page struct already has the access
information. At the physical CPU level, the accessed bit lives in the
PTE. If you scan in physical page order, you need to use rmap to find
the PTEs and check the accessed bit. It is not a simple walk over pages
in pfn order. You need to scan the PTEs first, then move the access
information into the page struct.
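
To illustrate the point, here is a toy userspace model, not kernel
code. Every name in it (toy_pte, toy_page, toy_page_referenced,
toy_scan_pfn_order) is made up for this sketch, and the fixed-size rmap
array is a simplifying assumption. It only shows that a pfn-order scan
has to chase each page's reverse mappings to read the accessed bits
before it can record anything in the page struct:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MAPPINGS 4

/* Toy PTE: only the hardware-maintained accessed bit is modelled. */
struct toy_pte {
    bool accessed;
};

/* Toy page struct with a fixed-size reverse map to its PTEs. */
struct toy_page {
    struct toy_pte *rmap[TOY_MAX_MAPPINGS];
    size_t nr_mappings;
    bool referenced;    /* access info copied out of the PTEs */
};

/*
 * Walk the reverse map of one page, test and clear every accessed
 * bit, and cache the result in the page struct.
 */
static bool toy_page_referenced(struct toy_page *page)
{
    bool young = false;

    assert(page->nr_mappings <= TOY_MAX_MAPPINGS);
    for (size_t i = 0; i < page->nr_mappings; i++) {
        if (page->rmap[i]->accessed) {
            young = true;
            page->rmap[i]->accessed = false;
        }
    }
    page->referenced = young;
    return young;
}

/*
 * Scan pages in pfn order and return how many were not referenced.
 * Each step costs one rmap walk, not one page struct read.
 */
static size_t toy_scan_pfn_order(struct toy_page *pages, size_t nr_pages)
{
    size_t cold = 0;

    for (size_t pfn = 0; pfn < nr_pages; pfn++)
        if (!toy_page_referenced(&pages[pfn]))
            cold++;
    return cold;
}
```

A page mapped by two PTEs, as in the shared-page case above, costs two
PTE visits per scan step in this model.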

>
> > One challenging aspect is that the current swap back end has a very
> > low per swap entry memory overhead. It is about 1 byte (swap_map), 2
> > bytes (swap cgroup), 8 bytes (swap cache pointer). The inode struct is
> > more than 64 bytes per file. That is a big jump if you map a swap
> > entry to a file. If you map more than one swap entry to a file, then
> > you need to track the mapping of file offset to swap entry, and the
> > reverse lookup of swap entry to a file with offset. Whichever way you
> > cut it, it will significantly increase the per swap entry memory
> > overhead.
>
> Not necessarily, no.  If your workload uses a lot of order-2, order-4
> and order-9 folios, then the current scheme is using 11 bytes per page,
> so 44 bytes per order-2 folio, 176 per order-4 folio and 5632 per
> order-9 folio.  That's a lot of bytes we can use for an extent-based
> scheme.

Yes, if we allow dynamic allocation of swap entries (the 24B option).
Then the sub-entries inside a compound swap entry structure can share
the same compound swap struct pointer.
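
For reference, the arithmetic quoted above can be checked with a few
lines of C. The 1 + 2 + 8 byte split is the one from this thread. The
enum names and the helper folio_swap_overhead() are made up for this
sketch and do not exist in the kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Per swap entry costs of the current back end, as quoted in this thread. */
enum {
    SWAP_MAP_BYTES    = 1,  /* swap_map */
    SWAP_CGROUP_BYTES = 2,  /* swap cgroup */
    SWAP_CACHE_BYTES  = 8,  /* swap cache pointer */
    PER_ENTRY_BYTES   = SWAP_MAP_BYTES + SWAP_CGROUP_BYTES + SWAP_CACHE_BYTES,
};

/*
 * Hypothetical helper: metadata bytes the current scheme spends on one
 * folio of the given order, i.e. 11 bytes times 2^order pages.
 */
static size_t folio_swap_overhead(unsigned int order)
{
    assert(order < 20);     /* arbitrary sanity bound for this sketch */
    return (size_t)PER_ENTRY_BYTES << order;
}
```

folio_swap_overhead(2), folio_swap_overhead(4) and
folio_swap_overhead(9) give 44, 176 and 5632 bytes, which is the budget
a per-folio structure would have to stay under to break even.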

>
> Also, why would you compare the size of an inode to the size of a
> swap entry?  inode is ~equivalent to an anon_vma, not to a swap entry.

I am not assigning an inode to one swap entry. That is covered in my
description of "if you map more than one swap entry to a file". If you
want to map each inode to an anon_vma, you need a way to map an inode
and file offset into the swap entry encoding. In your anon_vma-as-inode
world, how do you deal with two different VMAs containing the same
page? Once we have more detail on the swap entry mapping scheme, we can
analyse the pros and cons.

Chris