From: Chris Li <chrisl@kernel.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Karim Manaouil <kmanaouil.dev@gmail.com>, Jan Kara <jack@suse.cz>,
Chuanhua Han <hanchuanhua@oppo.com>,
linux-mm <linux-mm@kvack.org>,
lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
21cnbao@gmail.com, david@redhat.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
Date: Fri, 31 May 2024 17:43:00 -0700
Message-ID: <CAF8kJuNGcNiUk8DjBkHdjFe+zuq5u7i2xkUVF7Rt+kxBPK7scg@mail.gmail.com>
In-Reply-To: <ZllAJbLaYGQkrPyV@casper.infradead.org>
On Thu, May 30, 2024 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, May 30, 2024 at 03:53:49PM -0700, Chris Li wrote:
> > On Wed, May 29, 2024 at 5:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > Whereas in the anonymous memory case, a dirty page does not have to be
> > > > written to swap. It is optional, so which page you choose to swap out is
> > > > critical: you want to swap out the coldest page, the page that is
> > > > least likely to be swapped in. Therefore, the LRU makes sense.
> > >
> > > Disagree. There are two things you want and the LRU serves neither
> > > particularly well. One is that when you want to reclaim memory, you
> > > want to find some memory that is likely to not be accessed in the next
> > > few seconds/minutes/hours. It doesn't need to be the coldest, just in
> > > (say) the coldest 10% or so of memory. And it needs to already be clean,
> > > otherwise you have to wait for it to writeback, and you can't afford that.
> >
> > Do you disagree that LRU is necessary or the way we use the LRU?
>
> I think we should switch to a scheme where we just don't use an LRU at
> all.
I would love to hear more details on how to achieve that. Can you elaborate?
>
> > In order to get the coldest 10% or so pages, assume you still need to
> > maintain an LRU, no?
>
> I don't think that's true. If you reframe the problem as "we need to
> find some of the coldest pages in the system", then you can use a
> different scheme.
If you have a way to do reclaim without using an LRU at all, that
would be something that could replace both the traditional LRU and MGLRU.
However, "we need to find some of the coldest pages in the system" is not
enough for anonymous memory.
You want to find and reclaim the coldest memory first; if that is
not enough, you need to reclaim the second-coldest memory, and so on. The
threshold is a moving target that depends on the memory pressure.
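
To illustrate the "moving target" point, here is a toy sketch, not kernel code. The per-page idle age and both function names are hypothetical; the only thing it is meant to show is that the coldness cutoff is an output of how much memory pressure demands, not a fixed "coldest 10%".

```python
def reclaim_coldest_first(idle_ages, pages_needed):
    """Return indices of the pages to reclaim, coldest first.

    idle_ages: hypothetical idle time per page (larger means colder).
    pages_needed: how many pages the current memory pressure requires.
    """
    # Order page indices from coldest to hottest.
    order = sorted(range(len(idle_ages)), key=lambda i: idle_ages[i], reverse=True)
    return order[:pages_needed]


def coldness_threshold(idle_ages, pages_needed):
    """Idle age of the hottest page that had to be reclaimed, or None."""
    victims = reclaim_coldest_first(idle_ages, pages_needed)
    return min(idle_ages[i] for i in victims) if victims else None
```

With idle ages `[5, 90, 30, 70, 10]`, asking for two pages stops at the page with idle age 70, while asking for four pages pushes the cutoff down to idle age 10.
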
>
> > > The second thing you need to be able to do is find pages which are
> > > already dirty, and not likely to be written to soon, and write those
> > > back so they join the pool of clean pages which are eligible for reclaim.
> > > Again, the LRU isn't really the best tool for the job.
> >
> > It seems you need the LRU to find which pages qualify for writeback.
> > They should be both dirty and cold.
> >
> > The question is, can you do the reclaim writeback without an LRU for
> > anonymous pages?
> > If the LRU is unavoidable, then it is a necessary evil.
>
> The point I was trying to make is that a simple physical scan is 40x
> faster. So if you just scan N pages, starting from wherever you left
> off the scan last time, and even 1/10 of them are eligible for
> reclaiming (not referenced since last time the clock hand swept past it,
> perhaps), you're still reclaiming 4x as many pages as doing an LRU scan.
I feel that I am missing something. In your 40x faster scan, do you
still scan the page table entries (PTEs) for the accessed bit or not?
If not, I fail to see how you can get the dirty information in the
first place. Unmapping a page can give you that information, but at a
very high price.
If yes, then your scan order is not physical anyway: you need to find
the PTE location and scan that, and it is not going to be in pfn
order.
Also, when reclaiming for a cgroup, you want to scan memory that
belongs to that cgroup. The pages used by the cgroup will be all
over the place, so you wouldn't be doing a linear pfn scan,
unless you are willing to scan a lot of pages that do not belong to the
cgroup. The CPU prefetching and caching that contribute to that 40x
speedup would go out the window.
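
As a back-of-the-envelope check of the numbers in this exchange, here is a toy model. The function names and the `cgroup_fraction` parameter are made up for illustration, and it assumes the LRU scan reclaims every page it visits, which is what the "4x" figure implies. It is also optimistic for the physical scan, because it keeps the full 40x speedup even when most scanned pages belong to other cgroups.

```python
def relative_reclaim_yield(scan_speedup, eligible_fraction, cgroup_fraction=1.0):
    """Pages reclaimed per unit time, relative to an LRU scan (yield 1.0).

    scan_speedup: how many times more pages the physical scan visits.
    eligible_fraction: share of visited pages that are reclaimable.
    cgroup_fraction: share of visited pages that belong to the target
        cgroup (hypothetical parameter; 1.0 means global reclaim).
    """
    return scan_speedup * eligible_fraction * cgroup_fraction


def break_even_cgroup_fraction(scan_speedup, eligible_fraction):
    """Cgroup share below which the physical scan loses to the LRU scan."""
    return 1.0 / (scan_speedup * eligible_fraction)
```

With the figures quoted above (40x, 1/10 eligible), global reclaim comes out at about 4x, but in this model the advantage disappears once the target cgroup owns less than a quarter of the scanned memory.
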
>
> > > > In VMA swap out, the question is, which VMA you choose from first? To
> > > > make things more complicated, the same page can map into different
> > > > processes in more than one VMA as well.
> > >
> > > This is why we have the anon_vma, to handle the same pages mapped from
> > > multiple VMAs.
> >
> > Can you clarify when you use anon_vma to organize the swap out and
> > swap in, do you want to write a range of pages rather than just one
> > page at a time? Will write back a sub list of the LRU work for you?
> > Ideally we shouldn't write back pages that are hot. anon_vma alone
> > does not give us that information.
>
> So filesystems do write back all pages in an inode that are dirty,
> regardless of whether they're hot. But, as noted, we do like to
> get the pagecache written back periodically even if the pages are
> going to be redirtied soon. And this is somewhere that I think there's
Yes, I think there is a critical difference between file systems and
anonymous memory in this regard. For a file system, writing out all dirty
pages is more or less OK; it needs to happen eventually anyway. For
anonymous memory, writing out dirty memory has a cost associated with it:
it needs to allocate a swap entry, add the page to the swap cache, etc.
We want to minimize swapping out pages that are hot.
> a difference between anon & file pages. So maybe the algorithm looks
> something like this:
>
> A: write page fault causes page to be created
You are talking about a swap-in page fault, right? Are you only going to
write out pages that have recently been swapped in?
> B: scan unmaps page, marks it dirty, does not start writeout
Sorry for the many questions; I just want to make sure I understand what
you are saying correctly.
1) Scan in what order? In pfn order, or following the anon_vma and
scanning all pages in that anon_vma?
2) Which pages does the scan unmap? All pages in the anon_vma, or only
the pages that recently had a swap-in page fault in step A?
> C: scan finds dirty, unmapped anon page, starts writeout
Can you clarify "scan finds dirty"? Where does the "dirty" come from?
Does it come only from step B above, or does it also involve scanning the
PTE dirty/accessed bits via LRU/MGLRU?
I think you mean the dirty state comes from step B; I just want to make sure.
> D: scan finds clean unmapped anon page, frees it
It seems you are unmapping the page and using the resulting page fault to
detect whether the page is still needed. That is much more expensive than
scanning the PTE dirty/accessed bits.
>
> so it will actually take three trips around the whole of memory for
> the physical scan to evict an anon page. That should be adequate
> time for a workload to fault back in a page that's actually hot.
> (if a page fault finds a page in state B, it transitions back to state
> A and gets three more trips around the clock).
That seems limited to reclaiming pages that were already swapped out and
then recently swapped in.
How does it reclaim the first page when no page has been swapped out
previously? It seems it would require step B to unmap every scanned page,
not just the swapped-in ones. That would be a big performance hit. I
still feel that I am missing something in your steps A -> D.
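
To check my reading of steps A -> D, here is a toy state machine for a single page. This is only a sketch of how I understand the proposal, not kernel code: the state names and helper functions are invented, writeout is assumed to finish before the scan next visits the page, and a fault on a clean unmapped page is assumed to behave like a fault in state B (the description above only covers state B).

```python
from enum import Enum


class PageState(Enum):
    MAPPED = "after A: mapped and in use"
    UNMAPPED_DIRTY = "after B: unmapped, marked dirty, no writeout yet"
    UNMAPPED_CLEAN = "after C: writeout assumed complete"
    FREED = "after D: freed"


def scan_visit(state):
    """One visit by the scan moves the page one step closer to FREED."""
    if state is PageState.MAPPED:
        return PageState.UNMAPPED_DIRTY   # step B: unmap, mark dirty
    if state is PageState.UNMAPPED_DIRTY:
        return PageState.UNMAPPED_CLEAN   # step C: start writeout
    return PageState.FREED                # step D: free the clean page


def page_fault(state):
    """A fault on an unmapped page maps it again (back to state A)."""
    if state is PageState.FREED:
        raise ValueError("page already freed; this would be a swap-in")
    return PageState.MAPPED


def run(events, state=PageState.MAPPED):
    """Apply a sequence of "scan" / "fault" events to one page."""
    for event in events:
        state = scan_visit(state) if event == "scan" else page_fault(state)
    return state
```

Three scan visits with no fault free the page, and a fault after the first visit restarts the count. Note that `scan_visit` unmaps every mapped page it sees, which is the cost I am asking about above.
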
Chris