Hello guys,

I'd like to submit a patch against linux-2.4.0-test2 regarding
the vm/kswapd. The patch is attached to this email. Sorry I don't
have access to a web or ftp server where I can put it.

The following paragraph tries to explain what this patch is supposed to do
by describing how the swap works. I'm sure the first part will sound obvious
to most of you, in which case you can just skip it and go directly to
the Idea section.
Linux, like any modern OS, tries to cache almost everything in a common cache
with the hope that what is cached will be used/reused/shared soon. The more
a system caches, the better the throughput is and Linux is good at that game.
In particular, this cache contains:
   . I/O buffers from read/write.
   . shared pages.
   . potentially shared pages (as an example, when a page is accessed for
     writing, the system keeps a copy of the original page in the cache).
   . read-ahead pages.
   . potentially re-usable pages (pages from a process that dies stay in
     the cache in the hope that the same process will be executed again soon).
   . pending swap pages (dirty pages that are/will be swapped out).
Hopefully, this cache grows as long as there is some memory available. The main
reason is that we want the system to use all the resources of the machine and
not just a subset of them. Now, when free memory becomes too low, it is time to
remove some [old] stuff from the cache, and this mechanism is called swapping.
In this sense, the probability that the system actually swaps is higher than
one might think. Writing an old dirty page to a swap device in order to free it
is only one part of the problem. The swap algorithm may have started before
that, but the pages removed from the cache happened to be clean and simply not
used anymore.
Another interesting behaviour of the swap is that as soon as it is activated,
it never ends. The system keeps the available memory between two watermarks.
The low limit activates the swap while the high limit forces it to stop.
The range between the low and high mark depends on how much memory the system
has at boot time, but it is usually pretty small; the deal here is not to throw
away all the content of the memory but rather remove from the cache what seems
to be irrelevant (ok, the less important stuff).
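
As a rough illustration (this is not code from the patch), the two-watermark
behaviour described above can be sketched in C as follows. The names
watermarks and reclaim_should_run are made up for this example, and the exact
comparisons at the limits are an assumption.

```c
#include <stdbool.h>

/* Hypothetical structure; not taken from the patch or the kernel sources. */
struct watermarks {
    unsigned long low;   /* free pages at or below this: start reclaiming */
    unsigned long high;  /* free pages at or above this: stop reclaiming  */
};

/*
 * Decide whether reclaim should be running, given the current number of
 * free pages and whether reclaim was already running.
 */
static bool reclaim_should_run(const struct watermarks *wm,
                               unsigned long free_pages, bool was_active)
{
    if (free_pages <= wm->low)
        return true;            /* low limit activates the swap */
    if (free_pages >= wm->high)
        return false;           /* high limit forces it to stop */
    return was_active;          /* between the marks: keep the previous state */
}
```

Between the two limits the previous state is kept, which is what lets the
system hover in that range instead of toggling on every allocation.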

To recap, the points I wanted to make are the following: the swap
algorithm is basically a cache replacement problem; by design, the system
does eventually swap; and finally, once the swap mechanism is activated,
it never ends until the system shuts down.

Idea
----
The main problem with kswapd comes from the fact that it actually handles
two jobs that are, from my point of view, completely different. The first one
is to actually free some memory by removing pages from the cache and/or
starting a disk I/O for a dirty page (and even waiting for it if the disk queue
becomes dangerously flooded). The second job is to figure out which pages can
be "safely" freed, i.e. which pages are the least recently used. I don't think
the two jobs are compatible in terms of when to start, what to do and how to
stop.
So, the idea of this patch is actually simple: do the same thing but do it
a little bit differently.

    . A new thread (kpaged) ages the physical pages and tries to keep a set
      of LRU pages per virtual mapping. The execution model of this thread is
      based as much as possible on the idle thread. There are two reasons for
      that. First, I believe there are enough spare cycles in the system to do
      the job in the background (especially during pageout activity, where
      I/Os are important). Second, if there is not enough idle time, it
      probably means that the system is entering an "overload" situation and
      kpaged won't have the time to correctly find the LRU pages anyway.
    . The kswapd thread, as usual, wakes up when the memory becomes low and
      checks that it is relatively easy to remove/get a page from the page
      cache. If it's not, it starts "flushing" the cache by swapping out the
      LRU pages computed by kpaged. If kpaged didn't cope, kswapd falls back
      to the original algorithm and swaps out pages based on the RSS usage.
    . Finally, an allocation request does not try to swap out anything; it just
      requests a page from the page cache.
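
To make the kswapd part of the list above more concrete, here is a minimal
sketch in C of the decision it describes. This is not the code from the patch:
vm_state, its fields and kswapd_choose are hypothetical names, and "kpaged
didn't cope" is interpreted here (as an assumption) as "kpaged has no aged LRU
pages ready".

```c
#include <stdbool.h>

enum reclaim_action {
    RECLAIM_FROM_CACHE,  /* a page can easily be taken from the page cache */
    SWAP_OUT_AGED_LRU,   /* flush the LRU pages computed by kpaged */
    SWAP_OUT_BY_RSS      /* fallback: original RSS-based algorithm */
};

/* Hypothetical snapshot of what kswapd sees when it wakes up. */
struct vm_state {
    bool cache_page_easy;      /* is it easy to get a page from the cache? */
    unsigned long aged_pages;  /* LRU pages kpaged has identified so far */
};

static enum reclaim_action kswapd_choose(const struct vm_state *s)
{
    if (s->cache_page_easy)
        return RECLAIM_FROM_CACHE;
    if (s->aged_pages > 0)
        return SWAP_OUT_AGED_LRU;
    /* Assumption: no aged pages means kpaged did not keep up. */
    return SWAP_OUT_BY_RSS;
}
```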

Other improvements?/modifications
---------------------------------
The following is a list of modifications I made to the vm/swapout in addition
to the algorithm described above. They are, in a sense, minor but I believe
still important:

    . An allocation request cannot fail because the pageout mechanism didn't
      keep up. The only way a normal (i.e. non-atomic) memory allocation should
      fail is if the system is out of swap or if an error occurred during the
      swap. If kswapd is too slow, the allocation will wait for kswapd to
      catch up.
    . The swap doesn't deal with processes but rather with virtual mappings.
      Processes can share a virtual mapping because of fork() or because of
      multithreaded applications. The problem of swapping is to deal with the
      currently allocated memory; swapping processes doesn't seem to be fair
      or really efficient.
    . A read-ahead memory allocation can be discarded if the available memory
      is too low. Read-ahead is very important in the system. However when the
      swap is active, a read-ahead page can be removed from the cache before
      being hit and in this case we just overload the system for nothing.
    . The swap defers a small number of dirty pages that need to be written to
      the swap device (this is a patch I found on the Linux-MM web page, coming
      from Eric W. Biederman I believe). Some measurements show that a small
      percentage of LRU pages put in the cache by kswapd are actually reused
      before being freed. Well, I believe this proves that trying to predict
      the future by looking at the past doesn't work all the time.
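
The allocation rules from the first and third items can be summarised by the
following sketch in C. It is only an illustration under stated assumptions,
not the patch itself: the enum names and alloc_policy are invented for this
example, and the handling of atomic requests (failing when no page is
available) is inferred from the remark that only non-atomic allocations are
expected to wait.

```c
#include <stdbool.h>

enum alloc_kind { ALLOC_NORMAL, ALLOC_ATOMIC, ALLOC_READAHEAD };

enum alloc_result {
    ALLOC_OK,               /* a page is handed out */
    ALLOC_WAIT_FOR_KSWAPD,  /* sleep until kswapd catches up */
    ALLOC_FAIL,             /* out of swap, or atomic with no page */
    ALLOC_DISCARD           /* read-ahead request dropped */
};

/* Hypothetical policy function; all parameters are assumptions. */
static enum alloc_result alloc_policy(enum alloc_kind kind,
                                      bool page_available,
                                      bool memory_low,
                                      bool swap_exhausted)
{
    /* Read-ahead is optional work: drop it when memory is too low. */
    if (kind == ALLOC_READAHEAD && memory_low)
        return ALLOC_DISCARD;
    if (page_available)
        return ALLOC_OK;
    /* Assumption: an atomic request cannot sleep, so it fails here. */
    if (kind == ALLOC_ATOMIC)
        return ALLOC_FAIL;
    /* A normal request fails only if the system is out of swap. */
    if (swap_exhausted)
        return ALLOC_FAIL;
    return ALLOC_WAIT_FOR_KSWAPD;
}
```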

This patch seems to work well for me. But I validated/tested it on my own
computer, using my own environment, so it's obviously a rather subjective
opinion. In particular, I didn't check it on an SMP machine, so I don't know
how it behaves or even whether it works on SMP.
I modified the Alt-SysRq-M key to give a better understanding of what's going
on in the system:

Swap cache: add {A} [{B}-{C}], del {D}, find {E}/{F} [{G}] {H}%
kswapd: total {I} overload {J} out of sync {K}
kswapd: wakeup {L} [g {M} y {N} o {O} r {P}] free {Q} io {R}
kswapd: aged pages {S} dirty pages {T}

A: total number of pages added to the cache by the swap mechanism.
B: number of swap pages added because of the read-ahead.
C: number of swap pages added because of kswapd.
D: total number of swap pages deleted from the cache.
E: number of pages found in the cache during a swap page fault.
F: total number of swap page faults.
G: number of pages marked for swapout found in the cache during a
   swap page fault.
H: average percentage of hits in the page swap cache.
I: total number of pages marked for swapout by kswapd.
J: number of times kswapd fell back to the RSS usage algorithm.
K: number of times a memory allocation had to wait for kswapd.
L: number of times kswapd has been woken up.
M, N, O, P: number of times kpaged ran in green, yellow, orange and
   red mode respectively.
Q: number of times kswapd tried to free something.
R: number of times kswapd tried to swapout a virtual mapping.
S: current view of the total number of LRU pages in the system.
T: number of pending dirty pages in the cache.
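
For readers who want to relate H to the other counters, here is a small sketch
in C. It assumes that H is simply E divided by F, expressed as a percentage;
the exact computation used by the patch is not stated above, and
swap_hit_percent is a name made up for this example.

```c
/*
 * Assumed relation: H = 100 * E / F, where E is the number of pages found
 * in the cache during a swap page fault and F is the total number of swap
 * page faults. Returns 0 when no fault has happened yet.
 */
static unsigned long swap_hit_percent(unsigned long found,
                                      unsigned long faults)
{
    if (faults == 0)
        return 0;  /* avoid dividing by zero before the first fault */
    return (found * 100) / faults;
}
```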


Ludo.