Hello guys,

I'd like to submit a patch against linux-2.4.0-test2 regarding the vm/kswapd. The patch is attached to this email. Sorry, I don't have access to a web or ftp server where I can put it.

The following paragraphs try to explain what this patch is supposed to do by describing how the swap works. I'm sure the first part will sound obvious to most of you, in which case you can skip it and go directly to the Idea section.

Linux, like any modern OS, tries to cache almost everything in a common cache in the hope that what is cached will be used, reused or shared soon. The more a system caches, the better the throughput is, and Linux is good at that game. In particular, this cache contains:

. I/O buffers from read/write.
. shared pages.
. potentially shared pages (as an example, when a page is accessed for writing, the system keeps a copy of the original page in the cache).
. read-ahead pages.
. potentially re-usable pages (pages from a process that dies stay in the cache in the hope that the same process will be executed again soon).
. pending swap pages (dirty pages that are/will be swapped out).

Hopefully, this cache grows as long as there is some memory available. The main reason is that we want the system to use all the resources of the machine and not just a subset of them.

Now, when memory becomes too low, it is time to remove some [old] stuff from the cache, and this mechanism is called swapping. In this sense, the probability that the system actually swaps is higher than one might think. Writing an old dirty page to a swap device in order to free it is only one part of the problem. The swap algorithm may have started before that, but the pages removed from the cache happened to be clean and simply not used anymore.

Another interesting behaviour of the swap is that as soon as it is activated, it never ends. The system keeps the available memory between two water marks. The low limit activates the swap while the high limit forces it to stop.
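
The two water marks behave like a hysteresis loop. The following is only a minimal sketch of that idea, not code from the patch or from the kernel: the names `struct watermarks` and `reclaim_should_run` are hypothetical, and the assumption that the decision depends only on a free-page count is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type: the two water marks, expressed in free pages. */
struct watermarks {
    unsigned long low;   /* below this, reclaim is switched on   */
    unsigned long high;  /* at or above this, it is switched off */
};

/*
 * Decide whether reclaim should be running, given the current number
 * of free pages and whether reclaim was already running.
 * Between the two marks the previous state is kept, which is what
 * stops the mechanism from toggling on every allocation.
 */
static bool reclaim_should_run(const struct watermarks *wm,
                               unsigned long free_pages,
                               bool running)
{
    assert(wm->low <= wm->high);

    if (free_pages < wm->low)
        return true;        /* memory is low: start (or keep) reclaiming */
    if (free_pages >= wm->high)
        return false;       /* enough memory again: stop reclaiming */
    return running;         /* in between: keep doing what we were doing */
}
```

With low = 128 and high = 256, feeding the free-page counts 300, 200, 100, 200, 260 in sequence gives off, off, on, on, off: the two samples at 200 produce different answers because the answer depends on which mark was crossed last.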
The range between the low and high marks depends on how much memory the system has at boot time, but it is usually pretty small; the deal here is not to throw away all the content of the memory but rather to remove from the cache what seems to be irrelevant (ok, the less important stuff).

To recap, the points I wanted to make are the following: the swap algorithm is basically a cache replacement problem; by design, the system does eventually swap; and finally, once the swap mechanism is activated, it never ends until the system shuts down.

Idea
----

The main problem with kswapd comes from the fact that it actually handles 2 jobs that are, from my point of view, completely different. The first one is to actually free some memory by removing pages from the cache and/or starting a disk I/O for a dirty page (and even waiting for it if the disk queue becomes dangerously flooded). The second job is to figure out which pages can be "safely" freed, i.e. which pages are the least recently used. I don't think the 2 jobs are compatible in terms of when to start, what to do and how to stop.

So, the idea of this patch is actually simple: do the same thing but do it a little bit differently.

. A new thread (kpaged) ages the physical pages and tries to keep a set of LRU pages per virtual mapping. The execution model of this thread is based as much as possible on the idle thread. There are two reasons for that. First, I believe there are enough spare cycles in the system to do the job in the background (especially during pageout activity, where I/Os are important). Second, if there is not enough idle time, it probably means that the system is entering an "overload" situation and kpaged won't have the time to find the LRU pages correctly anyway.

. The kswapd thread, as usual, wakes up when the memory becomes low and checks that it is relatively easy to remove/get a page from the page cache. If it's not, it starts "flushing" the cache by swapping out the LRU pages computed by kpaged.
If kpaged didn't cope, kswapd falls back to the original algorithm and swaps out pages based on the RSS usage.

. Finally, an allocation request does not try to swap out anything; it just requests a page from the page cache.

Other improvement?/modifications
--------------------------------

The following is a list of modifications I made to the vm/swapout in addition to the algorithm described above. They are, in a sense, minor, but I believe still important:

. An allocation request cannot fail because the pageout mechanism didn't keep up. The only way a normal (i.e. non-atomic) memory allocation should fail is if the system is out of swap or if an error occurred during the swap. If kswapd is too slow, the allocation will wait for kswapd to catch up.

. The swap doesn't deal with processes but rather with virtual mappings. Processes can share a virtual mapping because of fork() or because of multithreaded applications. The problem of swapping is to deal with the currently allocated memory; swapping processes doesn't seem to be fair or really efficient.

. A read-ahead memory allocation can be discarded if the available memory is too low. Read-ahead is very important in the system. However, when the swap is active, a read-ahead page can be removed from the cache before being hit, and in this case we just overload the system for nothing.

. The swap defers a small amount of dirty pages that need to be written to the swap device (this is a patch I found on the Linux-MM web page, coming from Eric W. Biederman I believe).

Some measurements show that a small percentage of the LRU pages put in the cache by kswapd are actually reused before being freed. Well, I believe this proves that trying to predict the future by looking at the past doesn't work all the time.

This patch seems to work well for me. But I validated/tested it on my own computer, using my own environment. It's obviously a rather subjective opinion.
In particular, I didn't check it on an SMP machine, so I don't know how it behaves, or even whether it works, on SMP.

I modified the Alt-SysRq-M key to give a better understanding of what's going on in the system:

Swap cache: add {A} [{B}-{C}], del {D}, find {E}/{F} [{G}] {H}%
kswapd: total {I} overload {J} out of sync {K}
kswapd: wakeup {L} [g {M} y {N} o {O} r {P}] free {Q} io {R}
kswapd: aged pages {S} dirty pages {T}

A: total number of pages added to the cache by the swap mechanism.
B: number of swap pages added because of read-ahead.
C: number of swap pages added because of kswapd.
D: total number of swap pages deleted from the cache.
E: number of pages found in the cache during a swap page fault.
F: total number of swap page faults.
G: number of pages marked for swapout found in the cache during a swap page fault.
H: average percentage of hits in the page swap cache.
I: total number of pages marked for swapout by kswapd.
J: number of times kswapd fell back to the RSS usage algorithm.
K: number of times a memory allocation had to wait for kswapd.
L: number of times kswapd has been woken up.
M, N, O, P: number of times kpaged ran in green, yellow, orange and red mode respectively.
Q: number of times kswapd tried to free something.
R: number of times kswapd tried to swap out a virtual mapping.
S: current view of the total number of LRU pages in the system.
T: number of pending dirty pages in the cache.

Ludo.