Hi,

On Wed, 16 Dec 1998 17:24:05 -0800 (PST), Linus Torvalds said:

> On Tue, 1 Dec 1998, Rik van Riel wrote:
>>
>> --- ./mm/vmscan.c.orig	Thu Nov 26 11:26:50 1998
>> +++ ./mm/vmscan.c	Tue Dec 1 07:12:28 1998
>> @@ -431,6 +431,8 @@
>>  	kmem_cache_reap(gfp_mask);
>>
>>  	if (buffer_over_borrow() || pgcache_over_borrow())
>> +		state = 0;
>> +	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster / 2)
>>  		shrink_mmap(i, gfp_mask);
>>
>>  	switch (state) {
>
> I really hate the above tests that make no sense at all from a
> conceptual view, and are fairly obviously just something to correct
> for a more basic problem.

Agreed: I've been saying this for several years now. :)

Linus, I've had a test with your 132-pre2 patch, and the performance is
really disappointing in some important cases.  Particular effects I can
reproduce with it include:

* Extra file IO activity

  Doing a kernel build on a full (lots of applications have been
  loaded) but otherwise idle 64MB machine results in sustained 50 to
  200kB/sec IO block read rates according to vmstat.  I've never seen
  this with older kernels, and it results in a drop of about 10% in the
  cpu utilisation sustained over the entire kernel build.  I've had
  independent confirmation of this effect from other people.

* Poor swapout performance

  On my main development box, I've been able to sustain about 3MB/sec
  to swap quite easily when the VM got busy on most recent kernels
  since 2.1.130 (including all the late ac* patches with my VM changes
  in).  Swap out peaks at a little under 4MB/sec, and I can sustain
  about 3MB/sec combined read+write traffic too.  It streams to/from
  swap very well indeed.  132-pre2 peaks at about 800kB/sec to swap,
  and sustains between 300 and 400kB/sec.

* Swap fragmentation

  The reduced swap streaming means that swap does seem to get much more
  fragmented than under, say, ac11.  In particular, this appears to
  have two side effects: it defeats the swap clustered readin code in
  ac11 (which I have ported forward to 132-pre2), resulting in much
  slower swapping behaviour if I start up more applications than I have
  RAM for and swap between them; and, especially on low memory, the
  swap fragmentation appears to make successive compilation runs in 8MB
  ever slower as bits of my background tasks (https, cron) scatter
  themselves over swap.

The problem that we have with the strict state-driven logic in
do_try_to_free_page is that, for prolonged periods, it can bypass the
normal shrink_mmap() loop which we _do_ want to keep active even while
swapping.  However, I think that the 132-pre2 cure is worse than the
disease, because it penalises swap to such an extent that we lose the
substantial performance benefit that comes from being able to stream
both to and from swap rapidly.
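To make the bypass concrete, this is roughly the shape of that state
machine -- heavily simplified and from memory, not a verbatim copy of
the pre2 source (the real function also retries at decreasing
priority):

	static int do_try_to_free_page(int gfp_mask)
	{
		/* static: remembers where we left off on the last call */
		static int state = 0;
		int i = 6;

		kmem_cache_reap(gfp_mask);
		/* ... the buffer/pgcache over-borrow test quoted above ... */

		switch (state) {
		case 0:
			if (shrink_mmap(i, gfp_mask))
				return 1;
			state = 1;
			/* fall through */
		case 1:
			if (shm_swap(i, gfp_mask))
				return 1;
			state = 2;
			/* fall through */
		case 2:
			/* While swap_out() keeps succeeding we return here
			 * with state still == 2, so every subsequent call
			 * re-enters at case 2 and shrink_mmap() above is
			 * not reached again until swapping fails. */
			if (swap_out(i, gfp_mask))
				return 1;
			state = 3;
			/* fall through */
		case 3:
			shrink_dcache_memory(i, gfp_mask);
			state = 0;
		}
		return 0;
	}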
The VM in 2.1.131-ac11+ seems to work incredibly well.  On my own 64MB
box it feels as if the memory has doubled.  I've had similar feedback
from other people, including reports of 300% performance improvement
over 2.0 in 4MB memory (!).  Alan reports a huge increase in the uptake
of his ac patches since the new VM stuff went in there.

I've tried to port the best bits of that VM to 132-pre2, preserving
your do_try_to_free_page state change, but so far I have not been able
to find a combination which gives anywhere near the overall performance
of ac11 for all of my test cases (although it works reasonably well on
low memory at first, until we start to fragment swap).

The patch below is the best I have so far against 132-pre2.  You will
find that it has absolutely no references to the borrow percentages,
and although it does honour the buffer/pgcache min percentages, those
default to 1%.

Andrea, I know you've seen odd behaviours since 2.1.131, although I'm
not quite sure exactly which VMs you've been testing on.  The one
change I've found which does have a significant effect on
predictability here is in do_try_to_free_page:

	if (current != kswapd_task)
		if (shrink_mmap(6, gfp_mask))
			return 1;

which means that even if kswapd is busy swapping, we can _still_ bypass
the swap and go straight to the cache shrinking if we need more memory.
The overall effect I observe is that large IO-bound tasks _can_ still
grow the cache, and I don't see any excessive input IO during a kernel
build, but kswapd itself can still stream efficiently out to swap.

The patch also includes a few extra performance counters in
/proc/swapstats, and adds back the heuristic from a while ago that the
kswapd wakeup has a hysteresis behaviour between freepages.high and
freepages.med: kswapd will remain inactive until nr_free_pages falls to
freepages.med, and will then swap until free memory is brought back up
to freepages.high.  Any failure of shrink_mmap immediately kicks kswapd
into action, though.  (There is a rough sketch of what I mean below my
sig.)  To be honest, I haven't been able to measure a huge difference
from this, but it's in my current tree so you are welcome to it.

Finally, the patch includes the swap and mmap clustered read logic.
That is entirely responsible for my being able to sustain 2MB/sec or
more swapin performance, and 5MB/sec disk performance when doing a
mmap-based grep.

Tested on 8MB, 64MB and with high filesystem and VM load.  Doing an
anonymous-page stress test (basically a memset on a region 3 times
physical memory) it sustains 1.5MB/sec to swap (and about 150kB/sec
from swap) for a couple of minutes until completion.  Performance sucks
during this, but X is still usable (although switching windows is
slow), "vmstat 1" in an xterm didn't miss a tick, and all the
swapped-out applications swapped back in within a couple of seconds
after the test was complete.

Please test and comment.  Note that I'll be mostly offline until the
new year, so don't expect me to test it too much more until then.
However, this VM is mostly equivalent to the one in ac11, except
without the messy borrow percentage rules and with the extra
shrink_mmap for foreground page stealing in do_try_to_free_page.

--Stephen
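P.S.  The wakeup hysteresis mentioned above amounts to something like
the following.  Illustrative only, not a verbatim extract from the
patch: the "kswapd_awake" flag and the two helpers are just names made
up for this example, and the freepages field names are as described in
the text:

	static int kswapd_awake = 0;

	/* Called from the allocator: do we need to kick kswapd? */
	static int kswapd_needs_wakeup(void)
	{
		if (!kswapd_awake && nr_free_pages <= freepages.med)
			kswapd_awake = 1;	/* hit the low-water mark */
		return kswapd_awake;
	}

	/* Called from kswapd's main loop: keep stealing pages? */
	static int kswapd_keep_going(void)
	{
		if (nr_free_pages >= freepages.high) {
			kswapd_awake = 0;	/* refilled to the high-water mark */
			return 0;		/* go back to sleep */
		}
		return 1;
	}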