Hi,

On Wed, 16 Dec 1998 17:24:05 -0800 (PST), Linus Torvalds said:

> On Tue, 1 Dec 1998, Rik van Riel wrote:
>>
>> --- ./mm/vmscan.c.orig	Thu Nov 26 11:26:50 1998
>> +++ ./mm/vmscan.c	Tue Dec 1 07:12:28 1998
>> @@ -431,6 +431,8 @@
>>  	kmem_cache_reap(gfp_mask);
>>
>>  	if (buffer_over_borrow() || pgcache_over_borrow())
>> +		state = 0;
>> +	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster / 2)
>>  		shrink_mmap(i, gfp_mask);
>>
>>  	switch (state) {
>
> I really hate the above tests that make no sense at all from a
> conceptual view, and are fairly obviously just something to correct
> for a more basic problem.

Agreed: I've been saying this for several years now. :)

Linus, I've had a test with your 132-pre2 patch, and the performance is
really disappointing in some important cases.  Particular effects I can
reproduce with it include:

* Extra file IO activity

  Doing a kernel build on a full (lots of applications have been
  loaded) but otherwise idle 64MB machine results in sustained 50 to
  200kB/sec IO block read rates according to vmstat.  I've never seen
  this with older kernels, and it results in a drop of about 10% in the
  cpu utilisation sustained over the entire kernel build.  I've had
  independent confirmation of this effect from other people.

* Poor swapout performance

  On my main development box, I've been able to sustain about 3MB/sec
  to swap quite easily when the VM got busy on most recent kernels
  since 2.1.130 (including all the late ac* patches with my VM changes
  in).  Swap out peaks at a little under 4MB/sec, and I can sustain
  about 3MB/sec combined read+write traffic too.  It streams to/from
  swap very well indeed.  132-pre2 peaks at about 800kB/sec to swap,
  and sustains between 300 and 400kB/sec.

* Swap fragmentation

  The reduced swap streaming means that swap does seem to get much more
  fragmented than under, say, ac11.  In particular, this appears to
  have two side effects: it defeats the swap clustered readin code in
  ac11 (which I have ported forward to 132-pre2), resulting in much
  slower swapping behaviour if I start up more applications than I have
  RAM for and swap between them; and, especially on low memory, the
  swap fragmentation appears to make successive compilation runs in 8MB
  ever slower as bits of my background tasks (https, cron) scatter
  themselves over swap.

The problem that we have with the strict state-driven logic in
do_try_to_free_page is that, for prolonged periods, it can bypass the
normal shrink_mmap() loop which we _do_ want to keep active even while
swapping.  However, I think that the 132-pre2 cure is worse than the
disease, because it penalises swap to such an extent that we lose the
substantial performance benefit that comes from being able to stream
both to and from swap rapidly.
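To make the bypass concrete, this is roughly the shape of that state
machine -- heavily simplified and from memory, not a verbatim copy of
the pre2 source (the real function also retries at decreasing
priority):

	static int do_try_to_free_page(int gfp_mask)
	{
		/* static: remembers where we left off on the last call */
		static int state = 0;
		int i = 6;

		kmem_cache_reap(gfp_mask);
		/* ... the buffer/pgcache over-borrow test quoted above ... */

		switch (state) {
		case 0:
			if (shrink_mmap(i, gfp_mask))
				return 1;
			state = 1;
			/* fall through */
		case 1:
			if (shm_swap(i, gfp_mask))
				return 1;
			state = 2;
			/* fall through */
		case 2:
			/* While swap_out() keeps succeeding we return here
			 * with state still == 2, so every subsequent call
			 * re-enters at case 2 and shrink_mmap() above is
			 * not reached again until swapping fails. */
			if (swap_out(i, gfp_mask))
				return 1;
			state = 3;
			/* fall through */
		case 3:
			shrink_dcache_memory(i, gfp_mask);
			state = 0;
		}
		return 0;
	}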
The VM in 2.1.131-ac11+ seems to work incredibly well.  On my own 64MB
box it feels as if the memory has doubled.  I've had similar feedback
from other people, including reports of 300% performance improvement
over 2.0 in 4MB memory (!).  Alan reports a huge increase in the uptake
of his ac patches since the new VM stuff went in there.

I've tried to port the best bits of that VM to 132-pre2, preserving
your do_try_to_free_page state change, but so far I have not been able
to find a combination which gives anywhere near the overall performance
of ac11 for all of my test cases (although it works reasonably well on
low memory at first, until we start to fragment swap).

The patch below is the best I have so far against 132-pre2.  You will
find that it has absolutely no references to the borrow percentages,
and although it does honour the buffer/pgcache min percentages, those
default to 1%.

Andrea, I know you've seen odd behaviours since 2.1.131, although I'm
not quite sure exactly which VMs you've been testing on.  The one
change I've found which does have a significant effect on
predictability here is in do_try_to_free_page:

	if (current != kswapd_task)
		if (shrink_mmap(6, gfp_mask))
			return 1;

which means that even if kswapd is busy swapping, we can _still_ bypass
the swap and go straight to the cache shrinking if we need more memory.
The overall effect I observe is that large IO-bound tasks _can_ still
grow the cache, and I don't see any excessive input IO during a kernel
build, but kswapd itself can still stream efficiently out to swap.

The patch also includes a few extra performance counters in
/proc/swapstats, and adds back the heuristic from a while ago that the
kswapd wakeup has a hysteresis behaviour between freepages.high and
freepages.med: kswapd will remain inactive until nr_free_pages falls to
freepages.med, and will then swap until free memory is brought back up
to freepages.high.  Any failure of shrink_mmap immediately kicks kswapd
into action, though.  (There is a rough sketch of what I mean below my
sig.)  To be honest, I haven't been able to measure a huge difference
from this, but it's in my current tree so you are welcome to it.

Finally, the patch includes the swap and mmap clustered read logic.
That is entirely responsible for my being able to sustain 2MB/sec or
more swapin performance, and 5MB/sec disk performance when doing a
mmap-based grep.

Tested on 8MB, 64MB and with high filesystem and VM load.  Doing an
anonymous-page stress test (basically a memset on a region 3 times
physical memory) it sustains 1.5MB/sec to swap (and about 150kB/sec
from swap) for a couple of minutes until completion.  Performance sucks
during this, but X is still usable (although switching windows is
slow), "vmstat 1" in an xterm didn't miss a tick, and all the
swapped-out applications swapped back in within a couple of seconds
after the test was complete.

Please test and comment.  Note that I'll be mostly offline until the
new year, so don't expect me to test it too much more until then.
However, this VM is mostly equivalent to the one in ac11, except
without the messy borrow percentage rules and with the extra
shrink_mmap for foreground page stealing in do_try_to_free_page.

--Stephen
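P.S.  The wakeup hysteresis mentioned above amounts to something like
the following.  Illustrative only, not a verbatim extract from the
patch: the "kswapd_awake" flag and the two helpers are just names made
up for this example, and the freepages field names are as described in
the text:

	static int kswapd_awake = 0;

	/* Called from the allocator: do we need to kick kswapd? */
	static int kswapd_needs_wakeup(void)
	{
		if (!kswapd_awake && nr_free_pages <= freepages.med)
			kswapd_awake = 1;	/* hit the low-water mark */
		return kswapd_awake;
	}

	/* Called from kswapd's main loop: keep stealing pages? */
	static int kswapd_keep_going(void)
	{
		if (nr_free_pages >= freepages.high) {
			kswapd_awake = 0;	/* refilled to the high-water mark */
			return 0;		/* go back to sleep */
		}
		return 1;
	}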