* RE: on load control / process swapping
@ 2001-05-16 15:17 Charles Randall
2001-05-16 17:14 ` Matt Dillon
0 siblings, 1 reply; 39+ messages in thread
From: Charles Randall @ 2001-05-16 15:17 UTC (permalink / raw)
To: 'Matt Dillon', Roger Larsson
Cc: Rik van Riel, arch, linux-mm, sfkaplan
On a related note, we have a process (currently on Solaris, but possibly
moving to FreeBSD) that reads a 26 GB file just once for a database load. On
Solaris, we use the directio() function call to tell the filesystem to
bypass the buffer cache for this file descriptor.
From the Solaris directio() man page,
DIRECTIO_ON
The system behaves as though the application is not
going to reuse the file data in the near future. In
other words, the file data is not cached in the
system's memory pages.
We found that without this, Solaris was aggressively trying to cache the
huge input file at the expense of database load performance (but we knew
that we'd never access it again). For some applications this is a huge win
(random I/O on a file much larger than memory seems to be another case).
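As an illustration of the call being discussed, a minimal sketch of the Solaris pattern might look like the following (the path and buffer size are placeholders, and error handling is trimmed):

#include <sys/types.h>
#include <sys/fcntl.h>   /* directio() and DIRECTIO_ON on Solaris */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    static char buf[1 << 20];           /* 1 MB read buffer */
    ssize_t n;
    int fd = open("/data/load/input.dat", O_RDONLY);   /* placeholder path */

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Tell the filesystem this descriptor's data will not be reused,
     * so it should bypass the buffer cache (Solaris-specific call). */
    if (directio(fd, DIRECTIO_ON) < 0)
        perror("directio");             /* non-fatal: fall back to cached I/O */

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;                               /* hand each chunk to the loader here */

    close(fd);
    return 0;
}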
Would there be an advantage to having a similar feature in FreeBSD (if not
already present)?
-Charles
-----Original Message-----
From: Matt Dillon [mailto:dillon@earth.backplane.com]
Sent: Tuesday, May 15, 2001 6:17 PM
To: Roger Larsson
Cc: Rik van Riel; arch@FreeBSD.ORG; linux-mm@kvack.org;
sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
:Are the heuristics persistent?
:Or will the first use after boot use the rough prediction?
:For how long time will the heuristic stick? Suppose it is suddenly used in
:a slightly different way. Like two sequential readers instead of one...
:
:/RogerL
:Roger Larsson
:Skelleftea
:Sweden
It's based on the VM page cache, so it's adaptive over time. I wouldn't
call it persistent; it is nothing more than a simple heuristic that
'normally' throws a page away but 'sometimes' caches it. In other words,
you lose some performance on the frontend in order to gain some later
on. If you loop through a file enough times, most of the file
winds up getting cached. It's still experimental so it is only
lightly tied into the system. It seems to work, though, so at some
point in the future I'll probably try to put some significant prediction
in. But as I said, it's a very difficult thing to predict. You can't
just put your foot down and say 'I'll cache X amount of file Y'. That
doesn't work at all.
-Matt
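Matt's "normally throw it away, sometimes cache it" behaviour can be pictured with a toy admission policy; this is only an illustrative model, not the FreeBSD code he refers to, and the 1-in-N ratio is invented:

#include <stdlib.h>

/* Toy model: admit roughly one page in N from a sequential scan into
 * the cache.  After several passes over the same file most of it ends
 * up cached, which is the "lose a little up front, win later"
 * behaviour described above. */
#define ADMIT_ONE_IN 8

static int should_cache_page(void)
{
    return (rand() % ADMIT_ONE_IN) == 0;
}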
* Re: RE: on load control / process swapping
2001-05-16 15:17 on load control / process swapping Charles Randall
@ 2001-05-16 17:14 ` Matt Dillon
2001-05-16 17:41 ` Rik van Riel
0 siblings, 1 reply; 39+ messages in thread
From: Matt Dillon @ 2001-05-16 17:14 UTC (permalink / raw)
To: Charles Randall; +Cc: Roger Larsson, Rik van Riel, arch, linux-mm, sfkaplan
We've talked about implementing O_DIRECT. I think it's a good idea.
In regards to the particular case of scanning a huge multi-gigabyte
file, FreeBSD has a sequential detection heuristic which does a
pretty good job preventing cache blow-aways by depressing the priority
of the data as it is read or written. FreeBSD will still try to cache
a good chunk, but it won't sacrifice all available memory. If you
access the data via the VM system, through mmap, you get even more
control through the madvise() syscall.
-Matt
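A rough sketch of the mmap()/madvise() route Matt mentions, applied to a read-once scan of a very large file; the window size and the process_chunk() consumer are placeholders, and how aggressively each advice flag is honored varies by system:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void process_chunk(const char *p, size_t len);   /* placeholder consumer */

int load_once(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0)
        return -1;

    const off_t chunk = 256L * 1024 * 1024;      /* 256 MB windows, page aligned */
    for (off_t off = 0; off < st.st_size; off += chunk) {
        size_t len = (size_t)(st.st_size - off < chunk ? st.st_size - off : chunk);
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        /* We will read this window once, front to back: let the VM
         * read ahead aggressively and drop pages behind us. */
        madvise(p, len, MADV_SEQUENTIAL);
        process_chunk(p, len);

        /* We will not come back to this window; release its pages. */
        madvise(p, len, MADV_DONTNEED);
        munmap(p, len);
    }
    close(fd);
    return 0;
}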
* Re: RE: on load control / process swapping 2001-05-16 17:14 ` Matt Dillon @ 2001-05-16 17:41 ` Rik van Riel 2001-05-16 17:54 ` Matt Dillon 2001-05-16 17:57 ` Alfred Perlstein 0 siblings, 2 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-16 17:41 UTC (permalink / raw) To: Matt Dillon; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Wed, 16 May 2001, Matt Dillon wrote: > In regards to the particular case of scanning a huge multi-gigabyte > file, FreeBSD has a sequential detection heuristic which does a > pretty good job preventing cache blow-aways by depressing the priority > of the data as it is read or written. FreeBSD will still try to cache > a good chunk, but it won't sacrifice all available memory. If you > access the data via the VM system, through mmap, you get even more > control through the madvise() syscall. There's one thing "wrong" with the drop-behind idea though; it penalises data even when it's still in core and we're reading it for the second or third time. Maybe it would be better to only do drop-behind when we're actually allocating new memory for the vnode in question and let re-use of already present memory go "unpunished" ? Hmmm, now that I think about this more, it _could_ introduce some different fairness issues. Darn ;) regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: RE: on load control / process swapping
2001-05-16 17:41 ` Rik van Riel
@ 2001-05-16 17:54 ` Matt Dillon
2001-05-16 19:59 ` Rik van Riel
2001-05-18 5:58 ` Terry Lambert
2001-05-16 17:57 ` Alfred Perlstein
1 sibling, 2 replies; 39+ messages in thread
From: Matt Dillon @ 2001-05-16 17:54 UTC (permalink / raw)
To: Rik van Riel; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
It's not dropping the data, it's dropping the priority. And yes, it
does penalize the data somewhat. On the other hand, if the data happens
to still be in the cache and you scan it a second time, the page priority
gets bumped up relative to what it already was, so the net effect is that
the data becomes high priority after a few passes.
:Maybe it would be better to only do drop-behind when we're
:actually allocating new memory for the vnode in question and
:let re-use of already present memory go "unpunished" ?
You get an equivalent effect even without dropping the priority, because
you blow away prior pages when reading a file that is larger than main
memory, so they don't exist at all when you re-read. But you do not get
the expected 'recycling' characteristics versus the rest of the system if
you do not make a distinction between sequential and random access.
You want to slightly depress the priority behind a sequential access
because the 'cost' of re-reading the disk sequentially is nothing compared
to the cost of re-reading the disk randomly (by about a 30:1 ratio!). So
keeping randomly seek/read data is more important by degrees than keeping
sequentially read data. This isn't to say that it isn't important to try
to cache sequentially read data, just that the cost of throwing away
sequentially read data is much lower than the cost of throwing away
randomly read data on a general-purpose machine.
Terry's description of 'ld' mmap()ing and doing all sorts of random
seeking, causing most UNIXes, including FreeBSD, to have a brainfart when
the dataset is too big to fit in the cache, is true as far as it goes, but
there really isn't much we can do about that situation 'automatically'.
Without hints, the system can't predict the fact that it should be trying
to cache the whole of the object files being accessed randomly. A hint
could make performance much better... a simple madvise(... MADV_SEQUENTIAL)
on the mapped memory inside LD would probably be beneficial, as would
madvise(... MADV_WILLNEED).
-Matt
:Hmmm, now that I think about this more, it _could_ introduce
:some different fairness issues. Darn ;)
:
:regards,
:
:Rik
* Re: RE: on load control / process swapping 2001-05-16 17:54 ` Matt Dillon @ 2001-05-16 19:59 ` Rik van Riel 2001-05-16 20:41 ` Matt Dillon 2001-05-18 5:58 ` Terry Lambert 1 sibling, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-16 19:59 UTC (permalink / raw) To: Matt Dillon; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Wed, 16 May 2001, Matt Dillon wrote: > :There's one thing "wrong" with the drop-behind idea though; > :it penalises data even when it's still in core and we're > :reading it for the second or third time. > > It's not dropping the data, it's dropping the priority. And yes, it > does penalize the data somewhat. On the otherhand if the data happens > to still be in the cache and you scan it a second time, the page priority > gets bumped up But doesn't it get pushed _down_ again after the process has read the data? Or is this a part of the code outside of vm/* which I haven't read yet? regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: RE: on load control / process swapping
2001-05-16 19:59 ` Rik van Riel
@ 2001-05-16 20:41 ` Matt Dillon
0 siblings, 0 replies; 39+ messages in thread
From: Matt Dillon @ 2001-05-16 20:41 UTC (permalink / raw)
To: Rik van Riel; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
Well, I was going to answer, but I can't find the code. I'll have to
look at it more closely.
-Matt
* Re: on load control / process swapping 2001-05-16 17:54 ` Matt Dillon 2001-05-16 19:59 ` Rik van Riel @ 2001-05-18 5:58 ` Terry Lambert 2001-05-18 6:20 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Terry Lambert @ 2001-05-18 5:58 UTC (permalink / raw) To: Matt Dillon Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan Matt Dillon wrote: > Terry's description of 'ld' mmap()ing and doing all > sorts of random seeking causing most UNIXes, including > FreeBSD, to have a brainfart of the dataset is too big > to fit in the cache is true as far as it goes, but > there really isn't much we can do about that situation > 'automatically'. Without hints, the system can't predict > the fact that it should be trying to cache the whole of > the object files being accessed randomly. A hint could > make performance much better... a simple madvise(... > MADV_SEQUENTIAL) on the mapped memory inside LD would > probably be beneficial, as would madvise(... MADV_WILLNEED). I don't understand how either of those things could help but make overall performance worse. The problem is the program in question is seeking all over the place, potentially multiple times, in order to avoid building the table in memory itself. For many symbols, like "printf", it will hit the area of the library containing their addresses many, many times. The problem in this case is _truly_ that the program in question is _really_ trying to optimize its performance at the expense of other programs in the system. The system _needs_ to make page-ins by this program come _at the expense of this program_, rather than thrashing all other programs out of core, only to have the quanta given to these (now higher priority) programs used to thrash the pages back in, instead of doing real work. The problem is what to do about this badly behaved program, so that the system itself doesn't spend unnecessary time undoing its evil, and so that other (well behaved) programs are not unfairly penalized. Cutler suggested a working set quota (first in VMS, later in NT) to deal with these programs. -- Terry -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping
2001-05-18 5:58 ` Terry Lambert
@ 2001-05-18 6:20 ` Matt Dillon
2001-05-18 10:00 ` Andrew Reilly
2001-05-18 13:49 ` Jonathan Morton
0 siblings, 2 replies; 39+ messages in thread
From: Matt Dillon @ 2001-05-18 6:20 UTC (permalink / raw)
To: Terry Lambert
Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
:I don't understand how either of those things could help
:but make overall performance worse.
:
:The problem is the program in question is seeking all
:over the place, potentially multiple times, in order
:to avoid building the table in memory itself.
:
:For many symbols, like "printf", it will hit the area
:of the library containing their addresses many, many
:times.
:
:The problem in this case is _truly_ that the program in
:question is _really_ trying to optimize its performance
:at the expense of other programs in the system.
The linker is seeking randomly as a side effect of the linking algorithm.
It is not doing it on purpose to try to save memory. Forcing the VM
system to think it's sequential causes the VM system to perform
read-aheads, generally reducing the actual amount of physical seeking
that must occur by increasing the size of the chunks read from disk.
Even if the linker's dataset is huge, increasing the chunk size is
beneficial because linkers ultimately access the entire object file
anyway. Trying to save a few seeks is far more important than reading
extra data and having to throw half of it away.
:The problem is what to do about this badly behaved program,
:so that the system itself doesn't spend unnecessary time
:undoing its evil, and so that other (well behaved) programs
:are not unfairly penalized.
:
:Cutler suggested a working set quota (first in VMS, later
:in NT) to deal with these programs.
:
:-- Terry
The problem is not the resident set size, it's the seeking that the
program is causing as a matter of course. Be that as it may, the
resident set size can be limited with the 'memoryuse' sysctl. The
system imposes the specified limit only when the memory subsystem is
under pressure. You can also reduce the amount of random seeking the
linker does by ordering the object modules within the library to
forward-reference the dependencies.
-Matt
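For completeness, a resident-set cap like the one Matt mentions can also be expressed through the standard resource-limit interface; a sketch, assuming the platform actually honors RLIMIT_RSS (some systems ignore it) and using an arbitrary 64 MB figure:

#include <sys/resource.h>
#include <stdio.h>

/* Cap this process's (and its children's) resident set at 64 MB.
 * As with the 'memoryuse' limit discussed above, the limit is only
 * expected to bite when memory is actually under pressure. */
static int cap_rss(void)
{
    struct rlimit rl;

    rl.rlim_cur = 64UL * 1024 * 1024;   /* soft limit */
    rl.rlim_max = 64UL * 1024 * 1024;   /* hard limit */
    if (setrlimit(RLIMIT_RSS, &rl) < 0) {
        perror("setrlimit(RLIMIT_RSS)");
        return -1;
    }
    return 0;
}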
* Re: on load control / process swapping 2001-05-18 6:20 ` Matt Dillon @ 2001-05-18 10:00 ` Andrew Reilly 2001-05-18 13:49 ` Jonathan Morton 1 sibling, 0 replies; 39+ messages in thread From: Andrew Reilly @ 2001-05-18 10:00 UTC (permalink / raw) To: Matt Dillon Cc: Terry Lambert, Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Thu, May 17, 2001 at 11:20:23PM -0700, Matt Dillon wrote: >Terry wrote: > :The problem in this case is _truly_ that the program in > :question is _really_ trying to optimize its performance > :at the expense of other programs in the system. > > The linker is seeking randomly as a side effect of > the linking algorithm. It is not doing it on purpose to try > to save memory. Forcing the VM system to think it's > sequential causes the VM system to perform read-aheads, > generally reducing the actual amount of physical seeking > that must occur by increasing the size of the chunks > read from disk. Even if the linker's dataset is huge, > increasing the chunk size is beneficial because linkers > ultimately access the entire object file anyway. Trying > to save a few seeks is far more important then reading > extra data and having to throw half of it away. I know that this problem is real in the case of data base index accesses---databases have data sets larger than RAM almost by definition---and that the problem (of dealing with "randomly" accessed memory mapped files) should be neatly solved in general. But is this issue of linking really the lynch pin? Are there _any_ programs and library sets where the union of the code sizes is larger than physical memory? I haven't looked at the problem myself, but (on the surface) it doesn't seem too likely. There is a grand total of 90M of .a files on my system (/usr/lib, /usr/X11/lib, and /usr/local/lib), and I doubt that even a majority of them would be needed at once. -- Andrew -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-18 6:20 ` Matt Dillon 2001-05-18 10:00 ` Andrew Reilly @ 2001-05-18 13:49 ` Jonathan Morton 2001-05-19 2:18 ` Rik van Riel 1 sibling, 1 reply; 39+ messages in thread From: Jonathan Morton @ 2001-05-18 13:49 UTC (permalink / raw) To: Matt Dillon, Terry Lambert Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan > The problem is not the resident set size, it's the > seeking that the program is causing as a matter of > course. The RSS of 'ld' isn't the problem, no. However, the working-set idea would place an effective and sensible limit of the size of the disk cache, by ensuring that other apps aren't being paged out beyond their non-working sets. Does this make sense? FWIW, I've been running with a 2-line hack in my kernel for some weeks now, which essentially forces the RSS of each process not to be forced below some arbitrary "fair share" of the physical memory available. It's not a very clean hack, but it improves performance by a very large margin under a thrashing load. The only problem I'm seeing is a deadlock when I run out of VM completely, but I think that's a separate issue that others are already working on. To others: is there already a means whereby we can (almost) calculate the WS of a given process? The "accessed" flag isn't a good one, but maybe the 'age' value is better. However, I haven't quite clicked on how the 'age' value is affected in either direction. -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@cyberspace.org (not for attachments) big-mail: chromatix@penguinpowered.com uni-mail: j.d.morton@lancaster.ac.uk The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -----BEGIN GEEK CODE BLOCK----- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*) -----END GEEK CODE BLOCK----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-18 13:49 ` Jonathan Morton @ 2001-05-19 2:18 ` Rik van Riel 2001-05-19 2:56 ` Jonathan Morton 0 siblings, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-19 2:18 UTC (permalink / raw) To: Jonathan Morton Cc: Matt Dillon, Terry Lambert, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Fri, 18 May 2001, Jonathan Morton wrote: > FWIW, I've been running with a 2-line hack in my kernel for some weeks > now, which essentially forces the RSS of each process not to be forced > below some arbitrary "fair share" of the physical memory available. > It's not a very clean hack, but it improves performance by a very > large margin under a thrashing load. The only problem I'm seeing is a > deadlock when I run out of VM completely, but I think that's a > separate issue that others are already working on. I'm pretty sure I know what you're running into. Say you guarantee a minimum of 3% of memory for each process; now when you have 30 processes running your memory is full and you cannot reclaim any pages when one of the processes runs into a page fault. The minimum RSS guarantee is a really nice thing to prevent the proverbial root shell from thrashing, but it really only works if you drop such processes every once in a while and swap them out completely. You especially need to do this when you're getting tight on memory and you have idle processes sitting around using their minimum RSS worth of RAM ;) It'd work great together with load control though. I guess I should post a patch for - simple&naive - load control code once I've got the inodes and the dirty page writeout code balancing fixed. regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-19 2:18 ` Rik van Riel @ 2001-05-19 2:56 ` Jonathan Morton 0 siblings, 0 replies; 39+ messages in thread From: Jonathan Morton @ 2001-05-19 2:56 UTC (permalink / raw) To: Rik van Riel Cc: Matt Dillon, Terry Lambert, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan >> FWIW, I've been running with a 2-line hack in my kernel for some weeks >> now, which essentially forces the RSS of each process not to be forced >> below some arbitrary "fair share" of the physical memory available. >> It's not a very clean hack, but it improves performance by a very >> large margin under a thrashing load. The only problem I'm seeing is a >> deadlock when I run out of VM completely, but I think that's a >> separate issue that others are already working on. > >I'm pretty sure I know what you're running into. > >Say you guarantee a minimum of 3% of memory for each process; >now when you have 30 processes running your memory is full and >you cannot reclaim any pages when one of the processes runs >into a page fault. Actually I already thought of that one, and made it a "fair share" of the system rather than a fixed amount. IOW, the guaranteed amount is something like (total_memory / nr_processes). I think I was even sane enough to lower this value slightly to allow for some buffer/cache memory, but I didn't allow for locked pages (including the kernel itself). The deadlock happened when the swap ran out, not the physical RAM, and is independent of this particular hack - remember I'm running with some out_of_memory() fixes and some other hackery I did a month or so ago (remember that massive "OOM killer" thread?). I should try to figure those out and present cleaned-up versions for further perusal... -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@cyberspace.org (not for attachments) big-mail: chromatix@penguinpowered.com uni-mail: j.d.morton@lancaster.ac.uk The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -----BEGIN GEEK CODE BLOCK----- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*) -----END GEEK CODE BLOCK----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
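Jonathan's "fair share" guarantee reduces to a simple check at page-reclaim time; the following is a sketch of the idea only, not his actual two-line patch (which isn't shown in the thread), and all the names are invented:

/* Sketch of a fair-share RSS floor for the page reclaimer: do not
 * steal pages from a process whose resident set is already at or
 * below an equal share of reclaimable memory.  Locked and kernel
 * pages are ignored for simplicity. */
static unsigned long fair_share_pages(unsigned long reclaimable_pages,
                                      unsigned long nr_processes)
{
    return nr_processes ? reclaimable_pages / nr_processes
                        : reclaimable_pages;
}

static int may_steal_from(unsigned long proc_rss_pages,
                          unsigned long reclaimable_pages,
                          unsigned long nr_processes)
{
    return proc_rss_pages > fair_share_pages(reclaimable_pages, nr_processes);
}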
* Re: on load control / process swapping
2001-05-16 17:41 ` Rik van Riel
2001-05-16 17:54 ` Matt Dillon
@ 2001-05-16 17:57 ` Alfred Perlstein
2001-05-16 18:01 ` Matt Dillon
1 sibling, 1 reply; 39+ messages in thread
From: Alfred Perlstein @ 2001-05-16 17:57 UTC (permalink / raw)
To: Rik van Riel
Cc: Matt Dillon, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
* Rik van Riel <riel@conectiva.com.br> [010516 13:42] wrote:
> On Wed, 16 May 2001, Matt Dillon wrote:
>
> > In regards to the particular case of scanning a huge multi-gigabyte
> > file, FreeBSD has a sequential detection heuristic which does a
> > pretty good job preventing cache blow-aways by depressing the priority
> > of the data as it is read or written. FreeBSD will still try to cache
> > a good chunk, but it won't sacrifice all available memory. If you
> > access the data via the VM system, through mmap, you get even more
> > control through the madvise() syscall.
>
> There's one thing "wrong" with the drop-behind idea though;
> it penalises data even when it's still in core and we're
> reading it for the second or third time.
>
> Maybe it would be better to only do drop-behind when we're
> actually allocating new memory for the vnode in question and
> let re-use of already present memory go "unpunished" ?
>
> Hmmm, now that I think about this more, it _could_ introduce
> some different fairness issues. Darn ;)
Both of you guys are missing the point.
The directio interface is meant to reduce the stress of a large
sequential operation on a file where caching is of no use.
Even if you depress the worthiness of the pages you've still blown
rather large amounts of unrelated data out of the cache in order to
allocate new cacheable pages.
A simple solution would involve passing along flags such that if the
IO occurs to a non-previously-cached page the buf/page is immediately
placed on the free list upon completion. That way the next IO can pull
the now useless bufferspace from the freelist.
Basically you add another buffer queue for "throw away" data that
exists as a "barely cached" queue. This way your normal data doesn't
compete on the LRU with non-cached data.
As a hack, it looks like one could use the QUEUE_EMPTYKVA buffer queue
under FreeBSD for this, however I think one might lose the minimal
amount of caching that could be done.
If the direct IO happens to a page that's previously cached you adhere
to the previous behavior.
A fancier approach might map user pages into the kernel to do the IO
directly, however on large MP systems this may cause pain because the
VM may need to issue IPIs to invalidate TLB entries.
It's quite simple in theory, the hard part is the code.
-Alfred Perlstein
--
Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.
http://www.egr.unlv.edu/~slumos/on-netbsd.html
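Alfred's "barely cached" queue amounts to a flag that travels with the I/O and is honored at completion time; a hypothetical sketch of that shape (none of these names exist in FreeBSD, and the placeholder release functions stand in for the real queue manipulation):

/* Hypothetical sketch: a per-I/O flag marking data as read-once, and
 * a completion path that retires such pages to the reuse list instead
 * of the normal LRU. */
#define IO_READ_ONCE  0x01              /* caller says: do not cache this */

struct io_buf {
    int flags;
    int was_cached;                     /* page was resident before the I/O */
};

void release_to_freelist(struct io_buf *bp);    /* placeholder */
void release_to_lru(struct io_buf *bp);         /* placeholder */

void io_done(struct io_buf *bp)
{
    if ((bp->flags & IO_READ_ONCE) && !bp->was_cached) {
        /* Fresh read-once data: recycle the buffer immediately so it
         * never competes with cached data on the LRU. */
        release_to_freelist(bp);
    } else {
        /* Previously cached (or ordinary) data keeps its LRU standing. */
        release_to_lru(bp);
    }
}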
* Re: on load control / process swapping 2001-05-16 17:57 ` Alfred Perlstein @ 2001-05-16 18:01 ` Matt Dillon 2001-05-16 18:10 ` Alfred Perlstein 0 siblings, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-16 18:01 UTC (permalink / raw) To: Alfred Perlstein Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan :Both of you guys are missing the point. : :The directio interface is meant to reduce the stress of a large :seqential operation on a file where caching is of no use. : :Even if you depress the worthyness of the pages you've still :blown rather large amounts of unrelated data out of the cache :in order to allocate new cacheable pages. : :A simple solution would involve passing along flags such that if :the IO occurs to a non-previously-cached page the buf/page is :immediately placed on the free list upon completion. That way the :next IO can pull the now useless bufferspace from the freelist. : :Basically you add another buffer queue for "throw away" data that :exists as a "barely cached" queue. This way your normal data :doesn't compete on the LRU with non-cached data. : :As a hack one it looks like one could use the QUEUE_EMPTYKVA :buffer queue under FreeBSD for this, however I think one might :loose the minimal amount of caching that could be done. : :If the direct IO happens to a page that's previously cached :you adhere to the previous behavior. : :A more fancy approach might map in user pages into the kernel to :do the IO directly, however on large MP this may cause pain because :the vm may need to issue ipi to invalidate tlb entries. : :It's quite simple in theory, the hard part is the code. : :-Alfred Perlstein I think someone tried to implement O_DIRECT a while back, but it was fairly complex to try to do away with caching entirely. I think our best bet to 'start' an implementation of O_DIRECT is to support the flag in open() and fcntl(), and have it simply modify the sequential detection heuristic to throw away pages and buffers rather then simply depressing their priority. Eventually we can implement the direct-I/O piece of the equation. I could do this first part in an hour, I think. When I get home.... -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
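From the application side, the flag Matt describes would be used the way O_DIRECT is used elsewhere; a hedged sketch, assuming posix_memalign() is available and that 4 KB alignment satisfies the platform's direct-I/O requirements:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

#ifndef O_DIRECT
#define O_DIRECT 0          /* fall back to ordinary cached I/O if unsupported */
#endif

int scan_once(const char *path)
{
    void *buf = NULL;
    ssize_t n;
    int fd = open(path, O_RDONLY | O_DIRECT);

    if (fd < 0) {
        perror("open");
        return -1;
    }
    /* Direct I/O typically requires sector- or page-aligned buffers. */
    if (posix_memalign(&buf, 4096, 1 << 20) != 0) {
        close(fd);
        return -1;
    }
    while ((n = read(fd, buf, 1 << 20)) > 0)
        ;                   /* consume the data once; nothing is cached */

    free(buf);
    close(fd);
    return 0;
}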
* Re: on load control / process swapping 2001-05-16 18:01 ` Matt Dillon @ 2001-05-16 18:10 ` Alfred Perlstein 0 siblings, 0 replies; 39+ messages in thread From: Alfred Perlstein @ 2001-05-16 18:10 UTC (permalink / raw) To: Matt Dillon Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan * Matt Dillon <dillon@earth.backplane.com> [010516 14:01] wrote: > > I think someone tried to implement O_DIRECT a while back, but it > was fairly complex to try to do away with caching entirely. > > I think our best bet to 'start' an implementation of O_DIRECT is > to support the flag in open() and fcntl(), and have it simply > modify the sequential detection heuristic to throw away pages > and buffers rather then simply depressing their priority. yes, as i said: > :A simple solution would involve passing along flags such that if > :the IO occurs to a non-previously-cached page the buf/page is > :immediately placed on the free list upon completion. That way the > :next IO can pull the now useless bufferspace from the freelist. > : > :Basically you add another buffer queue for "throw away" data that > :exists as a "barely cached" queue. This way your normal data > :doesn't compete on the LRU with non-cached data. > > Eventually we can implement the direct-I/O piece of the equation. > > I could do this first part in an hour, I think. When I get home.... Thank you. -Alfred -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com>]
* Re: on load control / process swapping [not found] <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com> @ 2001-05-18 20:13 ` Jonathan Morton 0 siblings, 0 replies; 39+ messages in thread From: Jonathan Morton @ 2001-05-18 20:13 UTC (permalink / raw) To: Mark_H_Johnson; +Cc: linux-mm >I'm not sure you have these items measured in the kernel at this point, but >VAX/VMS used the page replacement rate to control the working set size >(Linux term - resident set size) within three limits... > - minimum working set size > - maximum guaranteed working set size (under memory pressure) > - maximum extended working set size (no memory pressure) >The three sizes above were enforced on a per user basis. I could see using >the existing Linux RSS limit for the maximum guarantee (or extended) and >then ratios for the other items. Seems reasonable, but remember RSS != working set. Under "normal" conditions we want all processes to have all the memory they want, then when memory pressure encroaches we want to keep as many processes as possible with their working set swapped in (but no more). >There were several parameters - some on a per system basis and others on a >per user basis [I can't recall which were which] to control this >including... > - amount to increase the working set size (say 5-10% of the maximum) > - amount to decrease the working set size (usually about 1/2 the increase >size value) > - pages per second replaced in the working set to trigger a possible >increase (say 10) > - pages per second replaced in the working set to trigger a possible >decrease (say 2 or 1) >A new job would start at its minimum size and grow quickly to either the >maximum limit or its natural working set size. If at the limit, it would >thrash but not necessarily affect the other jobs on the system. I am not >sure how the numbers I listed would apply with a fast system with huge >memories - the values I listed were what I recall on what would be a small >system today (4M to 64M). Hmm, it looks to me like the algorithm above relies on a continuous rate of paging. This is a bad thing on a modern system where the swap device is so much slower than main memory. However, the idea is an interesting one and could possibly be adapted... The key thing is that maximum performance for a given process (particularly a small one) is when *no* paging is occurring in relation to it. Under memory pressure, this is quite hard to achieve unless the working set is already known. Thus the VMS model (if I understood it correctly) doesn't work so well for modern systems running Linux. What i was really asking, to make the question clearer is "how does page->age work? And if it's not suitable for WS calculation in the ways that I suspect, what else could be used - that is *already* instrumented?". -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@cyberspace.org (not for attachments) big-mail: chromatix@penguinpowered.com uni-mail: j.d.morton@lancaster.ac.uk The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -----BEGIN GEEK CODE BLOCK----- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*) -----END GEEK CODE BLOCK----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
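The VMS-style parameters Mark describes and Jonathan quotes translate naturally into a small control loop; a sketch, assuming per-process fault rates are already being sampled once per second, with all thresholds and names purely illustrative:

/* Grow a process's working-set limit when its recent fault rate is
 * high, shrink it when the rate is low, and keep the limit between a
 * per-process minimum and maximum. */
struct ws_limits {
    unsigned long min_pages;       /* minimum working set              */
    unsigned long max_pages;       /* maximum (extended) working set   */
    unsigned long cur_pages;       /* current limit for this process   */
};

#define WS_GROW_FAULT_RATE    10   /* faults/sec that trigger growth   */
#define WS_SHRINK_FAULT_RATE   2   /* faults/sec that allow shrinking  */
#define WS_GROW_STEP         128   /* pages                            */
#define WS_SHRINK_STEP        64   /* pages                            */

void ws_adjust(struct ws_limits *ws, unsigned long faults_per_sec)
{
    if (faults_per_sec >= WS_GROW_FAULT_RATE &&
        ws->cur_pages + WS_GROW_STEP <= ws->max_pages)
        ws->cur_pages += WS_GROW_STEP;
    else if (faults_per_sec <= WS_SHRINK_FAULT_RATE &&
             ws->cur_pages >= ws->min_pages + WS_SHRINK_STEP)
        ws->cur_pages -= WS_SHRINK_STEP;
}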
* on load control / process swapping @ 2001-05-07 21:16 Rik van Riel 2001-05-07 22:50 ` Matt Dillon 2001-05-08 12:25 ` Scott F. Kaplan 0 siblings, 2 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-07 21:16 UTC (permalink / raw) To: arch; +Cc: linux-mm, Matt Dillon, sfkaplan Hi, after staring at the code for a long long time, I finally figured out exactly why FreeBSD's load control code (the process swapping in vm_glue.c) can never work in many scenarios. In short, the process suspension / wake up code only does load control in the sense that system load is reduced, but absolutely no effort is made to ensure that individual programs can run without thrashing. This, of course, kind of defeats the purpose of doing load control in the first place. To see this situation in some more detail, lets first look at how the current process suspension code has evolved over time. Early paging Unixes, including earlier BSDs, had a rate-limited clock algorithm for the pageout code, where the VM subsystem would only scan (and page) memory out at a rate of fastscan pages per second. Whenever the paging system wasn't able to keep up, free memory would get below a certain threshold and memory load control (in the form of process suspension) kicked in. As soon as free memory (averaged over a few seconds) got over this threshold, processes get swapped in again. Because of the exact "speed limit" for the paging code, this would give a slow rotation of memory-resident progesses at a paging rate well below the thashing threshold. More modern Unixes, like FreeBSD, NetBSD or Linux, however, don't have the artificial speed limit on pageout. This means the pageout code can go on freeing memory until well beyond the trashing point of the system. It also means that the amount of free memory is no longer any indication of whether the system is thrashing or not. Add to that the fact that the classical load control in BSD resumes a suspended task whenever the system is above the (now not very meaningful) free memory threshold, regardless of whether the resident tasks have had the opportunity to make any progress ... which of course only encourages more thrashing instead of letting the system work itself out of the overload situation. Any solution will have to address the following points: 1) allow the resident processes to stay resident long enough to make progess 2) make sure the resident processes aren't thrashing, that is, don't let new processes back in memory if none of the currently resident processes is "ready" to be suspended 3) have a mechanism to detect thrashing in a VM subsystem which isn't rate-limited (hard?) and, for extra brownie points: 4) fairness, small processes can be paged in and out faster, so we can suspend&resume them faster; this has the side effect of leaving the proverbial root shell more usable 5) make sure already resident processes cannot create a situation that'll keep the swapped out tasks out of memory forever ... but don't kill performance either, since bad performance means we cannot get out of the bad situation we're in Points 1), 2) and 4) are relatively easy to address by simply keeping resident tasks unswappable for a long enough time that they've been able to do real work in an environment where 3) indicates we're not thrashing. 3) is the hard part. We know we're not thrashing when we don't have ongoing page faults all the time, but (say) only 50% of the time. 
However, I still have no idea how to determine when we _are_ thrashing,
since a system which always has 10 ongoing page faults may still be
functioning without thrashing...
This is the part where I cannot hand over a ready solution but where we
have to figure out a solution together.
(and it's also the reason I cannot "send a patch" ... I know the current
scheme cannot possibly work all the time, I understand why, but I just
don't have a solution to the problem ... yet)
regards,
Rik
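One way to make Rik's "50% of the time" intuition concrete is to measure the fraction of available time that resident processes spend stalled on page-fault I/O; a sketch, with an invented threshold and invented counter names:

/* Sample, once per second, how much time runnable processes spent
 * blocked in page-fault I/O versus actually running, and call it
 * thrashing when the stalled fraction dominates. */
struct thrash_sample {
    unsigned long run_ticks;        /* ticks spent executing            */
    unsigned long faultwait_ticks;  /* ticks blocked on page-fault I/O  */
};

int looks_like_thrashing(const struct thrash_sample *s)
{
    unsigned long total = s->run_ticks + s->faultwait_ticks;

    if (total == 0)
        return 0;                   /* idle system: nothing to decide */
    /* more than half of all available time lost to paging */
    return s->faultwait_ticks * 2 > total;
}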
* Re: on load control / process swapping 2001-05-07 21:16 Rik van Riel @ 2001-05-07 22:50 ` Matt Dillon 2001-05-07 23:35 ` Rik van Riel 2001-05-08 20:52 ` Kirk McKusick 2001-05-08 12:25 ` Scott F. Kaplan 1 sibling, 2 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-07 22:50 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan This is accomplished as a side effect to the way the page queues are handled. A page placed in the active queue is not allowed to be moved out of that queue for a minimum period of time based on page aging. See line 500 or so of vm_pageout.c (in -stable) . Thus when a process wakes up and pages a bunch of pages in, those pages are guarenteed to stay in-core for a period of time no matter what level of memory stress is occuring. :2) make sure the resident processes aren't thrashing, : that is, don't let new processes back in memory if : none of the currently resident processes is "ready" : to be suspended When a process is swapped out, the process is removed from the run queue and the P_INMEM flag is cleared. The process is only woken up when faultin() is called (vm_glue.c line 312). faultin() is only called from the scheduler() (line 340 of vm_glue.c) and the scheduler only runs when the VM system indicates a minimum number of free pages are available (vm_page_count_min()), which you can adjust with the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings on how much memory the system has). So what occurs is that the system comes under extreme memory pressure and starts to swapout blocked processes. This reduces memory pressure over time. When memory pressure is sufficiently reudced the scheduler wakes up a swapped-out process (one at a time). There might be some fine tuning that we can do here, such as try to choose a better process to swapout (right now it's priority based which isn't the best way to do it). :3) have a mechanism to detect thrashing in a VM : subsystem which isn't rate-limited (hard?) In FreeBSD, rate-limiting is a function of a lightly loaded system. We rate-limit page laundering (pageouts). However, if the rate-limited laundering is not sufficient to reach our free + cache page targets, we take another laundering loop and this time do not limit it at all. Thus under heavy memory pressure, no real rate limiting occurs. The system will happily pagein and pageout megabytes/sec. The reason we do this is because David Greenman and John Dyson found a long time ago that attempting to rate limit paging does not actually solve the thrashing problem, it actually makes it worse... So they solved the problem another way (see my answers for #1 and #2). It isn't the paging operations themselves that cause thrashing. :and, for extra brownie points: :4) fairness, small processes can be paged in and out : faster, so we can suspend&resume them faster; this : has the side effect of leaving the proverbial root : shell more usable Small process can contribute to thrashing as easily as large processes can under extreme memory pressure... for example, take an overloaded shell machine. *ALL* processes are 'small' processes in that case, or most of them are, and in great numbers they can be the cause. So no test that specifically checks the size of the process can be used to give it any sort of priority. Additionally, *idle* small processes are also great contributers to the VM subsystem in regards to clearing out idle pages. 
For example, on a heavily loaded shell machine more than 80% of the
'small processes' have been idle for long periods of time and it is
exactly our ability to page them out that allows us to extend the
machine's operational life and move the thrashing threshold farther
away. The last thing we want to do is make a 'fix' that prevents us
from paging out idle small processes. It would kill the machine.
:5) make sure already resident processes cannot create
: a situation that'll keep the swapped out tasks out
: of memory forever ... but don't kill performance either,
: since bad performance means we cannot get out of the
: bad situation we're in
When the system starts swapping processes out, it continues to swap
them out until memory pressure goes down. With memory pressure down
processes are swapped back in again one at a time, typically in FIFO
order. So this situation will generally not occur.
Basically we have all the algorithms in place to deal with thrashing.
I'm sure that there are a few places where we can optimize things...
for example, we can certainly tune the swapout algorithm itself.
-Matt
:regards,
:
:Rik
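The swap-in gate Matt describes reduces to a threshold test on recovered memory; a sketch with illustrative names only (the real FreeBSD logic lives in vm_glue.c and vm_pageout.c as he notes, and is not reproduced here):

/* Only fault one suspended process back in when free + cache pages
 * have recovered past a threshold, so swapins cannot immediately
 * re-create the pressure that forced the swapouts. */
struct vm_counters {
    unsigned long free_pages;
    unsigned long cache_pages;
    unsigned long swapin_threshold;     /* cf. the vm.v_free_min tuning */
};

int may_swap_one_process_in(const struct vm_counters *vm)
{
    return vm->free_pages + vm->cache_pages > vm->swapin_threshold;
}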
* Re: on load control / process swapping 2001-05-07 22:50 ` Matt Dillon @ 2001-05-07 23:35 ` Rik van Riel 2001-05-08 0:56 ` Matt Dillon 2001-05-08 20:52 ` Kirk McKusick 1 sibling, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-07 23:35 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Mon, 7 May 2001, Matt Dillon wrote: > :1) allow the resident processes to stay resident long > : enough to make progess > > This is accomplished as a side effect to the way the page queues > are handled. A page placed in the active queue is not allowed > to be moved out of that queue for a minimum period of time based > on page aging. See line 500 or so of vm_pageout.c (in -stable) . > > Thus when a process wakes up and pages a bunch of pages in, those > pages are guarenteed to stay in-core for a period of time no matter > what level of memory stress is occuring. I don't see anything limiting the speed at which the active list is scanned over and over again. OTOH, you are right that a failure to deactivate enough pages will trigger the swapout code ..... This sure is a subtle interaction ;) > :2) make sure the resident processes aren't thrashing, > : that is, don't let new processes back in memory if > : none of the currently resident processes is "ready" > : to be suspended > > When a process is swapped out, the process is removed from the run > queue and the P_INMEM flag is cleared. The process is only woken up > when faultin() is called (vm_glue.c line 312). faultin() is only > called from the scheduler() (line 340 of vm_glue.c) and the scheduler > only runs when the VM system indicates a minimum number of free pages > are available (vm_page_count_min()), which you can adjust with > the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings > on how much memory the system has). But ... is this a good enough indication that the processes currently resident have enough memory available to make any progress ? Especially if all the currently resident processes are waiting in page faults, won't that make it easier for the system to find pages to swap out, etc... ? One thing I _am_ wondering though: the pageout and the pagein thresholds are different. Can't this lead to problems where we always hit both the pageout threshold -and- the pagein threshold and the system thrashes swapping processes in and out ? > :3) have a mechanism to detect thrashing in a VM > : subsystem which isn't rate-limited (hard?) > > In FreeBSD, rate-limiting is a function of a lightly loaded system. > We rate-limit page laundering (pageouts). However, if the rate-limited > laundering is not sufficient to reach our free + cache page targets, > we take another laundering loop and this time do not limit it at all. > > Thus under heavy memory pressure, no real rate limiting occurs. The > system will happily pagein and pageout megabytes/sec. The reason we > do this is because David Greenman and John Dyson found a long time > ago that attempting to rate limit paging does not actually solve the > thrashing problem, it actually makes it worse... So they solved the > problem another way (see my answers for #1 and #2). It isn't the > paging operations themselves that cause thrashing. Agreed on all points ... I'm just wondering how well 1) and 2) still work after all the changes that were made to the VM in the last few years. They sure are subtle ... 
> :and, for extra brownie points: > :4) fairness, small processes can be paged in and out > : faster, so we can suspend&resume them faster; this > : has the side effect of leaving the proverbial root > : shell more usable > > Small process can contribute to thrashing as easily as large > processes can under extreme memory pressure... for example, > take an overloaded shell machine. *ALL* processes are 'small' > processes in that case, or most of them are, and in great numbers > they can be the cause. So no test that specifically checks the > size of the process can be used to give it any sort of priority. There's a test related to 2) though ... A small process needs to be in memory less time than a big process in order to make progress, so it can be swapped out earlier. It can also be swapped back in earlier, giving small processes shorter "time slices" for swapping than what large processes have. I'm not quite sure how much this would matter, though... > :5) make sure already resident processes cannot create > : a situation that'll keep the swapped out tasks out > : of memory forever ... but don't kill performance either, > : since bad performance means we cannot get out of the > : bad situation we're in > > When the system starts swapping processes out, it continues to swap > them out until memory pressure goes down. With memory pressure down > processes are swapped back in again one at a time, typically in FIFO > order. So this situation will generally not occur. > > Basically we have all the algorithms in place to deal with thrashing. > I'm sure that there are a few places where we can optimize things... > for example, we can certainly tune the swapout algorithm itself. Interesting, FreeBSD indeed _does_ seem to have all of the things in place (though the interactions between the various parts seem to be carefully hidden ;)). They indeed should work for lots of scenarios, but things like the subtlety of some of the code and the fact that the swapin and swapout thresholds are fairly unrelated look a bit worrying... regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-07 23:35 ` Rik van Riel @ 2001-05-08 0:56 ` Matt Dillon 2001-05-12 14:23 ` Rik van Riel 0 siblings, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-08 0:56 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan :> to be moved out of that queue for a minimum period of time based :> on page aging. See line 500 or so of vm_pageout.c (in -stable) . :> :> Thus when a process wakes up and pages a bunch of pages in, those :> pages are guarenteed to stay in-core for a period of time no matter :> what level of memory stress is occuring. : :I don't see anything limiting the speed at which the active list :is scanned over and over again. OTOH, you are right that a failure :to deactivate enough pages will trigger the swapout code ..... : :This sure is a subtle interaction ;) Look at the loop line 1362 of vm_pageout.c. Note that it enforces a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable to clean sufficient pages in two loops. The tsleep is not woken up by anyone while waiting that 1/2 second becuase vm_pages_needed has not been cleared yet. This is what is limiting the page queue scan. :> When a process is swapped out, the process is removed from the run :> queue and the P_INMEM flag is cleared. The process is only woken up :> when faultin() is called (vm_glue.c line 312). faultin() is only :> called from the scheduler() (line 340 of vm_glue.c) and the scheduler :> only runs when the VM system indicates a minimum number of free pages :> are available (vm_page_count_min()), which you can adjust with :> the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings :> on how much memory the system has). : :But ... is this a good enough indication that the processes :currently resident have enough memory available to make any :progress ? Yes. Consider detecting the difference between a large process accessing its pages randomly, and a small process accessing a relatively small set of pages over and over again. Now consider what happens when the system gets overloaded. The small process will be able to access its pages enough that they will get page priority over the larger process. The larger process, due to the more random accesses (or simply the fact that it is accessing a larger set of pages) will tend to stall more on pagein I/O which has the side effect of reducing the large process's access rate on all of its pages. The result: small processes get more priority just by being small. :Especially if all the currently resident processes are waiting :in page faults, won't that make it easier for the system to find :pages to swap out, etc... ? : :One thing I _am_ wondering though: the pageout and the pagein :thresholds are different. Can't this lead to problems where we :always hit both the pageout threshold -and- the pagein threshold :and the system thrashes swapping processes in and out ? The system will not page out a page it has just paged in due to the center-of-the-road initialization of act_count (the page aging). My experience at BEST was that both pagein and pageout activity occured simultaniously, but the fact had no detrimental effect on the system. You have to treat the pagein and pageout operations independantly because, in fact, they are only weakly related to each other. The only optimization you make, to reduce thrashing, is to not allow a just-paged-in page to immediately turn around and be paged out. 
I could probably make this work even better by setting the vm_page_t's act_count to its max value when paging in from swap. I'll think about doing that. The pagein and pageout rates have nothing to do with thrashing, per say, and should never be arbitrarily limited. Consider the difference between a system that is paing heavily and a system with only two small processes (like cp's) competing for disk I/O. Insofar as I/O goes, there is no difference. You can have a perfectly running system with high pagein and pageout rates. It's only when the paging I/O starts to eat into pages that are in active use where thrashing begins to occur. Think of a hotdog being eaten from both ends by two lovers. Memory pressure (active VM pages) eat away at one end, pageout I/O eats away at the other. You don't get fireworks until they meet. :> ago that attempting to rate limit paging does not actually solve the :> thrashing problem, it actually makes it worse... So they solved the :> problem another way (see my answers for #1 and #2). It isn't the :> paging operations themselves that cause thrashing. : :Agreed on all points ... I'm just wondering how well 1) and 2) :still work after all the changes that were made to the VM in :the last few years. They sure are subtle ... The algorithms mostly stayed the same. Much of the work was to remove artificial limitations that were reducing performance (due to the existance of greater amounts of memory, faster disks, and so forth...). I also spent a good deal of time removing 'restart' cases from the code that was causing a lot of cpu-wasteage in certain cases. What few restart cases remain just don't occur all that often. And I've done other things like extend the heuristics we already use for read()/write() to the VM system and change heuristic variables into per-vm-map elements rather then sharing them with read/write within the vnode. Etc. :> Small process can contribute to thrashing as easily as large :> processes can under extreme memory pressure... for example, :> take an overloaded shell machine. *ALL* processes are 'small' :> processes in that case, or most of them are, and in great numbers :> they can be the cause. So no test that specifically checks the :> size of the process can be used to give it any sort of priority. : :There's a test related to 2) though ... A small process needs :to be in memory less time than a big process in order to make :progress, so it can be swapped out earlier. Not necessarily. It depends whether the small process is cpu-bound or interactive. A cpu-bound small process should be allowed to run and not swapped out. An interactive small process can be safely swapped if idle for a period of time, because it can be swapped back in very quickly. It should not be swapped if it isn't idle (someone is typing, for example), because that would just waste disk I/O paging out and then paging right back in. You never want to swapout a small process gratuitously simply because it is small. :It can also be swapped back in earlier, giving small processes :shorter "time slices" for swapping than what large processes :have. I'm not quite sure how much this would matter, though... Both swapin and swapout activities are demand paged, but will be clustered if possible. I don't think there would be any point trying to conditionalize the algorithm based on the size of the process. The size has its own indirect positive effects which I think are sufficient. 
:Interesting, FreeBSD indeed _does_ seem to have all of the things in :place (though the interactions between the various parts seem to be :carefully hidden ;)). : :They indeed should work for lots of scenarios, but things like the :subtlety of some of the code and the fact that the swapin and :swapout thresholds are fairly unrelated look a bit worrying... : :regards, : :Rik I don't think it's possible to write a nice neat thrash-handling algorithm. It's a bunch of algorithms all working together, all closely tied to the VM page cache. Each taken alone is fairly easy to describe and understand. All of them together result in complex interactions that are very easy to break if you make a mistake. It usually takes me a couple of tries to get a solution to a problem in place without breaking something else (performance-wise) in the process. For example, I fubar'd heavy load performance for a month in FreeBSD-4.2 when I 'fixed' the pageout scan laundering algorithm. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-08 0:56 ` Matt Dillon @ 2001-05-12 14:23 ` Rik van Riel 2001-05-12 17:21 ` Matt Dillon 2001-05-12 23:58 ` Matt Dillon 0 siblings, 2 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-12 14:23 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Mon, 7 May 2001, Matt Dillon wrote: > Look at the loop line 1362 of vm_pageout.c. Note that it enforces > a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable > to clean sufficient pages in two loops. The tsleep is not woken up > by anyone while waiting that 1/2 second becuase vm_pages_needed has > not been cleared yet. This is what is limiting the page queue scan. Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that only kicks in when the system is under very heavy memory pressure. That explains why FreeBSD's thrashing detection code works... ;) (I'm not convinced, though, that limiting the speed at which we scan the active list is a good thing. There are some arguments in favour of speed limiting, but it mostly seems to come down to a short-cut to thrashing detection...) > :But ... is this a good enough indication that the processes > :currently resident have enough memory available to make any > :progress ? > > Yes. Consider detecting the difference between a large process accessing > its pages randomly, and a small process accessing a relatively small > set of pages over and over again. Now consider what happens when the > system gets overloaded. The small process will be able to access its > pages enough that they will get page priority over the larger process. > The larger process, due to the more random accesses (or simply the fact > that it is accessing a larger set of pages) will tend to stall more on > pagein I/O which has the side effect of reducing the large process's > access rate on all of its pages. The result: small processes get more > priority just by being small. But if the larger processes never get a chance to make decent progress without thrashing, won't your system be slowed down forever by these (thrashing) large processes? It's nice to protect your small processes from the large ones, but if the large processes don't get to run to completion the system will never get out of thrashing... > :Especially if all the currently resident processes are waiting > :in page faults, won't that make it easier for the system to find > :pages to swap out, etc... ? > : > :One thing I _am_ wondering though: the pageout and the pagein > :thresholds are different. Can't this lead to problems where we > :always hit both the pageout threshold -and- the pagein threshold > :and the system thrashes swapping processes in and out ? > > The system will not page out a page it has just paged in due to the > center-of-the-road initialization of act_count (the page aging). Indeed, the speed limiting of the pageout scanning takes care of this. But still, having the swapout threshold defined as being short of inactive pages while the swapin threshold uses the number of free+cache pages as an indication could lead to the situation where you suspend and wake up processes while it isn't needed. Or worse, suspending one process which easily fit in memory and then waking up another process, which cannot be swapped in because the first process' memory is still sitting in RAM and cannot be removed yet due to the pageout scan speed limiting (and also cannot be used, because we suspended the process). 
The chance of this happening could be quite big in some situations because the swapout and swapin thresholds are measuring things that are only indirectly related... > The pagein and pageout rates have nothing to do with thrashing, per say, > and should never be arbitrarily limited. But they are, with the pageout daemon going to sleep for half a second if it doesn't succeed in freeing enough memory at once. It even does this if a large part of the memory on the active list belongs to a process which has just been suspended because of thrashing... > I don't think it's possible to write a nice neat thrash-handling > algorithm. It's a bunch of algorithms all working together, all > closely tied to the VM page cache. Each taken alone is fairly easy > to describe and understand. All of them together result in complex > interactions that are very easy to break if you make a mistake. Heheh, certainly true ;) cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
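The mismatch described above is easier to see with the two tests side by side (an illustrative sketch; the field and function names are invented, not the FreeBSD macros):

    /* Illustrative only: the two decisions consult different quantities. */
    struct vmstats {
        int free, cache, inactive, active;
        int free_min, inactive_target;
    };

    /* Swap-out side: triggered by a shortage of inactive pages. */
    static int want_swapout(const struct vmstats *v)
    {
        return v->inactive < v->inactive_target;
    }

    /* Swap-in side: triggered by free + cache climbing past a minimum. */
    static int want_swapin(const struct vmstats *v)
    {
        return v->free + v->cache > v->free_min;
    }

    /* Nothing ties the two predicates together, so both can be true at
     * once: suspend one process and immediately start faulting another
     * back in, even though the first one's memory is still resident. */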
* Re: on load control / process swapping 2001-05-12 14:23 ` Rik van Riel @ 2001-05-12 17:21 ` Matt Dillon 2001-05-12 21:17 ` Rik van Riel 2001-05-12 23:58 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-12 17:21 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan : :Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that :only kicks in when the system is under very heavy memory pressure. : :That explains why FreeBSD's thrashing detection code works... ;) : :(I'm not convinced, though, that limiting the speed at which we :scan the active list is a good thing. There are some arguments :in favour of speed limiting, but it mostly seems to come down :to a short-cut to thrashing detection...) Note that there is a big distinction between limiting the page queue scan rate (which we do not do), and sleeping between full scans (which we do). Limiting the page queue scan rate on a page-by-page basis does not scale. Sleeping in between full queue scans (in an extreme case) does scale. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
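The sleep happens once per full scan rather than once per page, which is the point being made above. A simulation-style sketch of how the throttle behaves under sustained pressure (invented helper names; not the actual vm_pageout.c code):

    #include <stdio.h>
    #include <unistd.h>

    #define HZ 100

    static int scan_page_queues(void) { return 0; }  /* pretend nothing could be freed */
    static int page_shortfall(void)   { return 32; } /* pretend the pressure never lets up */

    int main(void)
    {
        int failed_passes = 0;

        for (int scan = 0; scan < 6; scan++) {
            if (scan_page_queues() < page_shortfall())
                failed_passes++;
            else
                failed_passes = 0;

            if (failed_passes >= 2) {
                /* Stands in for tsleep(&vm_pages_needed, PVM, "psleep", hz / 2);
                 * nothing wakes it early because vm_pages_needed is still set,
                 * so the queues get scanned at most twice a second. */
                printf("pass %d came up short, sleeping HZ/2 (%d ticks)\n", scan, HZ / 2);
                usleep(500000);
                failed_passes = 0;
            }
        }
        return 0;
    }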
* Re: on load control / process swapping 2001-05-12 17:21 ` Matt Dillon @ 2001-05-12 21:17 ` Rik van Riel 0 siblings, 0 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-12 21:17 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Sat, 12 May 2001, Matt Dillon wrote: > :Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that > :only kicks in when the system is under very heavy memory pressure. > : > :That explains why FreeBSD's thrashing detection code works... ;) > > Note that there is a big distinction between limiting the page > queue scan rate (which we do not do), and sleeping between full > scans (which we do). Limiting the page queue scan rate on a > page-by-page basis does not scale. Sleeping in between full queue > scans (in an extreme case) does scale. I'm not convinced it's doing a very useful thing, though ;) (see the rest of the email you replied to) Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-12 14:23 ` Rik van Riel 2001-05-12 17:21 ` Matt Dillon @ 2001-05-12 23:58 ` Matt Dillon 2001-05-13 17:22 ` Rik van Riel 1 sibling, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-12 23:58 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan Consider the case where you have one large process and many small processes. If you were to skew things to allow the large process to run at the cost of all the small processes, you have just inconvenienced 98% of your users so one ozob can run a big job. Not only that, but there is no guarentee that the 'big job' will ever finish (a topic of many a paper on scheduling, BTW)... what if it's been running for hours and still has hours to go? Do we blow away the rest of the system to let it run? What if there are several big jobs? If you skew things in favor of one the others could take 60 seconds *just* to recover their RSS when they are finally allowed to run. So much for timesharing... you would have to run each job exclusively for 5-10 minutes at a time to get any sort of effiency, which is not practical in a timeshare system. So there is really very little that you can do. :Indeed, the speed limiting of the pageout scanning takes care of :this. But still, having the swapout threshold defined as being :short of inactive pages while the swapin threshold uses the number :of free+cache pages as an indication could lead to the situation :where you suspend and wake up processes while it isn't needed. : :Or worse, suspending one process which easily fit in memory and :then waking up another process, which cannot be swapped in because :the first process' memory is still sitting in RAM and cannot be :removed yet due to the pageout scan speed limiting (and also cannot :be used, because we suspended the process). We don't suspend running processes, but I do believe FreeBSD is still vulnerable to this issue. Suspending the marked process when it hits the vm_fault code is a good idea and would solve the problem. If the process never takes an allocation fault, it probably doesn't have to be swapped out. The normal pageout would suffice for that process. :> The pagein and pageout rates have nothing to do with thrashing, per say, :> and should never be arbitrarily limited. : :But they are, with the pageout daemon going to sleep for half a :second if it doesn't succeed in freeing enough memory at once. :It even does this if a large part of the memory on the active :list belongs to a process which has just been suspended because :of thrashing... No. I did say the code was complex. A process which has been suspended for thrashing gets all of its pages depressed in priority. The page daemon would have no problem recovering the pages. See line 1458 of vm_pageout.c. This code also enforces the 'memoryuse' resource limit (which is perhaps even more important). It is not necessary to try to launder the pages immediately. Simply depressing their priority is sufficient and it allows for quicker recovery when the thrashing goes away. It also allows us to implement the vm.swap_idle_{threshold1,threshold2,enabled} sysctls trivially, which results in proactive swapping that is extremely useful in certain situations (like shell machines with lots of idle users). The pagedaemon gets behind when there are too many active pages in the system and the pagedaemon is unable to move them to the inactive queue due to the pages still being very active... 
that is, when the active resident set for all processes in the system exceeds available memory. This is what triggers thrashing. Swapping has the side effect of reducing the total active resident set for the system as a whole, fixing the thrashing problem. -Matt :> I don't think it's possible to write a nice neat thrash-handling :> algorithm. It's a bunch of algorithms all working together, all :> closely tied to the VM page cache. Each taken alone is fairly easy :> to describe and understand. All of them together result in complex :> interactions that are very easy to break if you make a mistake. : :Heheh, certainly true ;) : :cheers, : :Rik -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
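For readers without the tree handy, the depress-rather-than-launder behaviour described above reduces to something like this (an illustrative sketch, not the code at line 1458; the constants are invented):

    #include <stddef.h>

    #define ACT_DECLINE 1
    #define ACT_MIN     0

    struct vm_page {
        int act_count;             /* page aging counter */
        struct vm_page *next;
    };

    /*
     * When a process is marked swapped-out (or is over its 'memoryuse'
     * limit), its pages are not laundered on the spot; they are just aged
     * hard so the regular pageout scan reclaims them first.  If the
     * pressure goes away before the scan reaches them, nothing was lost.
     */
    static void depress_process_pages(struct vm_page *resident)
    {
        for (struct vm_page *m = resident; m != NULL; m = m->next) {
            m->act_count -= ACT_DECLINE * 4;   /* made-up factor */
            if (m->act_count < ACT_MIN)
                m->act_count = ACT_MIN;
        }
    }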
* Re: on load control / process swapping 2001-05-12 23:58 ` Matt Dillon @ 2001-05-13 17:22 ` Rik van Riel 2001-05-15 6:38 ` Terry Lambert 0 siblings, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-13 17:22 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Sat, 12 May 2001, Matt Dillon wrote: > :But if the larger processes never get a chance to make decent > :progress without thrashing, won't your system be slowed down > :forever by these (thrashing) large processes? > : > :It's nice to protect your small processes from the large ones, > :but if the large processes don't get to run to completion the > :system will never get out of thrashing... > > Consider the case where you have one large process and many small > processes. If you were to skew things to allow the large process to > run at the cost of all the small processes, you have just inconvenienced > 98% of your users so one ozob can run a big job. So we should not allow just one single large job to take all of memory, but we should allow some small jobs in memory too. > What if there are several big jobs? If you skew things in favor of > one the others could take 60 seconds *just* to recover their RSS when > they are finally allowed to run. So much for timesharing... you > would have to run each job exclusively for 5-10 minutes at a time > to get any sort of effiency, which is not practical in a timeshare > system. So there is really very little that you can do. If you don't do this very slow swapping, NONE of the big tasks will have the opportunity to make decent progress and the system will never get out of thrashing. If we simply make the "swap time slices" for larger processes larger than for smaller processes we: 1) have a better chance of the large jobs getting any work done 2) won't have the large jobs artificially increase memory load, because all time will be spent removing each other's RSS 3) can have more small jobs in memory at once, due to 2) 4) can be better for interactive performance due to 3) 5) have a better chance of getting out of the overload situation sooner I realise this would make the scheduling algorithm slightly more complex and I'm not convinced doing this would be worth it myself, but we may want to do some brainstorming over this ;) regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
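A back-of-the-envelope version of the "swap time slice" idea, just to make the proposal concrete (every name and constant here is invented; it is not code from either kernel):

    /* Bigger resident set (and therefore bigger swap-in cost) buys a
     * longer guaranteed stay in memory before the process may be
     * suspended again. */
    #define MIN_RESIDENT_SECS 10
    #define SECS_PER_MB        2          /* invented scaling factor */

    static long swap_time_slice(long rss_pages, long page_size)
    {
        long rss_mb = rss_pages * page_size / (1024 * 1024);
        return MIN_RESIDENT_SECS + rss_mb * SECS_PER_MB;
    }

    /* A 2 MB shell would get ~14 seconds before it is eligible again; a
     * 200 MB job would get ~410 seconds, long enough to amortize the cost
     * of reloading its resident set. */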
* Re: on load control / process swapping 2001-05-13 17:22 ` Rik van Riel @ 2001-05-15 6:38 ` Terry Lambert 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Terry Lambert @ 2001-05-15 6:38 UTC (permalink / raw) To: Rik van Riel; +Cc: Matt Dillon, arch, linux-mm, sfkaplan Rik van Riel wrote: > So we should not allow just one single large job to take all > of memory, but we should allow some small jobs in memory too. Historically, this problem is solved with a "working set quota". > If you don't do this very slow swapping, NONE of the big tasks > will have the opportunity to make decent progress and the system > will never get out of thrashing. > > If we simply make the "swap time slices" for larger processes > larger than for smaller processes we: > > 1) have a better chance of the large jobs getting any work done > 2) won't have the large jobs artificially increase memory load, > because all time will be spent removing each other's RSS > 3) can have more small jobs in memory at once, due to 2) > 4) can be better for interactive performance due to 3) > 5) have a better chance of getting out of the overload situation > sooner > > I realise this would make the scheduling algorithm slightly > more complex and I'm not convinced doing this would be worth > it myself, but we may want to do some brainstorming over this ;) A per vnode working set quota with a per use count adjust would resolve most load thrashing issues. Programs with large working sets can either be granted a case by case exception (via rlimit), or, more likely just have their pages thrashed out more often. You only ever need to do this when you have exhausted memory to the point you are swapping, and then only when you want to reap cached clean pages; when all you have left is dirty pages in memory and swap, you are well and truly thrashing -- for the right reason: your system load is too high. It's also relatively easy to implement something like a per vnode working set quota, which can be self-enforced, without making the scheduler so ugly that you will never be able to do things like have per-CPU run queues for a very efficient SMP that deals with the cache locality issue naturally and easily (by merely setting migration policies for moving from one run queue to another, and by threads in a thread group having negative affinity for each other's CPUs, to maximize real concurrency). Psuedo code: IF THRASH_CONDITIONS IF (COPY_ON_WRITE_FAULT OR PAGE_FILL_OF_SBRKED_PAGE_FAULT) IF VNODE_OVER_WORKING_SET_QUOTA STEAL_PAGE_FROM_VNODE_LRU ELSE GET_PAGE_FROM_SYSTEM Obviously, this would work for vnodes that were acting as backing store for programs, just as they would prevent a large mmap() with a traversal from thrashing everyone else's data and code out of core (which is, I think, a much worse and much more common problem). Doing extremely complicated things is only going to get you into trouble... in particular, you don't want to have policy in effect to deal with border load conditions unless you are under those conditions in the first place. The current scheduling algorithms are quite simple, relatively speaking, and it makes much more sense to make the thrasher fight with themselves, rather than them peeing in everyone's pool. I think that badly written programs taking more time, as a result, is not a problem; if it is, it's one I could live with much more easily than cache-busting for no good reason, and slowing well behaved code down. 
You need to penalize the culprit. It's possible to do a more complicated working set quota, which actually applies to a process' working set, instead of to vnodes, out of context with the process, but I think that the vnode approach, particularly when you bump the working set up per each additional opener, using the count I suggested, to ensure proper locality of reference, is good enough to solve the problem. At the very least, the system would not "freeze" with this approach, even if it could later recover. -- Terry -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
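Restating the pseudocode above in C makes the shape a little clearer (all names are invented and the two helpers are left as declarations; this is the proposal as sketched, not an existing implementation):

    struct vm_page;                                /* opaque here */

    struct vnode_ws {
        long resident_pages;    /* pages of this vnode currently in core   */
        long ws_quota;          /* base working set quota for the vnode    */
        int  openers;           /* each additional opener bumps the quota  */
    };

    struct vm_page *steal_page_from_vnode_lru(struct vnode_ws *vp);
    struct vm_page *get_page_from_system(void);

    static struct vm_page *
    fault_alloc_page(struct vnode_ws *backing, int thrash_conditions)
    {
        /* Only bite when the system is already short on memory. */
        if (thrash_conditions &&
            backing->resident_pages > backing->ws_quota * (1 + backing->openers)) {
            /* Over quota: recycle one of this vnode's own LRU pages,
             * so the offender pays for its own faults. */
            return steal_page_from_vnode_lru(backing);
        }
        return get_page_from_system();
    }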
* Re: on load control / process swapping 2001-05-15 6:38 ` Terry Lambert @ 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group 2001-05-15 15:31 ` Rik van Riel 2001-05-15 17:24 ` Matt Dillon 2 siblings, 0 replies; 39+ messages in thread From: Cy Schubert - ITSD Open Systems Group @ 2001-05-15 13:39 UTC (permalink / raw) To: tlambert2; +Cc: Rik van Riel, Matt Dillon, arch, linux-mm, sfkaplan In message <3B00CECF.9A3DEEFA@mindspring.com>, Terry Lambert writes: > Rik van Riel wrote: > > So we should not allow just one single large job to take all > > of memory, but we should allow some small jobs in memory too. > > Historically, this problem is solved with a "working set > quota". > > > If you don't do this very slow swapping, NONE of the big tasks > > will have the opportunity to make decent progress and the system > > will never get out of thrashing. > > > > If we simply make the "swap time slices" for larger processes > > larger than for smaller processes we: > > > > 1) have a better chance of the large jobs getting any work done > > 2) won't have the large jobs artificially increase memory load, > > because all time will be spent removing each other's RSS > > 3) can have more small jobs in memory at once, due to 2) > > 4) can be better for interactive performance due to 3) > > 5) have a better chance of getting out of the overload situation > > sooner > > > > I realise this would make the scheduling algorithm slightly > > more complex and I'm not convinced doing this would be worth > > it myself, but we may want to do some brainstorming over this ;) > > A per vnode working set quota with a per use count adjust > would resolve most load thrashing issues. Programs with > large working sets can either be granted a case by case > exception (via rlimit), or, more likely just have their > pages thrashed out more often. > > You only ever need to do this when you have exhausted > memory to the point you are swapping, and then only when > you want to reap cached clean pages; when all you have > left is dirty pages in memory and swap, you are well and > truly thrashing -- for the right reason: your system load > is too high. An operating system I worked on at one time, MVS, had this feature (not sure whether it still does today). We called it fencing (e.g. fencing an address space). An address space could be limited to the amount of real memory used. Conversely, important address spaces could be given a minimum amount of real memory, e.g. online applications such a CICS. Additionally instead of limiting an address space to a minimum or maximum amount of real memory, an address space could be limited to a maximum paging rate, giving the O/S the option of increasing its real memory to match its WSS, reducing paging of the specified address space to a preset limit. Of course this could have negative impact on other applications running on the system, which is why IBM recommended against using this feature. Regards, Phone: (250)387-8437 Cy Schubert Fax: (250)387-5766 Team Leader, Sun/Alpha Team Internet: Cy.Schubert@osg.gov.bc.ca Open Systems Group, ITSD, ISTA Province of BC -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 6:38 ` Terry Lambert 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group @ 2001-05-15 15:31 ` Rik van Riel 2001-05-15 17:24 ` Matt Dillon 2 siblings, 0 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-15 15:31 UTC (permalink / raw) To: Terry Lambert; +Cc: Matt Dillon, arch, linux-mm, sfkaplan On Mon, 14 May 2001, Terry Lambert wrote: > Rik van Riel wrote: > > So we should not allow just one single large job to take all > > of memory, but we should allow some small jobs in memory too. > > Historically, this problem is solved with a "working set > quota". This is a great idea for when the system is in-between normal loads and real thrashing. It will save small processes while slowing down memory hogs which are taking resources fairly. I'm not convinced it is any replacement for swapping, but it sure is a good way to delay swapping as long as possible. Also, having a working set size guarantee in combination with idle swapping will almost certainly give the proverbial root shell the boost it needs ;) > Doing extremely complicated things is only going to get > you into trouble... in particular, you don't want to > have policy in effect to deal with border load conditions > unless you are under those conditions in the first place. Agreed. > It's possible to do a more complicated working set quota, > which actually applies to a process' working set, instead > of to vnodes, out of context with the process, I guess in FreeBSD a per-vnode approach would be easier to implement while in Linux a per-process working set would be easier... regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 6:38 ` Terry Lambert 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group 2001-05-15 15:31 ` Rik van Riel @ 2001-05-15 17:24 ` Matt Dillon 2001-05-15 23:55 ` Roger Larsson 2001-05-16 8:23 ` Terry Lambert 2 siblings, 2 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-15 17:24 UTC (permalink / raw) To: Terry Lambert; +Cc: Rik van Riel, arch, linux-mm, sfkaplan :Rik van Riel wrote: :> So we should not allow just one single large job to take all :> of memory, but we should allow some small jobs in memory too. : :Historically, this problem is solved with a "working set :quota". We have a process-wide working set quota. It's called the 'memoryuse' resource. :... :> 5) have a better chance of getting out of the overload situation :> sooner :> :> I realise this would make the scheduling algorithm slightly :> more complex and I'm not convinced doing this would be worth :> it myself, but we may want to do some brainstorming over this ;) : :A per vnode working set quota with a per use count adjust :would resolve most load thrashing issues. Programs with It most certainly would not. Limiting the number of pages you allow to be 'cached' on a vnode by vnode basis would be a disaster. It has absolutely nothing whatsoever to do with thrashing or thrash-management. It would simply be an artificial limitation based on artificial assumptions that are as likely to be wrong as right. If I've learned anything working on the FreeBSD VM system, it's that the number of assumptions you make in regards to what programs do, how they do it, how much data they should be able to cache, and so forth is directly proportional to how badly you fuck up the paging algorithms. I implemented a special page-recycling algorithm in 4.1/4.2 (which is still there in 4.3). Basically it tries predict when it is possible to throw away pages 'behind' a sequentially accessed file, so as not to allow that file to blow away your cache. E.G. if you have 128M of ram and you are sequentially accessing a 200MB file, obviously there is not much point in trying to cache the data as you read it. But being able to predict something like this is extremely difficult. In fact, nearly impossible. And without being able to make the prediction accurately you simply cannot determine how much data you should try to cache before you begin recycling it. I wound up having to change the algorithm to act more like a heuristic -- it does a rough prediction but doesn't hold the system to it, then allows the page priority mechanism to refine the prediction. But it can take several passes (or non-passes) on the file before the page recycling stabilizes. So the jist of the matter is that FreeBSD (1) already has process-wide working set limitations which are activated when the system is under load, and (2) already has a heuristic that attempts to predict when not to cache pages. Actually several heuristics (a number of which were in place in the original CSRG code). -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
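The read-behind recycling heuristic described above boils down to roughly the following (an illustrative sketch of the behaviour, not the 4.x code; the run-length threshold is invented):

    #include <stdbool.h>

    struct seqinfo {
        long last_pindex;      /* previous page index touched in this file  */
        int  seq_run;          /* length of the current sequential run      */
    };

    /*
     * Heuristic, not a hard rule: after a long enough sequential run,
     * pages *behind* the read point are made cheap to recycle instead of
     * being cached aggressively.  Page priority can still rescue them
     * later if the file turns out to be re-read.
     */
    static bool recycle_behind(struct seqinfo *si, long pindex)
    {
        if (pindex == si->last_pindex + 1)
            si->seq_run++;
        else
            si->seq_run = 0;
        si->last_pindex = pindex;
        return si->seq_run > 16;        /* invented threshold */
    }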
* Re: on load control / process swapping 2001-05-15 17:24 ` Matt Dillon @ 2001-05-15 23:55 ` Roger Larsson 2001-05-16 0:16 ` Matt Dillon 2001-05-16 8:23 ` Terry Lambert 1 sibling, 1 reply; 39+ messages in thread From: Roger Larsson @ 2001-05-15 23:55 UTC (permalink / raw) To: Matt Dillon; +Cc: Rik van Riel, arch, linux-mm, sfkaplan On Tuesday 15 May 2001 19:24, Matt Dillon wrote: > I implemented a special page-recycling algorithm in 4.1/4.2 (which is > still there in 4.3). Basically it tries predict when it is possible to > throw away pages 'behind' a sequentially accessed file, so as not to > allow that file to blow away your cache. E.G. if you have 128M of ram > and you are sequentially accessing a 200MB file, obviously there is > not much point in trying to cache the data as you read it. > > But being able to predict something like this is extremely difficult. > In fact, nearly impossible. And without being able to make the > prediction accurately you simply cannot determine how much data you > should try to cache before you begin recycling it. I wound up having > to change the algorithm to act more like a heuristic -- it does a rough > prediction but doesn't hold the system to it, then allows the page > priority mechanism to refine the prediction. But it can take several > passes (or non-passes) on the file before the page recycling > stabilizes. > Are the heuristics persistent? Or will the first use after boot use the rough prediction? For how long time will the heuristic stick? Suppose it is suddenly used in a slightly different way. Like two sequential readers instead of one... /RogerL -- Roger Larsson Skelleftea Sweden -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 23:55 ` Roger Larsson @ 2001-05-16 0:16 ` Matt Dillon 0 siblings, 0 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-16 0:16 UTC (permalink / raw) To: Roger Larsson; +Cc: Rik van Riel, arch, linux-mm, sfkaplan :Are the heuristics persistent? :Or will the first use after boot use the rough prediction? :For how long time will the heuristic stick? Suppose it is suddenly used in :a slightly different way. Like two sequential readers instead of one... : :/RogerL :Roger Larsson :Skelleftea :Sweden It's based on the VM page cache, so its adaptive over time. I wouldn't call it persistent, it is nothing more then a simple heuristic that 'normally' throws a page away but 'sometimes' caches it. In otherwords, you lose some performance on the frontend in order to gain some later on. If you loop through a file enough times, most of the file winds up getting cached. It's still experimental so it is only lightly tied into the system. It seems to work, though, so at some point in the future I'll probably try to put some significant prediction in. But as I said, it's a very difficult thing to predict. You can't just put your foot down and say 'I'll cache X amount of file Y'. That doesn't work at all. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 17:24 ` Matt Dillon 2001-05-15 23:55 ` Roger Larsson @ 2001-05-16 8:23 ` Terry Lambert 2001-05-16 17:26 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Terry Lambert @ 2001-05-16 8:23 UTC (permalink / raw) To: Matt Dillon; +Cc: Rik van Riel, arch, linux-mm, sfkaplan Matt Dillon wrote: > :> So we should not allow just one single large job to take all > :> of memory, but we should allow some small jobs in memory too. > : > :Historically, this problem is solved with a "working set > :quota". > > We have a process-wide working set quota. It's called > the 'memoryuse' resource. It's not terrifically useful for limiting pageout as a result of excessive demand pagein operations. > :A per vnode working set quota with a per use count adjust > :would resolve most load thrashing issues. Programs with > > It most certainly would not. Limiting the number of pages > you allow to be 'cached' on a vnode by vnode basis would > be a disaster. I don't know whether to believe you, or Dave Cutler... 8-). > It has absolutely nothing whatsoever to do with thrashing > or thrash-management. It would simply be an artificial > limitation based on artificial assumptions that are as > likely to be wrong as right. I have a lot of problems with most of FreeBSD's anti-thrash "protection"; I don't think many people are really running it at a very high load. I think a lot of the "administrative limits" are stupid; in particular, I think it's really dumb to have 70% free resources, and yet enforce administrative limits as if all machines were shell account servers at an ISP where the customers are just waiting for the operators to turn their heads for a second so they can run 10,000 IRC "bots". I also have a problem with the preallocation of contiguous pageable regions of real memory via zalloci() in order to support inpcb and tcpcb structures, which inherently mean that I have to statically preallocate structures for IPs, TCP structures, and sockets, as well as things like file descriptors. In other words, I have to guess the future characteristics of my load, rather than having the OS do the best it can in any given situation. Not to mention the allocation of an entire mbuf per socket. > If I've learned anything working on the FreeBSD VM > system, it's that the number of assumptions you make > in regards to what programs do, how they do it, how > much data they should be able to cache, and so forth > is directly proportional to how badly you fuck up the > paging algorithms. I've personally experienced thrash from a moronic method of implementing "ld", which mmap's all the .o files, and then seeks all over heck, randomly, in order to perform the actual link. It makes that specific operation very fast, at the expense of the rest of the system. The result of this is that everything else on the system gets thrashed out of core, including the X server, and the very simple and intuitive "move mouse, wiggle cursor" breaks, which then breaks the entire paradigm. FreeBSD is succeptible to this problem. So was SVR4 UNIX. The way SVR4 "repaired" the problem was to invent a new scheduling class, "fixed", which would guarantee time slices to the X server. Thus, as fast as "ld" thrashed pages it wasn't interested in out, "X" thrashed them back in. The interactive experience was degraded by the excessive paging. 
I implemented a different approach in UnixWare 2.x; it didn't end up making it into the main UnixWare source tree (I was barely able to get my /procfs based rfork() into the thing, with the help of some good engineers from NJ); but it was a per vnode working set quota approach. It operated in much the way I described, and it fixed the problem: the only program that got thrashed by "ld" was "ld": everything else on the system had LRU pages present when the needed to run. The "ld" program wasn't affected itself until you started running low on buffer cache. IMO, anything that results in the majority of programs remaining reasonably runnable, and penalizes only the programs making life hell for everyone else, and only kicks in when life is truly starting to go to hell, is a good approach. I really don't care that I got the idea from Dave Cutler's work in VMS, instead of arriving at it on my own (those the per-vnode nature of mine is, I think, an historically unique approach). > I implemented a special page-recycling algorithm in > 4.1/4.2 (which is still there in 4.3). Basically it > tries predict when it is possible to throw away pages > 'behind' a sequentially accessed file, so as not to > allow that file to blow away your cache. E.G. if you > have 128M of ram and you are sequentially accessing a > 200MB file, obviously there is not much point in trying > to cache the data as you read it. IMO, the ability to stream data like this is why Sun, in Solaris 2.8, felt the need to "invent" seperate VM and buffer caches once again -- "everything old is new again". Also, IMO, I feel that the rationale used to justify this decision was poorly defended, and that there are much better implementations one could have -- including simple red queueing for large data sets. It was a cop out on their part, having to do with not setting up simple high and low water marks to keep things like a particular FS or networking subsystem from monopolizing memory. Instead, they now have this artificial divide, where under typical workloads, one pool lies largely fallow (which one depends on the server role). I guess that's not a problem, if your primary after market marked up revenue generation sale item is DRAM... If the code you are referring to is the code that I think it is, I don't think it's useful, except for something like a web server with large objects to serve. Even then, discarding the entire concept of locality of reference when you notice sequential access seems bogus. Realize that average web server service objects are on the order of 10k, not 200M. Realize also the _absolutely disasterous_ effect that code kicking in would have on, for example, an FTP server immediately after the release of FreeBSD ISO images to the net. You would basically not cache that data which is your primary hottest content -- turning virtually assured cache hits into cache misses. > But being able to predict something like this is > extremely difficult. In fact, nearly impossible. I would say that it could be reduced to a stochiastic and iterative process, but (see above), that it would be a terrible idea for all but something like a popular MP3 server... even then, you start discarding useful data under burst loads, and we're back to cache missing. > And without being able to make the prediction > accurately you simply cannot determine how much data > you should try to cache before you begin recycling it. I should think that would be obvious: nearly everything you can, based on locality and number of concurrent references. 
It's only when you attempt prefetch that it actually becomes complicated; deciding to throw away a clean page later instead of _now_ costs you practically nothing. > So the jist of the matter is that FreeBSD (1) already > has process-wide working set limitations which are > activated when the system is under load, They are largely useless, since they are also active even when the system is not under load, so they act as preemptive drags on performance. They are also (as was pointed out in an earlier thread) _not_ applied to mmap() and other regions, so they are easily subverted. > and (2) already has a heuristic that attempts to predict > when not to cache pages. Actually several heuristics (a > number of which were in place in the original CSRG code). I would argue that the CPU vs. memory vs. disk speed pendulum is moving back the other way, and that it's time to reconsider these algorithms once again. If it's done correctly, they would be adaptive based on knowing the data rate for each given subsystem. We have gigabit NICs these days, which can fully monopolize a PCI bus very easily with few cards -- doing noting but network I/O at burst rate on a 66MHz 64 bit PCI bus, thing max out at 4 cards -- and that's if you can get them to transfer the data directly to each other, with no host intervention being required, which you can't. The fastest memory bus I've seen in Intel calls hardware is 133MHz; at 64 bits, that's twice as fast as the 64bit 66MHz PCI bus. Disks are pig-slow comparatively; in all cases, they're going to be limited to the I/O bus speed anyway, and as rotational speeds have gone up, seek latency has failed to keep pace. Most fast IDE ("multimedia") drives still turn off thermal recalibration in order to keep streaming. I think you need to stress a system -- really stress it, so that you are hitting some hardware limit because of the way FreeBSD uses the hardware -- in order to understand where the real problems in FreeBSD lie. Otherwise, it's just like profiling a program over a tiny workload: the actual cost of servicing real work get lost in the costs associated with initialization. It's pretty obvious from some of the recent bugs I've run into that no one has attempted to open more than 32767 sockets in a production environment using a FreeBSD system. It's also obvious that no one has attempted to have more than 65535 client connections open on a FreeBSD box. There are similar (obvious in retrospect) problems in the routing and other code (what is with the alias requirement for a 255.255.255.255 netmask, for example? Has no one heard of VLANs, without explicit VLAN code?). The upshot is that things are failing to scale under a number of serious stress loads, and rather than defending the past, we should be looking at fixing the problems. I'm personally very happy to have the Linux geeks interested in covering this territory cooperatively with the FreeBSD geeks. We need to be clever about causing scaling problems, and more clever about fixing them, IMO. -- Terry -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
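For what it's worth, the "four cards" figure is consistent with peak numbers (rough arithmetic, ignoring bus and protocol overhead): a 64-bit, 66 MHz PCI bus moves at most 66e6 x 8 bytes, about 528 MB/s, while one gigabit NIC streams about 125 MB/s in a single direction; four cards at line rate therefore account for roughly 500 MB/s, essentially the whole bus.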
* Re: on load control / process swapping 2001-05-16 8:23 ` Terry Lambert @ 2001-05-16 17:26 ` Matt Dillon 0 siblings, 0 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-16 17:26 UTC (permalink / raw) To: Terry Lambert; +Cc: Rik van Riel, arch, linux-mm, sfkaplan :I think a lot of the "administrative limits" are stupid; :in particular, I think it's really dumb to have 70% free :resources, and yet enforce administrative limits as if all :... The 'memoryuse' resource limit is not enforced unless the system is under memory pressure. :... :> And without being able to make the prediction :> accurately you simply cannot determine how much data :> you should try to cache before you begin recycling it. : :I should think that would be obvious: nearly everything :you can, based on locality and number of concurrent :references. It's only when you attempt prefetch that it :actually becomes complicated; deciding to throw away a :clean page later instead of _now_ costs you practically :nothing. :... Prefetching has nothing to do with what we've been talking about. We don't have a problem caching prefetched pages that aren't used. The problem we have is determining when to throw away data once it has been used by a program. :... :> So the jist of the matter is that FreeBSD (1) already :> has process-wide working set limitations which are :> activated when the system is under load, : :They are largely useless, since they are also active even :when the system is not under load, so they act as preemptive :... This is not true. Who told you this? This is absolutely not true. :drags on performance. They are also (as was pointed out in :an earlier thread) _not_ applied to mmap() and other regions, :so they are easily subverted. :... : :-- Terry : This is not true. The 'memoryuse' limit applies to all in-core pages associated with the process, whether mmap()'d or not. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-07 22:50 ` Matt Dillon 2001-05-07 23:35 ` Rik van Riel @ 2001-05-08 20:52 ` Kirk McKusick 2001-05-09 0:18 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Kirk McKusick @ 2001-05-08 20:52 UTC (permalink / raw) To: Matt Dillon; +Cc: Rik van Riel, arch, linux-mm, sfkaplan I know that FreeBSD will swap out sleeping processes, but will it ever swap out running processes? The old BSD VM system would do so (we called it hard swapping). It is possible to get a set of running processes that simply do not all fit in memory, and the only way for them to make forward progress is to cycle them through memory. As to the size issue, we used to be biased towards the processes with large resident set sizes in kicking things out. In general, swapping out small things does not buy you much memory and it annoys more users. To avoid picking on the biggest, each time we needed to kick something out, we would find the five biggest, and kick out the one that had been memory resident the longest. The effect is to go round-robin among the big processes. Note that this algorithm allows you to kick out shells, if they are the biggest processes. Also note that this is a last ditch algorithm used only after there are no more idle processes available to kick out. Our decision that we had had to kick out running processes was: (1) no idle processes available to swap, (2) load average over one (if there is just one process, kicking it out does not solve the problem :-), (3) paging rate above a specified threshhold over the entire previous 30 seconds (e.g., been bad for a long time and not getting better in the short term), and (4) paging rate to/from swap area using more than half the available disk bandwidth (if your filesystems are on the same disk as you swap areas, you can get a false sense of success because all your process stop paging while they are blocked waiting for their file data. Kirk -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
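The old hard-swapping policy described above maps onto a fairly small victim-selection routine. A sketch of it as described (illustrative, not the historical 4BSD code):

    #include <stddef.h>

    struct proc {
        long  rssize;          /* resident set size               */
        long  swtime;          /* seconds since last brought in   */
        struct proc *next;
    };

    /* Among the five biggest resident processes, evict the one that has
     * been memory resident the longest -- round-robin among the big ones. */
    static struct proc *pick_hardswap_victim(struct proc *allproc)
    {
        struct proc *big[5] = { 0 };

        for (struct proc *p = allproc; p != NULL; p = p->next) {
            /* keep the five largest by rssize (simple insertion) */
            for (int i = 0; i < 5; i++) {
                if (big[i] == NULL || p->rssize > big[i]->rssize) {
                    for (int j = 4; j > i; j--)
                        big[j] = big[j - 1];
                    big[i] = p;
                    break;
                }
            }
        }

        struct proc *victim = NULL;
        for (int i = 0; i < 5; i++)
            if (big[i] && (!victim || big[i]->swtime > victim->swtime))
                victim = big[i];
        return victim;
    }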
* Re: on load control / process swapping 2001-05-08 20:52 ` Kirk McKusick @ 2001-05-09 0:18 ` Matt Dillon 2001-05-09 2:07 ` Peter Jeremy 2001-05-12 14:28 ` Rik van Riel 0 siblings, 2 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-09 0:18 UTC (permalink / raw) To: Kirk McKusick; +Cc: Rik van Riel, arch, linux-mm, sfkaplan I looked at the code fairly carefully last night... it doesn't swap out running processes and it also does not appear to swap out processes blocked in a page-fault (on I/O). Now, of course we can't swap a process out right then (it might be holding locks), but I think it would be beneficial to be able to mark the process as 'requesting a swapout on return to user mode' or something like that. At the moment what gets picked for swapping is hit-or-miss due to the wait states. :As to the size issue, we used to be biased towards the processes :with large resident set sizes in kicking things out. In general, :swapping out small things does not buy you much memory and it The VM system does enforce the 'memoryuse' resource limit when the memory load gets heavy. But once the load goes beyond that the VM system doesn't appear to care how big the process is. :... :biggest processes. Also note that this is a last ditch algorithm :used only after there are no more idle processes available to :kick out. Our decision that we had had to kick out running :processes was: (1) no idle processes available to swap, (2) load :average over one (if there is just one process, kicking it out :does not solve the problem :-), (3) paging rate above a specified :threshhold over the entire previous 30 seconds (e.g., been bad :for a long time and not getting better in the short term), and :(4) paging rate to/from swap area using more than half the :available disk bandwidth (if your filesystems are on the same :disk as you swap areas, you can get a false sense of success :because all your process stop paging while they are blocked :waiting for their file data. : : Kirk I don't think we want to kick out running processes. Thrashing by definition means that many of the processes are stuck in disk-wait, usually from a VM fault, and not running. The other effect of thrashing is, of course, the the cpu idle time goes way up due to all the process stalls. A process that is actually able to run under these circumstances probably has a small run-time footprint (at least for whatever operation it is currently doing), so it should definitely be allowed to continue to run. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-09 0:18 ` Matt Dillon @ 2001-05-09 2:07 ` Peter Jeremy 2001-05-09 19:41 ` Matt Dillon 2001-05-12 14:28 ` Rik van Riel 1 sibling, 1 reply; 39+ messages in thread From: Peter Jeremy @ 2001-05-09 2:07 UTC (permalink / raw) To: Matt Dillon; +Cc: Kirk McKusick, Rik van Riel, arch, linux-mm, sfkaplan On 2001-May-08 17:18:16 -0700, Matt Dillon <dillon@earth.backplane.com> wrote: > I don't think we want to kick out running processes. Thrashing > by definition means that many of the processes are stuck in > disk-wait, usually from a VM fault, and not running. The other > effect of thrashing is, of course, the the cpu idle time goes way > up due to all the process stalls. A process that is actually able > to run under these circumstances probably has a small run-time footprint > (at least for whatever operation it is currently doing), so it should > definitely be allowed to continue to run. I don't think this follows. A program that does something like: { extern char memory[BIG_NUMBER]; int i; for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE) memory[i]++; } will thrash nicely (assuming BIG_NUMBER is large compared to the currently available physical memory). Occasionally, it will be runnable - at which stage it has a footprint of only two pages, but after executing a couple of instructions, it'll have another page fault. Old pages will remain resident for some time before they age enough to be paged out. If the VM system is stressed, swapping this process out completely would seem to be a win. Whilst this code is artificial, a process managing a very large hash table will have similar behaviour. Given that most (all?) recent CPU's have cheap hi-resolution clocks, would it be worthwhile for the VM system to maintain a per-process page fault rate? (average clock cycles before a process faults). If you ignore spikes due to process initialisation etc, a process that faults very quickly after being given the CPU wants a working set size that is larger than the VM system currently allows. The fault rate would seem to be proportional to the ratio between the wanted WSS and allowed RSS. This would seem to be a useful parameter to help decide which process to swap out - in an ideal world the VM subsystem would swap processes to keep the WSS of all in-core processes at about the size of non-kernel RAM. Peter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
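The bookkeeping proposed above is cheap; a sketch of what it might look like (illustrative; the cycle-counter helper is only declared and the averaging weight is invented):

    #include <stdint.h>

    struct fault_rate {
        uint64_t last_fault_tsc;   /* cycle count at the previous fault      */
        uint64_t avg_run_cycles;   /* EWMA of cycles run between faults      */
    };

    uint64_t read_cycle_counter(void);     /* rdtsc or equivalent, not shown */

    /* Called from the page fault path: a persistently small average means
     * the process faults almost as soon as it gets the CPU, i.e. its wanted
     * working set is larger than the RSS it is being allowed. */
    static void note_fault(struct fault_rate *fr)
    {
        uint64_t now = read_cycle_counter();
        uint64_t ran = now - fr->last_fault_tsc;

        fr->last_fault_tsc = now;
        fr->avg_run_cycles = (fr->avg_run_cycles * 7 + ran) / 8;
    }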
* Re: on load control / process swapping 2001-05-09 2:07 ` Peter Jeremy @ 2001-05-09 19:41 ` Matt Dillon 0 siblings, 0 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-09 19:41 UTC (permalink / raw) To: Peter Jeremy; +Cc: Kirk McKusick, Rik van Riel, arch, linux-mm, sfkaplan :I don't think this follows. A program that does something like: :{ : extern char memory[BIG_NUMBER]; : int i; : : for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE) : memory[i]++; :} :will thrash nicely (assuming BIG_NUMBER is large compared to the :currently available physical memory). Occasionally, it will be :runnable - at which stage it has a footprint of only two pages, but Why only two pages? It looks to me like the footprint is BIG_NUMBER bytes. :after executing a couple of instructions, it'll have another page :fault. Old pages will remain resident for some time before they age :enough to be paged out. If the VM system is stressed, swapping this :process out completely would seem to be a win. Not exactly. Page aging works both ways. Just accessing a page once does not give it priority over everything else in the page queues. :... :you ignore spikes due to process initialisation etc, a process that :faults very quickly after being given the CPU wants a working set size :that is larger than the VM system currently allows. The fault rate :would seem to be proportional to the ratio between the wanted WSS and :allowed RSS. This would seem to be a useful parameter to help decide :which process to swap out - in an ideal world the VM subsystem would :swap processes to keep the WSS of all in-core processes at about the :size of non-kernel RAM. : :Peter Fault rate isn't useful -- maybe faults that require large disk seeks would be useful, but just counting the faults themselves is not useful. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-09 0:18 ` Matt Dillon 2001-05-09 2:07 ` Peter Jeremy @ 2001-05-12 14:28 ` Rik van Riel 1 sibling, 0 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-12 14:28 UTC (permalink / raw) To: Matt Dillon; +Cc: Kirk McKusick, arch, linux-mm, sfkaplan On Tue, 8 May 2001, Matt Dillon wrote: > :I know that FreeBSD will swap out sleeping processes, but will it > :ever swap out running processes? The old BSD VM system would do so > :(we called it hard swapping). It is possible to get a set of running > :processes that simply do not all fit in memory, and the only way > :for them to make forward progress is to cycle them through memory. > > I looked at the code fairly carefully last night... it doesn't > swap out running processes and it also does not appear to swap > out processes blocked in a page-fault (on I/O). Now, of course > we can't swap a process out right then (it might be holding locks), > but I think it would be beneficial to be able to mark the process > as 'requesting a swapout on return to user mode' or something > like that. In the (still very rough) swapping code for Linux I simply do this as "swapout on next pagefault". The idea behind that is: 1) it's easy, at a page fault we know we can suspend the process 2) if we're thrashing, we want every process to make as much progress as possible before it's suspended (swapped out), letting the process run until the next page fault means we will never suspend a process while it's still able to make progress 3) thrashing should be a rare situation, so you don't want to complicate fast-path code like "return to userspace"; instead we make sure to have as little impact on the rest of the kernel code as possible regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
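In fault-handler terms the approach above is just a flag check at a point where suspension is already safe (an illustrative sketch of the idea, not the actual patch):

    struct task {
        int swapout_pending;        /* set by the load control code */
    };

    void suspend_and_swap_out(struct task *t);     /* sleeps until resumed */
    int  handle_mm_fault(struct task *t, unsigned long address);

    int page_fault(struct task *t, unsigned long address)
    {
        /* At a page fault the task holds no locks we care about and cannot
         * make further progress anyway, so a pending swap-out can be
         * honoured here without touching any fast path. */
        if (t->swapout_pending)
            suspend_and_swap_out(t);
        return handle_mm_fault(t, address);
    }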
* Re: on load control / process swapping 2001-05-07 21:16 Rik van Riel 2001-05-07 22:50 ` Matt Dillon @ 2001-05-08 12:25 ` Scott F. Kaplan 1 sibling, 0 replies; 39+ messages in thread From: Scott F. Kaplan @ 2001-05-08 12:25 UTC (permalink / raw) To: linux-mm -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Okay, in responding to this topic, I will issue a warning: I'm looking at this from an academic point of view, and probably won't give as much attention to what is reasonable to engineer as some people might like. That said, I think I might have some useful thoughts...y'all can be the judge of that. On Mon, 7 May 2001, Rik van Riel wrote: > In short, the process suspension / wake up code only does > load control in the sense that system load is reduced, but > absolutely no effort is made to ensure that individual > programs can run without thrashing. This, of course, kind of > defeats the purpose of doing load control in the first place. First, I agree -- To suspend a process without any calculation that will indicate that the suspension will reduce the page fault rate is to operate blindly. Performing such a calculation, though, requires some information about the locality characteristics of each process, based on recent reference behavior. What would be really nice is some indication as to how much additional space would reduce paging for each of the processes that will remain active. For some, a little extra space won't help much, and for others, a little extra space is just what it needs for a significant reduction. Determining which processes are which, and just how much "a little extra" needs to be, seems important in this context. Second, a nit pick: We're using the term "thrashing" in so many ways that it would be nice to standardize on something so that we understand one another. As I understand it, the textbook definition of thrashing is the point at which CPU utilization falls because all active processes are I/O bound. That is, thrashing is a system-wide characteristic, and not applicable to individual processes. That's why some people have pointed out that "thrashing" and "heavy paging" aren't the same thing. A single process can cause heavy paging while the CPU is still fully loaded with the work of other processes. So, given the paragraph above, are you talking a single process that may still be paging heavily, in spite of the additional free space created by process suspension? (Like I said, it was a nit pick.) I'm assuming that's what you mean. > Any solution will have to address the following points: > > 1) allow the resident processes to stay resident long > enough to make progess Seems reasonable. > 2) make sure the resident processes aren't thrashing, > that is, don't let new processes back in memory if > none of the currently resident processes is "ready" > to be suspended What does it mean to be ready to be suspended? I'm confused by this one. > 3) have a mechanism to detect thrashing in a VM > subsystem which isn't rate-limited (hard?) What's your definition of "thrashing" here? If it's the system-wide version, detection of this situation doesn't seem to be too difficult: When all processes are stalled on page faults, and that situation obtains over time recently, then the system is thrashing. Detecting whether or not a single process is thrashing (paging hopelessly) is a different matter. 
Detecting whether or not a single process is thrashing (paging
hopelessly) is a different matter. You could deactivate this process
(or some other, in the hope of helping this one), but it could be the
case that a reallocation of space would stop this process from paging
so heavily while not substantially increasing the paging rate of any
other process.

> and, for extra brownie points:
>
> 4) fairness, small processes can be paged in and out
>    faster, so we can suspend&resume them faster; this
>    has the side effect of leaving the proverbial root
>    shell more usable

I think this point should have greater significance. The very issue
at hand is that fairness and throughput are at odds when there is
contention for memory. The central question (I think) is, "Given
paging sufficiently detrimental to progress, *how* unfair should the
system be in order to restore progress and increase throughput?"
Note that if we want increased throughput, we can easily come up with
a scheme that almost completely throws fairness to the wind, and
we'll get great reductions in total paging and increases in process
throughput. For a time-sharing system, though, there should probably
be a limit to the unfairness.

There has never been a really good solution to this kind of problem,
and there seem to be two important sides to it:

1) Given a level of fairness that you want to maintain, how can you
   keep the paging as low as possible?

2) Given the unfairness you're willing to use, how can you select
   eligible processes intelligently so as to maximize the reduction
   in total paging?

Question 1 is related, and an important problem, but not part of the
issue here. Question 2 seems to be the central question, and a hard
one. I have trouble believing that any solution to Question 2 will
make sense if it does not refer directly to the reference behavior of
both the suspended process and the remaining active processes. I also
have trouble with any solution to Question 2 that doesn't take into
account the cost associated with the deactivation and reactivation
steps. When a process is reactivated, it's going to cause substantial
paging activity, and so it should not be done too frequently. If
you're going to be unfair, then leave the deactivated process out
long enough that the cost of paging it back in will be a small
fraction of the total time spent on the deactivation/reactivation
activities.

I hope these are useful thoughts. Despite all of my complaining here,
I think this problem has been insufficiently addressed for a long
time. The Working Set policy counted on it, but there was never a
study that showed a good strategy for deactivation/reactivation, in
spite of the fact that different choices could significantly affect
the results. I'd like very much to see a solution to this particular
problem.

Scott
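The amortization rule at the end of Scott's message (keep a deactivated process out long enough that reloading it costs only a small fraction of the whole deactivate/reactivate cycle) can be put into rough numbers. Every figure below is made up purely for illustration: the resident-set size, the per-fault service time, and the 5% overhead target are assumptions, not measurements.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed, illustrative numbers only. */
        double resident_set_pages = 2000.0;  /* pages to fault back in      */
        double fault_service_ms   = 8.0;     /* avg disk service per fault  */
        double overhead_fraction  = 0.05;    /* tolerate 5% reload overhead */

        double reload_cost_ms = resident_set_pages * fault_service_ms;

        /* overhead_fraction = reload / (suspension + reload)
         * =>  suspension = reload * (1 - f) / f                     */
        double min_suspension_ms =
            reload_cost_ms * (1.0 - overhead_fraction) / overhead_fraction;

        printf("reload cost        : %.0f ms\n", reload_cost_ms);
        printf("minimum suspension : %.0f ms (~%.0f s)\n",
               min_suspension_ms, min_suspension_ms / 1000.0);
        return 0;
    }

With these invented numbers, a process whose resident set takes about 16 seconds to fault back in should stay suspended for roughly five minutes before the reload cost drops to 5% of the cycle, which is one way to see why deactivation decisions should not flip back and forth quickly.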
Thread overview: 39+ messages
2001-05-16 15:17 on load control / process swapping Charles Randall
2001-05-16 17:14 ` Matt Dillon
2001-05-16 17:41 ` Rik van Riel
2001-05-16 17:54 ` Matt Dillon
2001-05-16 19:59 ` Rik van Riel
2001-05-16 20:41 ` Matt Dillon
2001-05-18 5:58 ` Terry Lambert
2001-05-18 6:20 ` Matt Dillon
2001-05-18 10:00 ` Andrew Reilly
2001-05-18 13:49 ` Jonathan Morton
2001-05-19 2:18 ` Rik van Riel
2001-05-19 2:56 ` Jonathan Morton
2001-05-16 17:57 ` Alfred Perlstein
2001-05-16 18:01 ` Matt Dillon
2001-05-16 18:10 ` Alfred Perlstein
[not found] <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com>
2001-05-18 20:13 ` Jonathan Morton
-- strict thread matches above, loose matches on Subject: below --
2001-05-07 21:16 Rik van Riel
2001-05-07 22:50 ` Matt Dillon
2001-05-07 23:35 ` Rik van Riel
2001-05-08 0:56 ` Matt Dillon
2001-05-12 14:23 ` Rik van Riel
2001-05-12 17:21 ` Matt Dillon
2001-05-12 21:17 ` Rik van Riel
2001-05-12 23:58 ` Matt Dillon
2001-05-13 17:22 ` Rik van Riel
2001-05-15 6:38 ` Terry Lambert
2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group
2001-05-15 15:31 ` Rik van Riel
2001-05-15 17:24 ` Matt Dillon
2001-05-15 23:55 ` Roger Larsson
2001-05-16 0:16 ` Matt Dillon
2001-05-16 8:23 ` Terry Lambert
2001-05-16 17:26 ` Matt Dillon
2001-05-08 20:52 ` Kirk McKusick
2001-05-09 0:18 ` Matt Dillon
2001-05-09 2:07 ` Peter Jeremy
2001-05-09 19:41 ` Matt Dillon
2001-05-12 14:28 ` Rik van Riel
2001-05-08 12:25 ` Scott F. Kaplan