* RE: on load control / process swapping
@ 2001-05-16 15:17 Charles Randall
2001-05-16 17:14 ` Matt Dillon
0 siblings, 1 reply; 39+ messages in thread
From: Charles Randall @ 2001-05-16 15:17 UTC (permalink / raw)
To: 'Matt Dillon', Roger Larsson
Cc: Rik van Riel, arch, linux-mm, sfkaplan
On a related note, we have a process (currently on Solaris, but possibly
moving to FreeBSD) that reads a 26 GB file just once for a database load. On
Solaris, we use the directio() function call to tell the filesystem to
bypass the buffer cache for this file descriptor.
From the Solaris directio() man page,
DIRECTIO_ON
The system behaves as though the application is not
going to reuse the file data in the near future. In
other words, the file data is not cached in the
system's memory pages.
We found that without this, Solaris was aggressively trying to cache the
huge input file at the expense of database load performance (but we knew
that we'd never access it again). For some applications this is a huge win
(random I/O on a file much larger than memory seems to be another case).
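As an illustration of the call being discussed, a minimal sketch of the Solaris pattern might look like the following (the path and buffer size are placeholders, and error handling is trimmed):

#include <sys/types.h>
#include <sys/fcntl.h>   /* directio() and DIRECTIO_ON on Solaris */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    static char buf[1 << 20];           /* 1 MB read buffer */
    ssize_t n;
    int fd = open("/data/load/input.dat", O_RDONLY);   /* placeholder path */

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Tell the filesystem this descriptor's data will not be reused,
     * so it should bypass the buffer cache (Solaris-specific call). */
    if (directio(fd, DIRECTIO_ON) < 0)
        perror("directio");             /* non-fatal: fall back to cached I/O */

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;                               /* hand each chunk to the loader here */

    close(fd);
    return 0;
}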
Would there be an advantage to having a similar feature in FreeBSD (if not
already present)?
-Charles
-----Original Message-----
From: Matt Dillon [mailto:dillon@earth.backplane.com]
Sent: Tuesday, May 15, 2001 6:17 PM
To: Roger Larsson
Cc: Rik van Riel; arch@FreeBSD.ORG; linux-mm@kvack.org;
sfkaplan@cs.amherst.edu
Subject: Re: on load control / process swapping
:Are the heuristics persistent?
:Or will the first use after boot use the rough prediction?
:For how long time will the heuristic stick? Suppose it is suddenly used in
:a slightly different way. Like two sequential readers instead of one...
:
:/RogerL
:Roger Larsson
:Skelleftea
:Sweden
It's based on the VM page cache, so it's adaptive over time. I wouldn't
call it persistent; it is nothing more than a simple heuristic that
'normally' throws a page away but 'sometimes' caches it. In other words,
you lose some performance on the frontend in order to gain some later
on. If you loop through a file enough times, most of the file
winds up getting cached. It's still experimental so it is only
lightly tied into the system. It seems to work, though, so at some
point in the future I'll probably try to put some significant prediction
in. But as I said, it's a very difficult thing to predict. You can't
just put your foot down and say 'I'll cache X amount of file Y'. That
doesn't work at all.
-Matt
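Matt's "normally throw it away, sometimes cache it" behaviour can be pictured with a toy admission policy; this is only an illustrative model, not the FreeBSD code he refers to, and the 1-in-N ratio is invented:

#include <stdlib.h>

/* Toy model: admit roughly one page in N from a sequential scan into
 * the cache.  After several passes over the same file most of it ends
 * up cached, which is the "lose a little up front, win later"
 * behaviour described above. */
#define ADMIT_ONE_IN 8

static int should_cache_page(void)
{
    return (rand() % ADMIT_ONE_IN) == 0;
}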
* Re: RE: on load control / process swapping
2001-05-16 15:17 on load control / process swapping Charles Randall
@ 2001-05-16 17:14 ` Matt Dillon
2001-05-16 17:41 ` Rik van Riel
0 siblings, 1 reply; 39+ messages in thread
From: Matt Dillon @ 2001-05-16 17:14 UTC (permalink / raw)
To: Charles Randall; +Cc: Roger Larsson, Rik van Riel, arch, linux-mm, sfkaplan
We've talked about implementing O_DIRECT. I think it's a good idea.
In regards to the particular case of scanning a huge multi-gigabyte
file, FreeBSD has a sequential detection heuristic which does a
pretty good job preventing cache blow-aways by depressing the priority
of the data as it is read or written. FreeBSD will still try to cache
a good chunk, but it won't sacrifice all available memory. If you
access the data via the VM system, through mmap, you get even more
control through the madvise() syscall.
-Matt
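A rough sketch of the mmap()/madvise() route Matt mentions, applied to a read-once scan of a very large file; the window size and the process_chunk() consumer are placeholders, and how aggressively each advice flag is honored varies by system:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void process_chunk(const char *p, size_t len);   /* placeholder consumer */

int load_once(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0)
        return -1;

    const off_t chunk = 256L * 1024 * 1024;      /* 256 MB windows, page aligned */
    for (off_t off = 0; off < st.st_size; off += chunk) {
        size_t len = (size_t)(st.st_size - off < chunk ? st.st_size - off : chunk);
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        /* We will read this window once, front to back: let the VM
         * read ahead aggressively and drop pages behind us. */
        madvise(p, len, MADV_SEQUENTIAL);
        process_chunk(p, len);

        /* We will not come back to this window; release its pages. */
        madvise(p, len, MADV_DONTNEED);
        munmap(p, len);
    }
    close(fd);
    return 0;
}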
* Re: RE: on load control / process swapping 2001-05-16 17:14 ` Matt Dillon @ 2001-05-16 17:41 ` Rik van Riel 2001-05-16 17:54 ` Matt Dillon 2001-05-16 17:57 ` Alfred Perlstein 0 siblings, 2 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-16 17:41 UTC (permalink / raw) To: Matt Dillon; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Wed, 16 May 2001, Matt Dillon wrote: > In regards to the particular case of scanning a huge multi-gigabyte > file, FreeBSD has a sequential detection heuristic which does a > pretty good job preventing cache blow-aways by depressing the priority > of the data as it is read or written. FreeBSD will still try to cache > a good chunk, but it won't sacrifice all available memory. If you > access the data via the VM system, through mmap, you get even more > control through the madvise() syscall. There's one thing "wrong" with the drop-behind idea though; it penalises data even when it's still in core and we're reading it for the second or third time. Maybe it would be better to only do drop-behind when we're actually allocating new memory for the vnode in question and let re-use of already present memory go "unpunished" ? Hmmm, now that I think about this more, it _could_ introduce some different fairness issues. Darn ;) regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: RE: on load control / process swapping
2001-05-16 17:41 ` Rik van Riel
@ 2001-05-16 17:54 ` Matt Dillon
2001-05-16 19:59 ` Rik van Riel
2001-05-18 5:58 ` Terry Lambert
2001-05-16 17:57 ` Alfred Perlstein
1 sibling, 2 replies; 39+ messages in thread
From: Matt Dillon @ 2001-05-16 17:54 UTC (permalink / raw)
To: Rik van Riel; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
It's not dropping the data, it's dropping the priority. And yes, it
does penalize the data somewhat. On the other hand, if the data happens
to still be in the cache and you scan it a second time, the page priority
gets bumped up relative to what it already was, so the net effect is that
the data becomes high priority after a few passes.
:Maybe it would be better to only do drop-behind when we're
:actually allocating new memory for the vnode in question and
:let re-use of already present memory go "unpunished" ?
You get an equivalent effect even without dropping the priority, because
you blow away prior pages when reading a file that is larger than main
memory, so they don't exist at all when you re-read. But you do not get
the expected 'recycling' characteristics versus the rest of the system if
you do not make a distinction between sequential and random access.
You want to slightly depress the priority behind a sequential access
because the 'cost' of re-reading the disk sequentially is nothing compared
to the cost of re-reading the disk randomly (by about a 30:1 ratio!). So
keeping randomly seek/read data is more important by degrees than keeping
sequentially read data. This isn't to say that it isn't important to try
to cache sequentially read data, just that the cost of throwing away
sequentially read data is much lower than the cost of throwing away
randomly read data on a general-purpose machine.
Terry's description of 'ld' mmap()ing and doing all sorts of random
seeking, causing most UNIXes, including FreeBSD, to have a brainfart when
the dataset is too big to fit in the cache, is true as far as it goes, but
there really isn't much we can do about that situation 'automatically'.
Without hints, the system can't predict the fact that it should be trying
to cache the whole of the object files being accessed randomly. A hint
could make performance much better... a simple madvise(... MADV_SEQUENTIAL)
on the mapped memory inside LD would probably be beneficial, as would
madvise(... MADV_WILLNEED).
-Matt
:Hmmm, now that I think about this more, it _could_ introduce
:some different fairness issues. Darn ;)
:
:regards,
:
:Rik
* Re: RE: on load control / process swapping 2001-05-16 17:54 ` Matt Dillon @ 2001-05-16 19:59 ` Rik van Riel 2001-05-16 20:41 ` Matt Dillon 2001-05-18 5:58 ` Terry Lambert 1 sibling, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-16 19:59 UTC (permalink / raw) To: Matt Dillon; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Wed, 16 May 2001, Matt Dillon wrote: > :There's one thing "wrong" with the drop-behind idea though; > :it penalises data even when it's still in core and we're > :reading it for the second or third time. > > It's not dropping the data, it's dropping the priority. And yes, it > does penalize the data somewhat. On the otherhand if the data happens > to still be in the cache and you scan it a second time, the page priority > gets bumped up But doesn't it get pushed _down_ again after the process has read the data? Or is this a part of the code outside of vm/* which I haven't read yet? regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: RE: on load control / process swapping
2001-05-16 19:59 ` Rik van Riel
@ 2001-05-16 20:41 ` Matt Dillon
0 siblings, 0 replies; 39+ messages in thread
From: Matt Dillon @ 2001-05-16 20:41 UTC (permalink / raw)
To: Rik van Riel; +Cc: Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
Well, I was going to answer, but I can't find the code. I'll have to
look at it more closely.
-Matt
* Re: on load control / process swapping 2001-05-16 17:54 ` Matt Dillon 2001-05-16 19:59 ` Rik van Riel @ 2001-05-18 5:58 ` Terry Lambert 2001-05-18 6:20 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Terry Lambert @ 2001-05-18 5:58 UTC (permalink / raw) To: Matt Dillon Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan Matt Dillon wrote: > Terry's description of 'ld' mmap()ing and doing all > sorts of random seeking causing most UNIXes, including > FreeBSD, to have a brainfart of the dataset is too big > to fit in the cache is true as far as it goes, but > there really isn't much we can do about that situation > 'automatically'. Without hints, the system can't predict > the fact that it should be trying to cache the whole of > the object files being accessed randomly. A hint could > make performance much better... a simple madvise(... > MADV_SEQUENTIAL) on the mapped memory inside LD would > probably be beneficial, as would madvise(... MADV_WILLNEED). I don't understand how either of those things could help but make overall performance worse. The problem is the program in question is seeking all over the place, potentially multiple times, in order to avoid building the table in memory itself. For many symbols, like "printf", it will hit the area of the library containing their addresses many, many times. The problem in this case is _truly_ that the program in question is _really_ trying to optimize its performance at the expense of other programs in the system. The system _needs_ to make page-ins by this program come _at the expense of this program_, rather than thrashing all other programs out of core, only to have the quanta given to these (now higher priority) programs used to thrash the pages back in, instead of doing real work. The problem is what to do about this badly behaved program, so that the system itself doesn't spend unnecessary time undoing its evil, and so that other (well behaved) programs are not unfairly penalized. Cutler suggested a working set quota (first in VMS, later in NT) to deal with these programs. -- Terry -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping
2001-05-18 5:58 ` Terry Lambert
@ 2001-05-18 6:20 ` Matt Dillon
2001-05-18 10:00 ` Andrew Reilly
2001-05-18 13:49 ` Jonathan Morton
0 siblings, 2 replies; 39+ messages in thread
From: Matt Dillon @ 2001-05-18 6:20 UTC (permalink / raw)
To: Terry Lambert
Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
:I don't understand how either of those things could help
:but make overall performance worse.
:
:The problem is the program in question is seeking all
:over the place, potentially multiple times, in order
:to avoid building the table in memory itself.
:
:For many symbols, like "printf", it will hit the area
:of the library containing their addresses many, many
:times.
:
:The problem in this case is _truly_ that the program in
:question is _really_ trying to optimize its performance
:at the expense of other programs in the system.
The linker is seeking randomly as a side effect of the linking algorithm.
It is not doing it on purpose to try to save memory. Forcing the VM
system to think it's sequential causes the VM system to perform
read-aheads, generally reducing the actual amount of physical seeking
that must occur by increasing the size of the chunks read from disk.
Even if the linker's dataset is huge, increasing the chunk size is
beneficial because linkers ultimately access the entire object file
anyway. Trying to save a few seeks is far more important than reading
extra data and having to throw half of it away.
:The problem is what to do about this badly behaved program,
:so that the system itself doesn't spend unnecessary time
:undoing its evil, and so that other (well behaved) programs
:are not unfairly penalized.
:
:Cutler suggested a working set quota (first in VMS, later
:in NT) to deal with these programs.
:
:-- Terry
The problem is not the resident set size, it's the seeking that the
program is causing as a matter of course. Be that as it may, the
resident set size can be limited with the 'memoryuse' sysctl. The
system imposes the specified limit only when the memory subsystem is
under pressure. You can also reduce the amount of random seeking the
linker does by ordering the object modules within the library to
forward-reference the dependencies.
-Matt
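For completeness, a resident-set cap like the one Matt mentions can also be expressed through the standard resource-limit interface; a sketch, assuming the platform actually honors RLIMIT_RSS (some systems ignore it) and using an arbitrary 64 MB figure:

#include <sys/resource.h>
#include <stdio.h>

/* Cap this process's (and its children's) resident set at 64 MB.
 * As with the 'memoryuse' limit discussed above, the limit is only
 * expected to bite when memory is actually under pressure. */
static int cap_rss(void)
{
    struct rlimit rl;

    rl.rlim_cur = 64UL * 1024 * 1024;   /* soft limit */
    rl.rlim_max = 64UL * 1024 * 1024;   /* hard limit */
    if (setrlimit(RLIMIT_RSS, &rl) < 0) {
        perror("setrlimit(RLIMIT_RSS)");
        return -1;
    }
    return 0;
}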
* Re: on load control / process swapping 2001-05-18 6:20 ` Matt Dillon @ 2001-05-18 10:00 ` Andrew Reilly 2001-05-18 13:49 ` Jonathan Morton 1 sibling, 0 replies; 39+ messages in thread From: Andrew Reilly @ 2001-05-18 10:00 UTC (permalink / raw) To: Matt Dillon Cc: Terry Lambert, Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Thu, May 17, 2001 at 11:20:23PM -0700, Matt Dillon wrote: >Terry wrote: > :The problem in this case is _truly_ that the program in > :question is _really_ trying to optimize its performance > :at the expense of other programs in the system. > > The linker is seeking randomly as a side effect of > the linking algorithm. It is not doing it on purpose to try > to save memory. Forcing the VM system to think it's > sequential causes the VM system to perform read-aheads, > generally reducing the actual amount of physical seeking > that must occur by increasing the size of the chunks > read from disk. Even if the linker's dataset is huge, > increasing the chunk size is beneficial because linkers > ultimately access the entire object file anyway. Trying > to save a few seeks is far more important then reading > extra data and having to throw half of it away. I know that this problem is real in the case of data base index accesses---databases have data sets larger than RAM almost by definition---and that the problem (of dealing with "randomly" accessed memory mapped files) should be neatly solved in general. But is this issue of linking really the lynch pin? Are there _any_ programs and library sets where the union of the code sizes is larger than physical memory? I haven't looked at the problem myself, but (on the surface) it doesn't seem too likely. There is a grand total of 90M of .a files on my system (/usr/lib, /usr/X11/lib, and /usr/local/lib), and I doubt that even a majority of them would be needed at once. -- Andrew -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-18 6:20 ` Matt Dillon 2001-05-18 10:00 ` Andrew Reilly @ 2001-05-18 13:49 ` Jonathan Morton 2001-05-19 2:18 ` Rik van Riel 1 sibling, 1 reply; 39+ messages in thread From: Jonathan Morton @ 2001-05-18 13:49 UTC (permalink / raw) To: Matt Dillon, Terry Lambert Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan > The problem is not the resident set size, it's the > seeking that the program is causing as a matter of > course. The RSS of 'ld' isn't the problem, no. However, the working-set idea would place an effective and sensible limit of the size of the disk cache, by ensuring that other apps aren't being paged out beyond their non-working sets. Does this make sense? FWIW, I've been running with a 2-line hack in my kernel for some weeks now, which essentially forces the RSS of each process not to be forced below some arbitrary "fair share" of the physical memory available. It's not a very clean hack, but it improves performance by a very large margin under a thrashing load. The only problem I'm seeing is a deadlock when I run out of VM completely, but I think that's a separate issue that others are already working on. To others: is there already a means whereby we can (almost) calculate the WS of a given process? The "accessed" flag isn't a good one, but maybe the 'age' value is better. However, I haven't quite clicked on how the 'age' value is affected in either direction. -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@cyberspace.org (not for attachments) big-mail: chromatix@penguinpowered.com uni-mail: j.d.morton@lancaster.ac.uk The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -----BEGIN GEEK CODE BLOCK----- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*) -----END GEEK CODE BLOCK----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-18 13:49 ` Jonathan Morton @ 2001-05-19 2:18 ` Rik van Riel 2001-05-19 2:56 ` Jonathan Morton 0 siblings, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-19 2:18 UTC (permalink / raw) To: Jonathan Morton Cc: Matt Dillon, Terry Lambert, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan On Fri, 18 May 2001, Jonathan Morton wrote: > FWIW, I've been running with a 2-line hack in my kernel for some weeks > now, which essentially forces the RSS of each process not to be forced > below some arbitrary "fair share" of the physical memory available. > It's not a very clean hack, but it improves performance by a very > large margin under a thrashing load. The only problem I'm seeing is a > deadlock when I run out of VM completely, but I think that's a > separate issue that others are already working on. I'm pretty sure I know what you're running into. Say you guarantee a minimum of 3% of memory for each process; now when you have 30 processes running your memory is full and you cannot reclaim any pages when one of the processes runs into a page fault. The minimum RSS guarantee is a really nice thing to prevent the proverbial root shell from thrashing, but it really only works if you drop such processes every once in a while and swap them out completely. You especially need to do this when you're getting tight on memory and you have idle processes sitting around using their minimum RSS worth of RAM ;) It'd work great together with load control though. I guess I should post a patch for - simple&naive - load control code once I've got the inodes and the dirty page writeout code balancing fixed. regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-19 2:18 ` Rik van Riel @ 2001-05-19 2:56 ` Jonathan Morton 0 siblings, 0 replies; 39+ messages in thread From: Jonathan Morton @ 2001-05-19 2:56 UTC (permalink / raw) To: Rik van Riel Cc: Matt Dillon, Terry Lambert, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan >> FWIW, I've been running with a 2-line hack in my kernel for some weeks >> now, which essentially forces the RSS of each process not to be forced >> below some arbitrary "fair share" of the physical memory available. >> It's not a very clean hack, but it improves performance by a very >> large margin under a thrashing load. The only problem I'm seeing is a >> deadlock when I run out of VM completely, but I think that's a >> separate issue that others are already working on. > >I'm pretty sure I know what you're running into. > >Say you guarantee a minimum of 3% of memory for each process; >now when you have 30 processes running your memory is full and >you cannot reclaim any pages when one of the processes runs >into a page fault. Actually I already thought of that one, and made it a "fair share" of the system rather than a fixed amount. IOW, the guaranteed amount is something like (total_memory / nr_processes). I think I was even sane enough to lower this value slightly to allow for some buffer/cache memory, but I didn't allow for locked pages (including the kernel itself). The deadlock happened when the swap ran out, not the physical RAM, and is independent of this particular hack - remember I'm running with some out_of_memory() fixes and some other hackery I did a month or so ago (remember that massive "OOM killer" thread?). I should try to figure those out and present cleaned-up versions for further perusal... -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@cyberspace.org (not for attachments) big-mail: chromatix@penguinpowered.com uni-mail: j.d.morton@lancaster.ac.uk The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -----BEGIN GEEK CODE BLOCK----- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*) -----END GEEK CODE BLOCK----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
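Jonathan's "fair share" guarantee reduces to a simple check at page-reclaim time; the following is a sketch of the idea only, not his actual two-line patch (which isn't shown in the thread), and all the names are invented:

/* Sketch of a fair-share RSS floor for the page reclaimer: do not
 * steal pages from a process whose resident set is already at or
 * below an equal share of reclaimable memory.  Locked and kernel
 * pages are ignored for simplicity. */
static unsigned long fair_share_pages(unsigned long reclaimable_pages,
                                      unsigned long nr_processes)
{
    return nr_processes ? reclaimable_pages / nr_processes
                        : reclaimable_pages;
}

static int may_steal_from(unsigned long proc_rss_pages,
                          unsigned long reclaimable_pages,
                          unsigned long nr_processes)
{
    return proc_rss_pages > fair_share_pages(reclaimable_pages, nr_processes);
}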
* Re: on load control / process swapping
2001-05-16 17:41 ` Rik van Riel
2001-05-16 17:54 ` Matt Dillon
@ 2001-05-16 17:57 ` Alfred Perlstein
2001-05-16 18:01 ` Matt Dillon
1 sibling, 1 reply; 39+ messages in thread
From: Alfred Perlstein @ 2001-05-16 17:57 UTC (permalink / raw)
To: Rik van Riel
Cc: Matt Dillon, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan
* Rik van Riel <riel@conectiva.com.br> [010516 13:42] wrote:
> On Wed, 16 May 2001, Matt Dillon wrote:
>
> > In regards to the particular case of scanning a huge multi-gigabyte
> > file, FreeBSD has a sequential detection heuristic which does a
> > pretty good job preventing cache blow-aways by depressing the priority
> > of the data as it is read or written. FreeBSD will still try to cache
> > a good chunk, but it won't sacrifice all available memory. If you
> > access the data via the VM system, through mmap, you get even more
> > control through the madvise() syscall.
>
> There's one thing "wrong" with the drop-behind idea though;
> it penalises data even when it's still in core and we're
> reading it for the second or third time.
>
> Maybe it would be better to only do drop-behind when we're
> actually allocating new memory for the vnode in question and
> let re-use of already present memory go "unpunished" ?
>
> Hmmm, now that I think about this more, it _could_ introduce
> some different fairness issues. Darn ;)
Both of you guys are missing the point.
The directio interface is meant to reduce the stress of a large
sequential operation on a file where caching is of no use.
Even if you depress the worthiness of the pages you've still blown
rather large amounts of unrelated data out of the cache in order to
allocate new cacheable pages.
A simple solution would involve passing along flags such that if the
IO occurs to a non-previously-cached page the buf/page is immediately
placed on the free list upon completion. That way the next IO can pull
the now useless bufferspace from the freelist.
Basically you add another buffer queue for "throw away" data that
exists as a "barely cached" queue. This way your normal data doesn't
compete on the LRU with non-cached data.
As a hack, it looks like one could use the QUEUE_EMPTYKVA buffer queue
under FreeBSD for this, however I think one might lose the minimal
amount of caching that could be done.
If the direct IO happens to a page that's previously cached you adhere
to the previous behavior.
A fancier approach might map user pages into the kernel to do the IO
directly, however on large MP systems this may cause pain because the
VM may need to issue IPIs to invalidate TLB entries.
It's quite simple in theory, the hard part is the code.
-Alfred Perlstein
--
Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.
http://www.egr.unlv.edu/~slumos/on-netbsd.html
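Alfred's "barely cached" queue amounts to a flag that travels with the I/O and is honored at completion time; a hypothetical sketch of that shape (none of these names exist in FreeBSD, and the placeholder release functions stand in for the real queue manipulation):

/* Hypothetical sketch: a per-I/O flag marking data as read-once, and
 * a completion path that retires such pages to the reuse list instead
 * of the normal LRU. */
#define IO_READ_ONCE  0x01              /* caller says: do not cache this */

struct io_buf {
    int flags;
    int was_cached;                     /* page was resident before the I/O */
};

void release_to_freelist(struct io_buf *bp);    /* placeholder */
void release_to_lru(struct io_buf *bp);         /* placeholder */

void io_done(struct io_buf *bp)
{
    if ((bp->flags & IO_READ_ONCE) && !bp->was_cached) {
        /* Fresh read-once data: recycle the buffer immediately so it
         * never competes with cached data on the LRU. */
        release_to_freelist(bp);
    } else {
        /* Previously cached (or ordinary) data keeps its LRU standing. */
        release_to_lru(bp);
    }
}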
* Re: on load control / process swapping 2001-05-16 17:57 ` Alfred Perlstein @ 2001-05-16 18:01 ` Matt Dillon 2001-05-16 18:10 ` Alfred Perlstein 0 siblings, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-16 18:01 UTC (permalink / raw) To: Alfred Perlstein Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan :Both of you guys are missing the point. : :The directio interface is meant to reduce the stress of a large :seqential operation on a file where caching is of no use. : :Even if you depress the worthyness of the pages you've still :blown rather large amounts of unrelated data out of the cache :in order to allocate new cacheable pages. : :A simple solution would involve passing along flags such that if :the IO occurs to a non-previously-cached page the buf/page is :immediately placed on the free list upon completion. That way the :next IO can pull the now useless bufferspace from the freelist. : :Basically you add another buffer queue for "throw away" data that :exists as a "barely cached" queue. This way your normal data :doesn't compete on the LRU with non-cached data. : :As a hack one it looks like one could use the QUEUE_EMPTYKVA :buffer queue under FreeBSD for this, however I think one might :loose the minimal amount of caching that could be done. : :If the direct IO happens to a page that's previously cached :you adhere to the previous behavior. : :A more fancy approach might map in user pages into the kernel to :do the IO directly, however on large MP this may cause pain because :the vm may need to issue ipi to invalidate tlb entries. : :It's quite simple in theory, the hard part is the code. : :-Alfred Perlstein I think someone tried to implement O_DIRECT a while back, but it was fairly complex to try to do away with caching entirely. I think our best bet to 'start' an implementation of O_DIRECT is to support the flag in open() and fcntl(), and have it simply modify the sequential detection heuristic to throw away pages and buffers rather then simply depressing their priority. Eventually we can implement the direct-I/O piece of the equation. I could do this first part in an hour, I think. When I get home.... -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
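From the application side, the flag Matt describes would be used the way O_DIRECT is used elsewhere; a hedged sketch, assuming posix_memalign() is available and that 4 KB alignment satisfies the platform's direct-I/O requirements:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

#ifndef O_DIRECT
#define O_DIRECT 0          /* fall back to ordinary cached I/O if unsupported */
#endif

int scan_once(const char *path)
{
    void *buf = NULL;
    ssize_t n;
    int fd = open(path, O_RDONLY | O_DIRECT);

    if (fd < 0) {
        perror("open");
        return -1;
    }
    /* Direct I/O typically requires sector- or page-aligned buffers. */
    if (posix_memalign(&buf, 4096, 1 << 20) != 0) {
        close(fd);
        return -1;
    }
    while ((n = read(fd, buf, 1 << 20)) > 0)
        ;                   /* consume the data once; nothing is cached */

    free(buf);
    close(fd);
    return 0;
}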
* Re: on load control / process swapping 2001-05-16 18:01 ` Matt Dillon @ 2001-05-16 18:10 ` Alfred Perlstein 0 siblings, 0 replies; 39+ messages in thread From: Alfred Perlstein @ 2001-05-16 18:10 UTC (permalink / raw) To: Matt Dillon Cc: Rik van Riel, Charles Randall, Roger Larsson, arch, linux-mm, sfkaplan * Matt Dillon <dillon@earth.backplane.com> [010516 14:01] wrote: > > I think someone tried to implement O_DIRECT a while back, but it > was fairly complex to try to do away with caching entirely. > > I think our best bet to 'start' an implementation of O_DIRECT is > to support the flag in open() and fcntl(), and have it simply > modify the sequential detection heuristic to throw away pages > and buffers rather then simply depressing their priority. yes, as i said: > :A simple solution would involve passing along flags such that if > :the IO occurs to a non-previously-cached page the buf/page is > :immediately placed on the free list upon completion. That way the > :next IO can pull the now useless bufferspace from the freelist. > : > :Basically you add another buffer queue for "throw away" data that > :exists as a "barely cached" queue. This way your normal data > :doesn't compete on the LRU with non-cached data. > > Eventually we can implement the direct-I/O piece of the equation. > > I could do this first part in an hour, I think. When I get home.... Thank you. -Alfred -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
[parent not found: <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com>]
* Re: on load control / process swapping [not found] <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com> @ 2001-05-18 20:13 ` Jonathan Morton 0 siblings, 0 replies; 39+ messages in thread From: Jonathan Morton @ 2001-05-18 20:13 UTC (permalink / raw) To: Mark_H_Johnson; +Cc: linux-mm >I'm not sure you have these items measured in the kernel at this point, but >VAX/VMS used the page replacement rate to control the working set size >(Linux term - resident set size) within three limits... > - minimum working set size > - maximum guaranteed working set size (under memory pressure) > - maximum extended working set size (no memory pressure) >The three sizes above were enforced on a per user basis. I could see using >the existing Linux RSS limit for the maximum guarantee (or extended) and >then ratios for the other items. Seems reasonable, but remember RSS != working set. Under "normal" conditions we want all processes to have all the memory they want, then when memory pressure encroaches we want to keep as many processes as possible with their working set swapped in (but no more). >There were several parameters - some on a per system basis and others on a >per user basis [I can't recall which were which] to control this >including... > - amount to increase the working set size (say 5-10% of the maximum) > - amount to decrease the working set size (usually about 1/2 the increase >size value) > - pages per second replaced in the working set to trigger a possible >increase (say 10) > - pages per second replaced in the working set to trigger a possible >decrease (say 2 or 1) >A new job would start at its minimum size and grow quickly to either the >maximum limit or its natural working set size. If at the limit, it would >thrash but not necessarily affect the other jobs on the system. I am not >sure how the numbers I listed would apply with a fast system with huge >memories - the values I listed were what I recall on what would be a small >system today (4M to 64M). Hmm, it looks to me like the algorithm above relies on a continuous rate of paging. This is a bad thing on a modern system where the swap device is so much slower than main memory. However, the idea is an interesting one and could possibly be adapted... The key thing is that maximum performance for a given process (particularly a small one) is when *no* paging is occurring in relation to it. Under memory pressure, this is quite hard to achieve unless the working set is already known. Thus the VMS model (if I understood it correctly) doesn't work so well for modern systems running Linux. What i was really asking, to make the question clearer is "how does page->age work? And if it's not suitable for WS calculation in the ways that I suspect, what else could be used - that is *already* instrumented?". -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@cyberspace.org (not for attachments) big-mail: chromatix@penguinpowered.com uni-mail: j.d.morton@lancaster.ac.uk The key to knowledge is not to rely on people to teach you it. Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/ -----BEGIN GEEK CODE BLOCK----- Version 3.12 GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*) -----END GEEK CODE BLOCK----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
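The VMS-style parameters Mark describes and Jonathan quotes translate naturally into a small control loop; a sketch, assuming per-process fault rates are already being sampled once per second, with all thresholds and names purely illustrative:

/* Grow a process's working-set limit when its recent fault rate is
 * high, shrink it when the rate is low, and keep the limit between a
 * per-process minimum and maximum. */
struct ws_limits {
    unsigned long min_pages;       /* minimum working set              */
    unsigned long max_pages;       /* maximum (extended) working set   */
    unsigned long cur_pages;       /* current limit for this process   */
};

#define WS_GROW_FAULT_RATE    10   /* faults/sec that trigger growth   */
#define WS_SHRINK_FAULT_RATE   2   /* faults/sec that allow shrinking  */
#define WS_GROW_STEP         128   /* pages                            */
#define WS_SHRINK_STEP        64   /* pages                            */

void ws_adjust(struct ws_limits *ws, unsigned long faults_per_sec)
{
    if (faults_per_sec >= WS_GROW_FAULT_RATE &&
        ws->cur_pages + WS_GROW_STEP <= ws->max_pages)
        ws->cur_pages += WS_GROW_STEP;
    else if (faults_per_sec <= WS_SHRINK_FAULT_RATE &&
             ws->cur_pages >= ws->min_pages + WS_SHRINK_STEP)
        ws->cur_pages -= WS_SHRINK_STEP;
}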
* on load control / process swapping @ 2001-05-07 21:16 Rik van Riel 2001-05-07 22:50 ` Matt Dillon 2001-05-08 12:25 ` Scott F. Kaplan 0 siblings, 2 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-07 21:16 UTC (permalink / raw) To: arch; +Cc: linux-mm, Matt Dillon, sfkaplan Hi, after staring at the code for a long long time, I finally figured out exactly why FreeBSD's load control code (the process swapping in vm_glue.c) can never work in many scenarios. In short, the process suspension / wake up code only does load control in the sense that system load is reduced, but absolutely no effort is made to ensure that individual programs can run without thrashing. This, of course, kind of defeats the purpose of doing load control in the first place. To see this situation in some more detail, lets first look at how the current process suspension code has evolved over time. Early paging Unixes, including earlier BSDs, had a rate-limited clock algorithm for the pageout code, where the VM subsystem would only scan (and page) memory out at a rate of fastscan pages per second. Whenever the paging system wasn't able to keep up, free memory would get below a certain threshold and memory load control (in the form of process suspension) kicked in. As soon as free memory (averaged over a few seconds) got over this threshold, processes get swapped in again. Because of the exact "speed limit" for the paging code, this would give a slow rotation of memory-resident progesses at a paging rate well below the thashing threshold. More modern Unixes, like FreeBSD, NetBSD or Linux, however, don't have the artificial speed limit on pageout. This means the pageout code can go on freeing memory until well beyond the trashing point of the system. It also means that the amount of free memory is no longer any indication of whether the system is thrashing or not. Add to that the fact that the classical load control in BSD resumes a suspended task whenever the system is above the (now not very meaningful) free memory threshold, regardless of whether the resident tasks have had the opportunity to make any progress ... which of course only encourages more thrashing instead of letting the system work itself out of the overload situation. Any solution will have to address the following points: 1) allow the resident processes to stay resident long enough to make progess 2) make sure the resident processes aren't thrashing, that is, don't let new processes back in memory if none of the currently resident processes is "ready" to be suspended 3) have a mechanism to detect thrashing in a VM subsystem which isn't rate-limited (hard?) and, for extra brownie points: 4) fairness, small processes can be paged in and out faster, so we can suspend&resume them faster; this has the side effect of leaving the proverbial root shell more usable 5) make sure already resident processes cannot create a situation that'll keep the swapped out tasks out of memory forever ... but don't kill performance either, since bad performance means we cannot get out of the bad situation we're in Points 1), 2) and 4) are relatively easy to address by simply keeping resident tasks unswappable for a long enough time that they've been able to do real work in an environment where 3) indicates we're not thrashing. 3) is the hard part. We know we're not thrashing when we don't have ongoing page faults all the time, but (say) only 50% of the time. 
However, I still have no idea how to determine when we _are_ thrashing,
since a system which always has 10 ongoing page faults may still be
functioning without thrashing...
This is the part where I cannot hand over a ready solution but where we
have to figure out a solution together.
(and it's also the reason I cannot "send a patch" ... I know the current
scheme cannot possibly work all the time, I understand why, but I just
don't have a solution to the problem ... yet)
regards,
Rik
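One way to make Rik's "50% of the time" intuition concrete is to measure the fraction of available time that resident processes spend stalled on page-fault I/O; a sketch, with an invented threshold and invented counter names:

/* Sample, once per second, how much time runnable processes spent
 * blocked in page-fault I/O versus actually running, and call it
 * thrashing when the stalled fraction dominates. */
struct thrash_sample {
    unsigned long run_ticks;        /* ticks spent executing            */
    unsigned long faultwait_ticks;  /* ticks blocked on page-fault I/O  */
};

int looks_like_thrashing(const struct thrash_sample *s)
{
    unsigned long total = s->run_ticks + s->faultwait_ticks;

    if (total == 0)
        return 0;                   /* idle system: nothing to decide */
    /* more than half of all available time lost to paging */
    return s->faultwait_ticks * 2 > total;
}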
* Re: on load control / process swapping 2001-05-07 21:16 Rik van Riel @ 2001-05-07 22:50 ` Matt Dillon 2001-05-07 23:35 ` Rik van Riel 2001-05-08 20:52 ` Kirk McKusick 2001-05-08 12:25 ` Scott F. Kaplan 1 sibling, 2 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-07 22:50 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan This is accomplished as a side effect to the way the page queues are handled. A page placed in the active queue is not allowed to be moved out of that queue for a minimum period of time based on page aging. See line 500 or so of vm_pageout.c (in -stable) . Thus when a process wakes up and pages a bunch of pages in, those pages are guarenteed to stay in-core for a period of time no matter what level of memory stress is occuring. :2) make sure the resident processes aren't thrashing, : that is, don't let new processes back in memory if : none of the currently resident processes is "ready" : to be suspended When a process is swapped out, the process is removed from the run queue and the P_INMEM flag is cleared. The process is only woken up when faultin() is called (vm_glue.c line 312). faultin() is only called from the scheduler() (line 340 of vm_glue.c) and the scheduler only runs when the VM system indicates a minimum number of free pages are available (vm_page_count_min()), which you can adjust with the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings on how much memory the system has). So what occurs is that the system comes under extreme memory pressure and starts to swapout blocked processes. This reduces memory pressure over time. When memory pressure is sufficiently reudced the scheduler wakes up a swapped-out process (one at a time). There might be some fine tuning that we can do here, such as try to choose a better process to swapout (right now it's priority based which isn't the best way to do it). :3) have a mechanism to detect thrashing in a VM : subsystem which isn't rate-limited (hard?) In FreeBSD, rate-limiting is a function of a lightly loaded system. We rate-limit page laundering (pageouts). However, if the rate-limited laundering is not sufficient to reach our free + cache page targets, we take another laundering loop and this time do not limit it at all. Thus under heavy memory pressure, no real rate limiting occurs. The system will happily pagein and pageout megabytes/sec. The reason we do this is because David Greenman and John Dyson found a long time ago that attempting to rate limit paging does not actually solve the thrashing problem, it actually makes it worse... So they solved the problem another way (see my answers for #1 and #2). It isn't the paging operations themselves that cause thrashing. :and, for extra brownie points: :4) fairness, small processes can be paged in and out : faster, so we can suspend&resume them faster; this : has the side effect of leaving the proverbial root : shell more usable Small process can contribute to thrashing as easily as large processes can under extreme memory pressure... for example, take an overloaded shell machine. *ALL* processes are 'small' processes in that case, or most of them are, and in great numbers they can be the cause. So no test that specifically checks the size of the process can be used to give it any sort of priority. Additionally, *idle* small processes are also great contributers to the VM subsystem in regards to clearing out idle pages. 
For example, on a heavily loaded shell machine more than 80% of the
'small processes' have been idle for long periods of time and it is
exactly our ability to page them out that allows us to extend the
machine's operational life and move the thrashing threshold farther
away. The last thing we want to do is make a 'fix' that prevents us
from paging out idle small processes. It would kill the machine.
:5) make sure already resident processes cannot create
: a situation that'll keep the swapped out tasks out
: of memory forever ... but don't kill performance either,
: since bad performance means we cannot get out of the
: bad situation we're in
When the system starts swapping processes out, it continues to swap
them out until memory pressure goes down. With memory pressure down
processes are swapped back in again one at a time, typically in FIFO
order. So this situation will generally not occur.
Basically we have all the algorithms in place to deal with thrashing.
I'm sure that there are a few places where we can optimize things...
for example, we can certainly tune the swapout algorithm itself.
-Matt
:regards,
:
:Rik
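The swap-in gate Matt describes reduces to a threshold test on recovered memory; a sketch with illustrative names only (the real FreeBSD logic lives in vm_glue.c and vm_pageout.c as he notes, and is not reproduced here):

/* Only fault one suspended process back in when free + cache pages
 * have recovered past a threshold, so swapins cannot immediately
 * re-create the pressure that forced the swapouts. */
struct vm_counters {
    unsigned long free_pages;
    unsigned long cache_pages;
    unsigned long swapin_threshold;     /* cf. the vm.v_free_min tuning */
};

int may_swap_one_process_in(const struct vm_counters *vm)
{
    return vm->free_pages + vm->cache_pages > vm->swapin_threshold;
}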
* Re: on load control / process swapping 2001-05-07 22:50 ` Matt Dillon @ 2001-05-07 23:35 ` Rik van Riel 2001-05-08 0:56 ` Matt Dillon 2001-05-08 20:52 ` Kirk McKusick 1 sibling, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-07 23:35 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Mon, 7 May 2001, Matt Dillon wrote: > :1) allow the resident processes to stay resident long > : enough to make progess > > This is accomplished as a side effect to the way the page queues > are handled. A page placed in the active queue is not allowed > to be moved out of that queue for a minimum period of time based > on page aging. See line 500 or so of vm_pageout.c (in -stable) . > > Thus when a process wakes up and pages a bunch of pages in, those > pages are guarenteed to stay in-core for a period of time no matter > what level of memory stress is occuring. I don't see anything limiting the speed at which the active list is scanned over and over again. OTOH, you are right that a failure to deactivate enough pages will trigger the swapout code ..... This sure is a subtle interaction ;) > :2) make sure the resident processes aren't thrashing, > : that is, don't let new processes back in memory if > : none of the currently resident processes is "ready" > : to be suspended > > When a process is swapped out, the process is removed from the run > queue and the P_INMEM flag is cleared. The process is only woken up > when faultin() is called (vm_glue.c line 312). faultin() is only > called from the scheduler() (line 340 of vm_glue.c) and the scheduler > only runs when the VM system indicates a minimum number of free pages > are available (vm_page_count_min()), which you can adjust with > the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings > on how much memory the system has). But ... is this a good enough indication that the processes currently resident have enough memory available to make any progress ? Especially if all the currently resident processes are waiting in page faults, won't that make it easier for the system to find pages to swap out, etc... ? One thing I _am_ wondering though: the pageout and the pagein thresholds are different. Can't this lead to problems where we always hit both the pageout threshold -and- the pagein threshold and the system thrashes swapping processes in and out ? > :3) have a mechanism to detect thrashing in a VM > : subsystem which isn't rate-limited (hard?) > > In FreeBSD, rate-limiting is a function of a lightly loaded system. > We rate-limit page laundering (pageouts). However, if the rate-limited > laundering is not sufficient to reach our free + cache page targets, > we take another laundering loop and this time do not limit it at all. > > Thus under heavy memory pressure, no real rate limiting occurs. The > system will happily pagein and pageout megabytes/sec. The reason we > do this is because David Greenman and John Dyson found a long time > ago that attempting to rate limit paging does not actually solve the > thrashing problem, it actually makes it worse... So they solved the > problem another way (see my answers for #1 and #2). It isn't the > paging operations themselves that cause thrashing. Agreed on all points ... I'm just wondering how well 1) and 2) still work after all the changes that were made to the VM in the last few years. They sure are subtle ... 
> :and, for extra brownie points: > :4) fairness, small processes can be paged in and out > : faster, so we can suspend&resume them faster; this > : has the side effect of leaving the proverbial root > : shell more usable > > Small process can contribute to thrashing as easily as large > processes can under extreme memory pressure... for example, > take an overloaded shell machine. *ALL* processes are 'small' > processes in that case, or most of them are, and in great numbers > they can be the cause. So no test that specifically checks the > size of the process can be used to give it any sort of priority. There's a test related to 2) though ... A small process needs to be in memory less time than a big process in order to make progress, so it can be swapped out earlier. It can also be swapped back in earlier, giving small processes shorter "time slices" for swapping than what large processes have. I'm not quite sure how much this would matter, though... > :5) make sure already resident processes cannot create > : a situation that'll keep the swapped out tasks out > : of memory forever ... but don't kill performance either, > : since bad performance means we cannot get out of the > : bad situation we're in > > When the system starts swapping processes out, it continues to swap > them out until memory pressure goes down. With memory pressure down > processes are swapped back in again one at a time, typically in FIFO > order. So this situation will generally not occur. > > Basically we have all the algorithms in place to deal with thrashing. > I'm sure that there are a few places where we can optimize things... > for example, we can certainly tune the swapout algorithm itself. Interesting, FreeBSD indeed _does_ seem to have all of the things in place (though the interactions between the various parts seem to be carefully hidden ;)). They indeed should work for lots of scenarios, but things like the subtlety of some of the code and the fact that the swapin and swapout thresholds are fairly unrelated look a bit worrying... regards, Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-07 23:35 ` Rik van Riel @ 2001-05-08 0:56 ` Matt Dillon 2001-05-12 14:23 ` Rik van Riel 0 siblings, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-08 0:56 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan :> to be moved out of that queue for a minimum period of time based :> on page aging. See line 500 or so of vm_pageout.c (in -stable) . :> :> Thus when a process wakes up and pages a bunch of pages in, those :> pages are guarenteed to stay in-core for a period of time no matter :> what level of memory stress is occuring. : :I don't see anything limiting the speed at which the active list :is scanned over and over again. OTOH, you are right that a failure :to deactivate enough pages will trigger the swapout code ..... : :This sure is a subtle interaction ;) Look at the loop line 1362 of vm_pageout.c. Note that it enforces a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable to clean sufficient pages in two loops. The tsleep is not woken up by anyone while waiting that 1/2 second becuase vm_pages_needed has not been cleared yet. This is what is limiting the page queue scan. :> When a process is swapped out, the process is removed from the run :> queue and the P_INMEM flag is cleared. The process is only woken up :> when faultin() is called (vm_glue.c line 312). faultin() is only :> called from the scheduler() (line 340 of vm_glue.c) and the scheduler :> only runs when the VM system indicates a minimum number of free pages :> are available (vm_page_count_min()), which you can adjust with :> the vm.v_free_min sysctl (usually represents 1-9 megabytes, dependings :> on how much memory the system has). : :But ... is this a good enough indication that the processes :currently resident have enough memory available to make any :progress ? Yes. Consider detecting the difference between a large process accessing its pages randomly, and a small process accessing a relatively small set of pages over and over again. Now consider what happens when the system gets overloaded. The small process will be able to access its pages enough that they will get page priority over the larger process. The larger process, due to the more random accesses (or simply the fact that it is accessing a larger set of pages) will tend to stall more on pagein I/O which has the side effect of reducing the large process's access rate on all of its pages. The result: small processes get more priority just by being small. :Especially if all the currently resident processes are waiting :in page faults, won't that make it easier for the system to find :pages to swap out, etc... ? : :One thing I _am_ wondering though: the pageout and the pagein :thresholds are different. Can't this lead to problems where we :always hit both the pageout threshold -and- the pagein threshold :and the system thrashes swapping processes in and out ? The system will not page out a page it has just paged in due to the center-of-the-road initialization of act_count (the page aging). My experience at BEST was that both pagein and pageout activity occured simultaniously, but the fact had no detrimental effect on the system. You have to treat the pagein and pageout operations independantly because, in fact, they are only weakly related to each other. The only optimization you make, to reduce thrashing, is to not allow a just-paged-in page to immediately turn around and be paged out. 
I could probably make this work even better by setting the vm_page_t's act_count to its max value when paging in from swap. I'll think about doing that. The pagein and pageout rates have nothing to do with thrashing, per say, and should never be arbitrarily limited. Consider the difference between a system that is paing heavily and a system with only two small processes (like cp's) competing for disk I/O. Insofar as I/O goes, there is no difference. You can have a perfectly running system with high pagein and pageout rates. It's only when the paging I/O starts to eat into pages that are in active use where thrashing begins to occur. Think of a hotdog being eaten from both ends by two lovers. Memory pressure (active VM pages) eat away at one end, pageout I/O eats away at the other. You don't get fireworks until they meet. :> ago that attempting to rate limit paging does not actually solve the :> thrashing problem, it actually makes it worse... So they solved the :> problem another way (see my answers for #1 and #2). It isn't the :> paging operations themselves that cause thrashing. : :Agreed on all points ... I'm just wondering how well 1) and 2) :still work after all the changes that were made to the VM in :the last few years. They sure are subtle ... The algorithms mostly stayed the same. Much of the work was to remove artificial limitations that were reducing performance (due to the existance of greater amounts of memory, faster disks, and so forth...). I also spent a good deal of time removing 'restart' cases from the code that was causing a lot of cpu-wasteage in certain cases. What few restart cases remain just don't occur all that often. And I've done other things like extend the heuristics we already use for read()/write() to the VM system and change heuristic variables into per-vm-map elements rather then sharing them with read/write within the vnode. Etc. :> Small process can contribute to thrashing as easily as large :> processes can under extreme memory pressure... for example, :> take an overloaded shell machine. *ALL* processes are 'small' :> processes in that case, or most of them are, and in great numbers :> they can be the cause. So no test that specifically checks the :> size of the process can be used to give it any sort of priority. : :There's a test related to 2) though ... A small process needs :to be in memory less time than a big process in order to make :progress, so it can be swapped out earlier. Not necessarily. It depends whether the small process is cpu-bound or interactive. A cpu-bound small process should be allowed to run and not swapped out. An interactive small process can be safely swapped if idle for a period of time, because it can be swapped back in very quickly. It should not be swapped if it isn't idle (someone is typing, for example), because that would just waste disk I/O paging out and then paging right back in. You never want to swapout a small process gratuitously simply because it is small. :It can also be swapped back in earlier, giving small processes :shorter "time slices" for swapping than what large processes :have. I'm not quite sure how much this would matter, though... Both swapin and swapout activities are demand paged, but will be clustered if possible. I don't think there would be any point trying to conditionalize the algorithm based on the size of the process. The size has its own indirect positive effects which I think are sufficient. 
:Interesting, FreeBSD indeed _does_ seem to have all of the things in :place (though the interactions between the various parts seem to be :carefully hidden ;)). : :They indeed should work for lots of scenarios, but things like the :subtlety of some of the code and the fact that the swapin and :swapout thresholds are fairly unrelated look a bit worrying... : :regards, : :Rik I don't think it's possible to write a nice neat thrash-handling algorithm. It's a bunch of algorithms all working together, all closely tied to the VM page cache. Each taken alone is fairly easy to describe and understand. All of them together result in complex interactions that are very easy to break if you make a mistake. It usually takes me a couple of tries to get a solution to a problem in place without breaking something else (performance-wise) in the process. For example, I fubar'd heavy load performance for a month in FreeBSD-4.2 when I 'fixed' the pageout scan laundering algorithm. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-08 0:56 ` Matt Dillon @ 2001-05-12 14:23 ` Rik van Riel 2001-05-12 17:21 ` Matt Dillon 2001-05-12 23:58 ` Matt Dillon 0 siblings, 2 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-12 14:23 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Mon, 7 May 2001, Matt Dillon wrote: > Look at the loop line 1362 of vm_pageout.c. Note that it enforces > a HZ/2 tsleep (2 scans per second) if the pageout daemon is unable > to clean sufficient pages in two loops. The tsleep is not woken up > by anyone while waiting that 1/2 second becuase vm_pages_needed has > not been cleared yet. This is what is limiting the page queue scan. Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that only kicks in when the system is under very heavy memory pressure. That explains why FreeBSD's thrashing detection code works... ;) (I'm not convinced, though, that limiting the speed at which we scan the active list is a good thing. There are some arguments in favour of speed limiting, but it mostly seems to come down to a short-cut to thrashing detection...) > :But ... is this a good enough indication that the processes > :currently resident have enough memory available to make any > :progress ? > > Yes. Consider detecting the difference between a large process accessing > its pages randomly, and a small process accessing a relatively small > set of pages over and over again. Now consider what happens when the > system gets overloaded. The small process will be able to access its > pages enough that they will get page priority over the larger process. > The larger process, due to the more random accesses (or simply the fact > that it is accessing a larger set of pages) will tend to stall more on > pagein I/O which has the side effect of reducing the large process's > access rate on all of its pages. The result: small processes get more > priority just by being small. But if the larger processes never get a chance to make decent progress without thrashing, won't your system be slowed down forever by these (thrashing) large processes? It's nice to protect your small processes from the large ones, but if the large processes don't get to run to completion the system will never get out of thrashing... > :Especially if all the currently resident processes are waiting > :in page faults, won't that make it easier for the system to find > :pages to swap out, etc... ? > : > :One thing I _am_ wondering though: the pageout and the pagein > :thresholds are different. Can't this lead to problems where we > :always hit both the pageout threshold -and- the pagein threshold > :and the system thrashes swapping processes in and out ? > > The system will not page out a page it has just paged in due to the > center-of-the-road initialization of act_count (the page aging). Indeed, the speed limiting of the pageout scanning takes care of this. But still, having the swapout threshold defined as being short of inactive pages while the swapin threshold uses the number of free+cache pages as an indication could lead to the situation where you suspend and wake up processes while it isn't needed. Or worse, suspending one process which easily fit in memory and then waking up another process, which cannot be swapped in because the first process' memory is still sitting in RAM and cannot be removed yet due to the pageout scan speed limiting (and also cannot be used, because we suspended the process). 
The chance of this happening could be quite big in some situations because the swapout and swapin thresholds are measuring things that are only indirectly related... > The pagein and pageout rates have nothing to do with thrashing, per say, > and should never be arbitrarily limited. But they are, with the pageout daemon going to sleep for half a second if it doesn't succeed in freeing enough memory at once. It even does this if a large part of the memory on the active list belongs to a process which has just been suspended because of thrashing... > I don't think it's possible to write a nice neat thrash-handling > algorithm. It's a bunch of algorithms all working together, all > closely tied to the VM page cache. Each taken alone is fairly easy > to describe and understand. All of them together result in complex > interactions that are very easy to break if you make a mistake. Heheh, certainly true ;) cheers, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
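The mismatch described above is easier to see with the two tests side by side (an illustrative sketch; the field and function names are invented, not the FreeBSD macros):

    /* Illustrative only: the two decisions consult different quantities. */
    struct vmstats {
        int free, cache, inactive, active;
        int free_min, inactive_target;
    };

    /* Swap-out side: triggered by a shortage of inactive pages. */
    static int want_swapout(const struct vmstats *v)
    {
        return v->inactive < v->inactive_target;
    }

    /* Swap-in side: triggered by free + cache climbing past a minimum. */
    static int want_swapin(const struct vmstats *v)
    {
        return v->free + v->cache > v->free_min;
    }

    /* Nothing ties the two predicates together, so both can be true at
     * once: suspend one process and immediately start faulting another
     * back in, even though the first one's memory is still resident. */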
* Re: on load control / process swapping 2001-05-12 14:23 ` Rik van Riel @ 2001-05-12 17:21 ` Matt Dillon 2001-05-12 21:17 ` Rik van Riel 2001-05-12 23:58 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-12 17:21 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan : :Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that :only kicks in when the system is under very heavy memory pressure. : :That explains why FreeBSD's thrashing detection code works... ;) : :(I'm not convinced, though, that limiting the speed at which we :scan the active list is a good thing. There are some arguments :in favour of speed limiting, but it mostly seems to come down :to a short-cut to thrashing detection...) Note that there is a big distinction between limiting the page queue scan rate (which we do not do), and sleeping between full scans (which we do). Limiting the page queue scan rate on a page-by-page basis does not scale. Sleeping in between full queue scans (in an extreme case) does scale. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
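The sleep happens once per full scan rather than once per page, which is the point being made above. A simulation-style sketch of how the throttle behaves under sustained pressure (invented helper names; not the actual vm_pageout.c code):

    #include <stdio.h>
    #include <unistd.h>

    #define HZ 100

    static int scan_page_queues(void) { return 0; }  /* pretend nothing could be freed */
    static int page_shortfall(void)   { return 32; } /* pretend the pressure never lets up */

    int main(void)
    {
        int failed_passes = 0;

        for (int scan = 0; scan < 6; scan++) {
            if (scan_page_queues() < page_shortfall())
                failed_passes++;
            else
                failed_passes = 0;

            if (failed_passes >= 2) {
                /* Stands in for tsleep(&vm_pages_needed, PVM, "psleep", hz / 2);
                 * nothing wakes it early because vm_pages_needed is still set,
                 * so the queues get scanned at most twice a second. */
                printf("pass %d came up short, sleeping HZ/2 (%d ticks)\n", scan, HZ / 2);
                usleep(500000);
                failed_passes = 0;
            }
        }
        return 0;
    }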
* Re: on load control / process swapping 2001-05-12 17:21 ` Matt Dillon @ 2001-05-12 21:17 ` Rik van Riel 0 siblings, 0 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-12 21:17 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Sat, 12 May 2001, Matt Dillon wrote: > :Ahhh, so FreeBSD _does_ have a maxscan equivalent, just one that > :only kicks in when the system is under very heavy memory pressure. > : > :That explains why FreeBSD's thrashing detection code works... ;) > > Note that there is a big distinction between limiting the page > queue scan rate (which we do not do), and sleeping between full > scans (which we do). Limiting the page queue scan rate on a > page-by-page basis does not scale. Sleeping in between full queue > scans (in an extreme case) does scale. I'm not convinced it's doing a very useful thing, though ;) (see the rest of the email you replied to) Rik -- Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-12 14:23 ` Rik van Riel 2001-05-12 17:21 ` Matt Dillon @ 2001-05-12 23:58 ` Matt Dillon 2001-05-13 17:22 ` Rik van Riel 1 sibling, 1 reply; 39+ messages in thread From: Matt Dillon @ 2001-05-12 23:58 UTC (permalink / raw) To: Rik van Riel; +Cc: arch, linux-mm, sfkaplan Consider the case where you have one large process and many small processes. If you were to skew things to allow the large process to run at the cost of all the small processes, you have just inconvenienced 98% of your users so one ozob can run a big job. Not only that, but there is no guarentee that the 'big job' will ever finish (a topic of many a paper on scheduling, BTW)... what if it's been running for hours and still has hours to go? Do we blow away the rest of the system to let it run? What if there are several big jobs? If you skew things in favor of one the others could take 60 seconds *just* to recover their RSS when they are finally allowed to run. So much for timesharing... you would have to run each job exclusively for 5-10 minutes at a time to get any sort of effiency, which is not practical in a timeshare system. So there is really very little that you can do. :Indeed, the speed limiting of the pageout scanning takes care of :this. But still, having the swapout threshold defined as being :short of inactive pages while the swapin threshold uses the number :of free+cache pages as an indication could lead to the situation :where you suspend and wake up processes while it isn't needed. : :Or worse, suspending one process which easily fit in memory and :then waking up another process, which cannot be swapped in because :the first process' memory is still sitting in RAM and cannot be :removed yet due to the pageout scan speed limiting (and also cannot :be used, because we suspended the process). We don't suspend running processes, but I do believe FreeBSD is still vulnerable to this issue. Suspending the marked process when it hits the vm_fault code is a good idea and would solve the problem. If the process never takes an allocation fault, it probably doesn't have to be swapped out. The normal pageout would suffice for that process. :> The pagein and pageout rates have nothing to do with thrashing, per say, :> and should never be arbitrarily limited. : :But they are, with the pageout daemon going to sleep for half a :second if it doesn't succeed in freeing enough memory at once. :It even does this if a large part of the memory on the active :list belongs to a process which has just been suspended because :of thrashing... No. I did say the code was complex. A process which has been suspended for thrashing gets all of its pages depressed in priority. The page daemon would have no problem recovering the pages. See line 1458 of vm_pageout.c. This code also enforces the 'memoryuse' resource limit (which is perhaps even more important). It is not necessary to try to launder the pages immediately. Simply depressing their priority is sufficient and it allows for quicker recovery when the thrashing goes away. It also allows us to implement the vm.swap_idle_{threshold1,threshold2,enabled} sysctls trivially, which results in proactive swapping that is extremely useful in certain situations (like shell machines with lots of idle users). The pagedaemon gets behind when there are too many active pages in the system and the pagedaemon is unable to move them to the inactive queue due to the pages still being very active... 
that is, when the active resident set for all processes in the system exceeds available memory. This is what triggers thrashing. Swapping has the side effect of reducing the total active resident set for the system as a whole, fixing the thrashing problem. -Matt :> I don't think it's possible to write a nice neat thrash-handling :> algorithm. It's a bunch of algorithms all working together, all :> closely tied to the VM page cache. Each taken alone is fairly easy :> to describe and understand. All of them together result in complex :> interactions that are very easy to break if you make a mistake. : :Heheh, certainly true ;) : :cheers, : :Rik -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
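For readers without the tree handy, the depress-rather-than-launder behaviour described above reduces to something like this (an illustrative sketch, not the code at line 1458; the constants are invented):

    #include <stddef.h>

    #define ACT_DECLINE 1
    #define ACT_MIN     0

    struct vm_page {
        int act_count;             /* page aging counter */
        struct vm_page *next;
    };

    /*
     * When a process is marked swapped-out (or is over its 'memoryuse'
     * limit), its pages are not laundered on the spot; they are just aged
     * hard so the regular pageout scan reclaims them first.  If the
     * pressure goes away before the scan reaches them, nothing was lost.
     */
    static void depress_process_pages(struct vm_page *resident)
    {
        for (struct vm_page *m = resident; m != NULL; m = m->next) {
            m->act_count -= ACT_DECLINE * 4;   /* made-up factor */
            if (m->act_count < ACT_MIN)
                m->act_count = ACT_MIN;
        }
    }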
* Re: on load control / process swapping 2001-05-12 23:58 ` Matt Dillon @ 2001-05-13 17:22 ` Rik van Riel 2001-05-15 6:38 ` Terry Lambert 0 siblings, 1 reply; 39+ messages in thread From: Rik van Riel @ 2001-05-13 17:22 UTC (permalink / raw) To: Matt Dillon; +Cc: arch, linux-mm, sfkaplan On Sat, 12 May 2001, Matt Dillon wrote: > :But if the larger processes never get a chance to make decent > :progress without thrashing, won't your system be slowed down > :forever by these (thrashing) large processes? > : > :It's nice to protect your small processes from the large ones, > :but if the large processes don't get to run to completion the > :system will never get out of thrashing... > > Consider the case where you have one large process and many small > processes. If you were to skew things to allow the large process to > run at the cost of all the small processes, you have just inconvenienced > 98% of your users so one ozob can run a big job. So we should not allow just one single large job to take all of memory, but we should allow some small jobs in memory too. > What if there are several big jobs? If you skew things in favor of > one the others could take 60 seconds *just* to recover their RSS when > they are finally allowed to run. So much for timesharing... you > would have to run each job exclusively for 5-10 minutes at a time > to get any sort of effiency, which is not practical in a timeshare > system. So there is really very little that you can do. If you don't do this very slow swapping, NONE of the big tasks will have the opportunity to make decent progress and the system will never get out of thrashing. If we simply make the "swap time slices" for larger processes larger than for smaller processes we: 1) have a better chance of the large jobs getting any work done 2) won't have the large jobs artificially increase memory load, because all time will be spent removing each other's RSS 3) can have more small jobs in memory at once, due to 2) 4) can be better for interactive performance due to 3) 5) have a better chance of getting out of the overload situation sooner I realise this would make the scheduling algorithm slightly more complex and I'm not convinced doing this would be worth it myself, but we may want to do some brainstorming over this ;) regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
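A back-of-the-envelope version of the "swap time slice" idea, just to make the proposal concrete (every name and constant here is invented; it is not code from either kernel):

    /* Bigger resident set (and therefore bigger swap-in cost) buys a
     * longer guaranteed stay in memory before the process may be
     * suspended again. */
    #define MIN_RESIDENT_SECS 10
    #define SECS_PER_MB        2          /* invented scaling factor */

    static long swap_time_slice(long rss_pages, long page_size)
    {
        long rss_mb = rss_pages * page_size / (1024 * 1024);
        return MIN_RESIDENT_SECS + rss_mb * SECS_PER_MB;
    }

    /* A 2 MB shell would get ~14 seconds before it is eligible again; a
     * 200 MB job would get ~410 seconds, long enough to amortize the cost
     * of reloading its resident set. */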
* Re: on load control / process swapping 2001-05-13 17:22 ` Rik van Riel @ 2001-05-15 6:38 ` Terry Lambert 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Terry Lambert @ 2001-05-15 6:38 UTC (permalink / raw) To: Rik van Riel; +Cc: Matt Dillon, arch, linux-mm, sfkaplan Rik van Riel wrote: > So we should not allow just one single large job to take all > of memory, but we should allow some small jobs in memory too. Historically, this problem is solved with a "working set quota". > If you don't do this very slow swapping, NONE of the big tasks > will have the opportunity to make decent progress and the system > will never get out of thrashing. > > If we simply make the "swap time slices" for larger processes > larger than for smaller processes we: > > 1) have a better chance of the large jobs getting any work done > 2) won't have the large jobs artificially increase memory load, > because all time will be spent removing each other's RSS > 3) can have more small jobs in memory at once, due to 2) > 4) can be better for interactive performance due to 3) > 5) have a better chance of getting out of the overload situation > sooner > > I realise this would make the scheduling algorithm slightly > more complex and I'm not convinced doing this would be worth > it myself, but we may want to do some brainstorming over this ;) A per vnode working set quota with a per use count adjust would resolve most load thrashing issues. Programs with large working sets can either be granted a case by case exception (via rlimit), or, more likely just have their pages thrashed out more often. You only ever need to do this when you have exhausted memory to the point you are swapping, and then only when you want to reap cached clean pages; when all you have left is dirty pages in memory and swap, you are well and truly thrashing -- for the right reason: your system load is too high. It's also relatively easy to implement something like a per vnode working set quota, which can be self-enforced, without making the scheduler so ugly that you will never be able to do things like have per-CPU run queues for a very efficient SMP that deals with the cache locality issue naturally and easily (by merely setting migration policies for moving from one run queue to another, and by threads in a thread group having negative affinity for each other's CPUs, to maximize real concurrency). Psuedo code: IF THRASH_CONDITIONS IF (COPY_ON_WRITE_FAULT OR PAGE_FILL_OF_SBRKED_PAGE_FAULT) IF VNODE_OVER_WORKING_SET_QUOTA STEAL_PAGE_FROM_VNODE_LRU ELSE GET_PAGE_FROM_SYSTEM Obviously, this would work for vnodes that were acting as backing store for programs, just as they would prevent a large mmap() with a traversal from thrashing everyone else's data and code out of core (which is, I think, a much worse and much more common problem). Doing extremely complicated things is only going to get you into trouble... in particular, you don't want to have policy in effect to deal with border load conditions unless you are under those conditions in the first place. The current scheduling algorithms are quite simple, relatively speaking, and it makes much more sense to make the thrasher fight with themselves, rather than them peeing in everyone's pool. I think that badly written programs taking more time, as a result, is not a problem; if it is, it's one I could live with much more easily than cache-busting for no good reason, and slowing well behaved code down. 
You need to penalize the culprit. It's possible to do a more complicated working set quota, which actually applies to a process' working set, instead of to vnodes, out of context with the process, but I think that the vnode approach, particularly when you bump the working set up per each additional opener, using the count I suggested, to ensure proper locality of reference, is good enough to solve the problem. At the very least, the system would not "freeze" with this approach, even if it could later recover. -- Terry -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
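Restating the pseudocode above in C makes the shape a little clearer (all names are invented and the two helpers are left as declarations; this is the proposal as sketched, not an existing implementation):

    struct vm_page;                                /* opaque here */

    struct vnode_ws {
        long resident_pages;    /* pages of this vnode currently in core   */
        long ws_quota;          /* base working set quota for the vnode    */
        int  openers;           /* each additional opener bumps the quota  */
    };

    struct vm_page *steal_page_from_vnode_lru(struct vnode_ws *vp);
    struct vm_page *get_page_from_system(void);

    static struct vm_page *
    fault_alloc_page(struct vnode_ws *backing, int thrash_conditions)
    {
        /* Only bite when the system is already short on memory. */
        if (thrash_conditions &&
            backing->resident_pages > backing->ws_quota * (1 + backing->openers)) {
            /* Over quota: recycle one of this vnode's own LRU pages,
             * so the offender pays for its own faults. */
            return steal_page_from_vnode_lru(backing);
        }
        return get_page_from_system();
    }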
* Re: on load control / process swapping 2001-05-15 6:38 ` Terry Lambert @ 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group 2001-05-15 15:31 ` Rik van Riel 2001-05-15 17:24 ` Matt Dillon 2 siblings, 0 replies; 39+ messages in thread From: Cy Schubert - ITSD Open Systems Group @ 2001-05-15 13:39 UTC (permalink / raw) To: tlambert2; +Cc: Rik van Riel, Matt Dillon, arch, linux-mm, sfkaplan In message <3B00CECF.9A3DEEFA@mindspring.com>, Terry Lambert writes: > Rik van Riel wrote: > > So we should not allow just one single large job to take all > > of memory, but we should allow some small jobs in memory too. > > Historically, this problem is solved with a "working set > quota". > > > If you don't do this very slow swapping, NONE of the big tasks > > will have the opportunity to make decent progress and the system > > will never get out of thrashing. > > > > If we simply make the "swap time slices" for larger processes > > larger than for smaller processes we: > > > > 1) have a better chance of the large jobs getting any work done > > 2) won't have the large jobs artificially increase memory load, > > because all time will be spent removing each other's RSS > > 3) can have more small jobs in memory at once, due to 2) > > 4) can be better for interactive performance due to 3) > > 5) have a better chance of getting out of the overload situation > > sooner > > > > I realise this would make the scheduling algorithm slightly > > more complex and I'm not convinced doing this would be worth > > it myself, but we may want to do some brainstorming over this ;) > > A per vnode working set quota with a per use count adjust > would resolve most load thrashing issues. Programs with > large working sets can either be granted a case by case > exception (via rlimit), or, more likely just have their > pages thrashed out more often. > > You only ever need to do this when you have exhausted > memory to the point you are swapping, and then only when > you want to reap cached clean pages; when all you have > left is dirty pages in memory and swap, you are well and > truly thrashing -- for the right reason: your system load > is too high. An operating system I worked on at one time, MVS, had this feature (not sure whether it still does today). We called it fencing (e.g. fencing an address space). An address space could be limited to the amount of real memory used. Conversely, important address spaces could be given a minimum amount of real memory, e.g. online applications such a CICS. Additionally instead of limiting an address space to a minimum or maximum amount of real memory, an address space could be limited to a maximum paging rate, giving the O/S the option of increasing its real memory to match its WSS, reducing paging of the specified address space to a preset limit. Of course this could have negative impact on other applications running on the system, which is why IBM recommended against using this feature. Regards, Phone: (250)387-8437 Cy Schubert Fax: (250)387-5766 Team Leader, Sun/Alpha Team Internet: Cy.Schubert@osg.gov.bc.ca Open Systems Group, ITSD, ISTA Province of BC -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 6:38 ` Terry Lambert 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group @ 2001-05-15 15:31 ` Rik van Riel 2001-05-15 17:24 ` Matt Dillon 2 siblings, 0 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-15 15:31 UTC (permalink / raw) To: Terry Lambert; +Cc: Matt Dillon, arch, linux-mm, sfkaplan On Mon, 14 May 2001, Terry Lambert wrote: > Rik van Riel wrote: > > So we should not allow just one single large job to take all > > of memory, but we should allow some small jobs in memory too. > > Historically, this problem is solved with a "working set > quota". This is a great idea for when the system is in-between normal loads and real thrashing. It will save small processes while slowing down memory hogs which are taking resources fairly. I'm not convinced it is any replacement for swapping, but it sure is a good way to delay swapping as long as possible. Also, having a working set size guarantee in combination with idle swapping will almost certainly give the proverbial root shell the boost it needs ;) > Doing extremely complicated things is only going to get > you into trouble... in particular, you don't want to > have policy in effect to deal with border load conditions > unless you are under those conditions in the first place. Agreed. > It's possible to do a more complicated working set quota, > which actually applies to a process' working set, instead > of to vnodes, out of context with the process, I guess in FreeBSD a per-vnode approach would be easier to implement while in Linux a per-process working set would be easier... regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 6:38 ` Terry Lambert 2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group 2001-05-15 15:31 ` Rik van Riel @ 2001-05-15 17:24 ` Matt Dillon 2001-05-15 23:55 ` Roger Larsson 2001-05-16 8:23 ` Terry Lambert 2 siblings, 2 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-15 17:24 UTC (permalink / raw) To: Terry Lambert; +Cc: Rik van Riel, arch, linux-mm, sfkaplan :Rik van Riel wrote: :> So we should not allow just one single large job to take all :> of memory, but we should allow some small jobs in memory too. : :Historically, this problem is solved with a "working set :quota". We have a process-wide working set quota. It's called the 'memoryuse' resource. :... :> 5) have a better chance of getting out of the overload situation :> sooner :> :> I realise this would make the scheduling algorithm slightly :> more complex and I'm not convinced doing this would be worth :> it myself, but we may want to do some brainstorming over this ;) : :A per vnode working set quota with a per use count adjust :would resolve most load thrashing issues. Programs with It most certainly would not. Limiting the number of pages you allow to be 'cached' on a vnode by vnode basis would be a disaster. It has absolutely nothing whatsoever to do with thrashing or thrash-management. It would simply be an artificial limitation based on artificial assumptions that are as likely to be wrong as right. If I've learned anything working on the FreeBSD VM system, it's that the number of assumptions you make in regards to what programs do, how they do it, how much data they should be able to cache, and so forth is directly proportional to how badly you fuck up the paging algorithms. I implemented a special page-recycling algorithm in 4.1/4.2 (which is still there in 4.3). Basically it tries predict when it is possible to throw away pages 'behind' a sequentially accessed file, so as not to allow that file to blow away your cache. E.G. if you have 128M of ram and you are sequentially accessing a 200MB file, obviously there is not much point in trying to cache the data as you read it. But being able to predict something like this is extremely difficult. In fact, nearly impossible. And without being able to make the prediction accurately you simply cannot determine how much data you should try to cache before you begin recycling it. I wound up having to change the algorithm to act more like a heuristic -- it does a rough prediction but doesn't hold the system to it, then allows the page priority mechanism to refine the prediction. But it can take several passes (or non-passes) on the file before the page recycling stabilizes. So the jist of the matter is that FreeBSD (1) already has process-wide working set limitations which are activated when the system is under load, and (2) already has a heuristic that attempts to predict when not to cache pages. Actually several heuristics (a number of which were in place in the original CSRG code). -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
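The read-behind recycling heuristic described above boils down to roughly the following (an illustrative sketch of the behaviour, not the 4.x code; the run-length threshold is invented):

    #include <stdbool.h>

    struct seqinfo {
        long last_pindex;      /* previous page index touched in this file  */
        int  seq_run;          /* length of the current sequential run      */
    };

    /*
     * Heuristic, not a hard rule: after a long enough sequential run,
     * pages *behind* the read point are made cheap to recycle instead of
     * being cached aggressively.  Page priority can still rescue them
     * later if the file turns out to be re-read.
     */
    static bool recycle_behind(struct seqinfo *si, long pindex)
    {
        if (pindex == si->last_pindex + 1)
            si->seq_run++;
        else
            si->seq_run = 0;
        si->last_pindex = pindex;
        return si->seq_run > 16;        /* invented threshold */
    }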
* Re: on load control / process swapping 2001-05-15 17:24 ` Matt Dillon @ 2001-05-15 23:55 ` Roger Larsson 2001-05-16 0:16 ` Matt Dillon 2001-05-16 8:23 ` Terry Lambert 1 sibling, 1 reply; 39+ messages in thread From: Roger Larsson @ 2001-05-15 23:55 UTC (permalink / raw) To: Matt Dillon; +Cc: Rik van Riel, arch, linux-mm, sfkaplan On Tuesday 15 May 2001 19:24, Matt Dillon wrote: > I implemented a special page-recycling algorithm in 4.1/4.2 (which is > still there in 4.3). Basically it tries predict when it is possible to > throw away pages 'behind' a sequentially accessed file, so as not to > allow that file to blow away your cache. E.G. if you have 128M of ram > and you are sequentially accessing a 200MB file, obviously there is > not much point in trying to cache the data as you read it. > > But being able to predict something like this is extremely difficult. > In fact, nearly impossible. And without being able to make the > prediction accurately you simply cannot determine how much data you > should try to cache before you begin recycling it. I wound up having > to change the algorithm to act more like a heuristic -- it does a rough > prediction but doesn't hold the system to it, then allows the page > priority mechanism to refine the prediction. But it can take several > passes (or non-passes) on the file before the page recycling > stabilizes. > Are the heuristics persistent? Or will the first use after boot use the rough prediction? For how long time will the heuristic stick? Suppose it is suddenly used in a slightly different way. Like two sequential readers instead of one... /RogerL -- Roger Larsson Skelleftea Sweden -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 23:55 ` Roger Larsson @ 2001-05-16 0:16 ` Matt Dillon 0 siblings, 0 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-16 0:16 UTC (permalink / raw) To: Roger Larsson; +Cc: Rik van Riel, arch, linux-mm, sfkaplan :Are the heuristics persistent? :Or will the first use after boot use the rough prediction? :For how long time will the heuristic stick? Suppose it is suddenly used in :a slightly different way. Like two sequential readers instead of one... : :/RogerL :Roger Larsson :Skelleftea :Sweden It's based on the VM page cache, so its adaptive over time. I wouldn't call it persistent, it is nothing more then a simple heuristic that 'normally' throws a page away but 'sometimes' caches it. In otherwords, you lose some performance on the frontend in order to gain some later on. If you loop through a file enough times, most of the file winds up getting cached. It's still experimental so it is only lightly tied into the system. It seems to work, though, so at some point in the future I'll probably try to put some significant prediction in. But as I said, it's a very difficult thing to predict. You can't just put your foot down and say 'I'll cache X amount of file Y'. That doesn't work at all. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-15 17:24 ` Matt Dillon 2001-05-15 23:55 ` Roger Larsson @ 2001-05-16 8:23 ` Terry Lambert 2001-05-16 17:26 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Terry Lambert @ 2001-05-16 8:23 UTC (permalink / raw) To: Matt Dillon; +Cc: Rik van Riel, arch, linux-mm, sfkaplan Matt Dillon wrote: > :> So we should not allow just one single large job to take all > :> of memory, but we should allow some small jobs in memory too. > : > :Historically, this problem is solved with a "working set > :quota". > > We have a process-wide working set quota. It's called > the 'memoryuse' resource. It's not terrifically useful for limiting pageout as a result of excessive demand pagein operations. > :A per vnode working set quota with a per use count adjust > :would resolve most load thrashing issues. Programs with > > It most certainly would not. Limiting the number of pages > you allow to be 'cached' on a vnode by vnode basis would > be a disaster. I don't know whether to believe you, or Dave Cutler... 8-). > It has absolutely nothing whatsoever to do with thrashing > or thrash-management. It would simply be an artificial > limitation based on artificial assumptions that are as > likely to be wrong as right. I have a lot of problems with most of FreeBSD's anti-thrash "protection"; I don't think many people are really running it at a very high load. I think a lot of the "administrative limits" are stupid; in particular, I think it's really dumb to have 70% free resources, and yet enforce administrative limits as if all machines were shell account servers at an ISP where the customers are just waiting for the operators to turn their heads for a second so they can run 10,000 IRC "bots". I also have a problem with the preallocation of contiguous pageable regions of real memory via zalloci() in order to support inpcb and tcpcb structures, which inherently mean that I have to statically preallocate structures for IPs, TCP structures, and sockets, as well as things like file descriptors. In other words, I have to guess the future characteristics of my load, rather than having the OS do the best it can in any given situation. Not to mention the allocation of an entire mbuf per socket. > If I've learned anything working on the FreeBSD VM > system, it's that the number of assumptions you make > in regards to what programs do, how they do it, how > much data they should be able to cache, and so forth > is directly proportional to how badly you fuck up the > paging algorithms. I've personally experienced thrash from a moronic method of implementing "ld", which mmap's all the .o files, and then seeks all over heck, randomly, in order to perform the actual link. It makes that specific operation very fast, at the expense of the rest of the system. The result of this is that everything else on the system gets thrashed out of core, including the X server, and the very simple and intuitive "move mouse, wiggle cursor" breaks, which then breaks the entire paradigm. FreeBSD is succeptible to this problem. So was SVR4 UNIX. The way SVR4 "repaired" the problem was to invent a new scheduling class, "fixed", which would guarantee time slices to the X server. Thus, as fast as "ld" thrashed pages it wasn't interested in out, "X" thrashed them back in. The interactive experience was degraded by the excessive paging. 
I implemented a different approach in UnixWare 2.x; it didn't end up making it into the main UnixWare source tree (I was barely able to get my /procfs based rfork() into the thing, with the help of some good engineers from NJ); but it was a per vnode working set quota approach. It operated in much the way I described, and it fixed the problem: the only program that got thrashed by "ld" was "ld": everything else on the system had LRU pages present when the needed to run. The "ld" program wasn't affected itself until you started running low on buffer cache. IMO, anything that results in the majority of programs remaining reasonably runnable, and penalizes only the programs making life hell for everyone else, and only kicks in when life is truly starting to go to hell, is a good approach. I really don't care that I got the idea from Dave Cutler's work in VMS, instead of arriving at it on my own (those the per-vnode nature of mine is, I think, an historically unique approach). > I implemented a special page-recycling algorithm in > 4.1/4.2 (which is still there in 4.3). Basically it > tries predict when it is possible to throw away pages > 'behind' a sequentially accessed file, so as not to > allow that file to blow away your cache. E.G. if you > have 128M of ram and you are sequentially accessing a > 200MB file, obviously there is not much point in trying > to cache the data as you read it. IMO, the ability to stream data like this is why Sun, in Solaris 2.8, felt the need to "invent" seperate VM and buffer caches once again -- "everything old is new again". Also, IMO, I feel that the rationale used to justify this decision was poorly defended, and that there are much better implementations one could have -- including simple red queueing for large data sets. It was a cop out on their part, having to do with not setting up simple high and low water marks to keep things like a particular FS or networking subsystem from monopolizing memory. Instead, they now have this artificial divide, where under typical workloads, one pool lies largely fallow (which one depends on the server role). I guess that's not a problem, if your primary after market marked up revenue generation sale item is DRAM... If the code you are referring to is the code that I think it is, I don't think it's useful, except for something like a web server with large objects to serve. Even then, discarding the entire concept of locality of reference when you notice sequential access seems bogus. Realize that average web server service objects are on the order of 10k, not 200M. Realize also the _absolutely disasterous_ effect that code kicking in would have on, for example, an FTP server immediately after the release of FreeBSD ISO images to the net. You would basically not cache that data which is your primary hottest content -- turning virtually assured cache hits into cache misses. > But being able to predict something like this is > extremely difficult. In fact, nearly impossible. I would say that it could be reduced to a stochiastic and iterative process, but (see above), that it would be a terrible idea for all but something like a popular MP3 server... even then, you start discarding useful data under burst loads, and we're back to cache missing. > And without being able to make the prediction > accurately you simply cannot determine how much data > you should try to cache before you begin recycling it. I should think that would be obvious: nearly everything you can, based on locality and number of concurrent references. 
It's only when you attempt prefetch that it actually becomes complicated; deciding to throw away a clean page later instead of _now_ costs you practically nothing. > So the jist of the matter is that FreeBSD (1) already > has process-wide working set limitations which are > activated when the system is under load, They are largely useless, since they are also active even when the system is not under load, so they act as preemptive drags on performance. They are also (as was pointed out in an earlier thread) _not_ applied to mmap() and other regions, so they are easily subverted. > and (2) already has a heuristic that attempts to predict > when not to cache pages. Actually several heuristics (a > number of which were in place in the original CSRG code). I would argue that the CPU vs. memory vs. disk speed pendulum is moving back the other way, and that it's time to reconsider these algorithms once again. If it's done correctly, they would be adaptive based on knowing the data rate for each given subsystem. We have gigabit NICs these days, which can fully monopolize a PCI bus very easily with few cards -- doing noting but network I/O at burst rate on a 66MHz 64 bit PCI bus, thing max out at 4 cards -- and that's if you can get them to transfer the data directly to each other, with no host intervention being required, which you can't. The fastest memory bus I've seen in Intel calls hardware is 133MHz; at 64 bits, that's twice as fast as the 64bit 66MHz PCI bus. Disks are pig-slow comparatively; in all cases, they're going to be limited to the I/O bus speed anyway, and as rotational speeds have gone up, seek latency has failed to keep pace. Most fast IDE ("multimedia") drives still turn off thermal recalibration in order to keep streaming. I think you need to stress a system -- really stress it, so that you are hitting some hardware limit because of the way FreeBSD uses the hardware -- in order to understand where the real problems in FreeBSD lie. Otherwise, it's just like profiling a program over a tiny workload: the actual cost of servicing real work get lost in the costs associated with initialization. It's pretty obvious from some of the recent bugs I've run into that no one has attempted to open more than 32767 sockets in a production environment using a FreeBSD system. It's also obvious that no one has attempted to have more than 65535 client connections open on a FreeBSD box. There are similar (obvious in retrospect) problems in the routing and other code (what is with the alias requirement for a 255.255.255.255 netmask, for example? Has no one heard of VLANs, without explicit VLAN code?). The upshot is that things are failing to scale under a number of serious stress loads, and rather than defending the past, we should be looking at fixing the problems. I'm personally very happy to have the Linux geeks interested in covering this territory cooperatively with the FreeBSD geeks. We need to be clever about causing scaling problems, and more clever about fixing them, IMO. -- Terry -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
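For what it's worth, the "four cards" figure is consistent with peak numbers (rough arithmetic, ignoring bus and protocol overhead): a 64-bit, 66 MHz PCI bus moves at most 66e6 x 8 bytes, about 528 MB/s, while one gigabit NIC streams about 125 MB/s in a single direction; four cards at line rate therefore account for roughly 500 MB/s, essentially the whole bus.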
* Re: on load control / process swapping 2001-05-16 8:23 ` Terry Lambert @ 2001-05-16 17:26 ` Matt Dillon 0 siblings, 0 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-16 17:26 UTC (permalink / raw) To: Terry Lambert; +Cc: Rik van Riel, arch, linux-mm, sfkaplan :I think a lot of the "administrative limits" are stupid; :in particular, I think it's really dumb to have 70% free :resources, and yet enforce administrative limits as if all :... The 'memoryuse' resource limit is not enforced unless the system is under memory pressure. :... :> And without being able to make the prediction :> accurately you simply cannot determine how much data :> you should try to cache before you begin recycling it. : :I should think that would be obvious: nearly everything :you can, based on locality and number of concurrent :references. It's only when you attempt prefetch that it :actually becomes complicated; deciding to throw away a :clean page later instead of _now_ costs you practically :nothing. :... Prefetching has nothing to do with what we've been talking about. We don't have a problem caching prefetched pages that aren't used. The problem we have is determining when to throw away data once it has been used by a program. :... :> So the jist of the matter is that FreeBSD (1) already :> has process-wide working set limitations which are :> activated when the system is under load, : :They are largely useless, since they are also active even :when the system is not under load, so they act as preemptive :... This is not true. Who told you this? This is absolutely not true. :drags on performance. They are also (as was pointed out in :an earlier thread) _not_ applied to mmap() and other regions, :so they are easily subverted. :... : :-- Terry : This is not true. The 'memoryuse' limit applies to all in-core pages associated with the process, whether mmap()'d or not. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-07 22:50 ` Matt Dillon 2001-05-07 23:35 ` Rik van Riel @ 2001-05-08 20:52 ` Kirk McKusick 2001-05-09 0:18 ` Matt Dillon 1 sibling, 1 reply; 39+ messages in thread From: Kirk McKusick @ 2001-05-08 20:52 UTC (permalink / raw) To: Matt Dillon; +Cc: Rik van Riel, arch, linux-mm, sfkaplan I know that FreeBSD will swap out sleeping processes, but will it ever swap out running processes? The old BSD VM system would do so (we called it hard swapping). It is possible to get a set of running processes that simply do not all fit in memory, and the only way for them to make forward progress is to cycle them through memory. As to the size issue, we used to be biased towards the processes with large resident set sizes in kicking things out. In general, swapping out small things does not buy you much memory and it annoys more users. To avoid picking on the biggest, each time we needed to kick something out, we would find the five biggest, and kick out the one that had been memory resident the longest. The effect is to go round-robin among the big processes. Note that this algorithm allows you to kick out shells, if they are the biggest processes. Also note that this is a last ditch algorithm used only after there are no more idle processes available to kick out. Our decision that we had had to kick out running processes was: (1) no idle processes available to swap, (2) load average over one (if there is just one process, kicking it out does not solve the problem :-), (3) paging rate above a specified threshhold over the entire previous 30 seconds (e.g., been bad for a long time and not getting better in the short term), and (4) paging rate to/from swap area using more than half the available disk bandwidth (if your filesystems are on the same disk as you swap areas, you can get a false sense of success because all your process stop paging while they are blocked waiting for their file data. Kirk -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
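The old hard-swapping policy described above maps onto a fairly small victim-selection routine. A sketch of it as described (illustrative, not the historical 4BSD code):

    #include <stddef.h>

    struct proc {
        long  rssize;          /* resident set size               */
        long  swtime;          /* seconds since last brought in   */
        struct proc *next;
    };

    /* Among the five biggest resident processes, evict the one that has
     * been memory resident the longest -- round-robin among the big ones. */
    static struct proc *pick_hardswap_victim(struct proc *allproc)
    {
        struct proc *big[5] = { 0 };

        for (struct proc *p = allproc; p != NULL; p = p->next) {
            /* keep the five largest by rssize (simple insertion) */
            for (int i = 0; i < 5; i++) {
                if (big[i] == NULL || p->rssize > big[i]->rssize) {
                    for (int j = 4; j > i; j--)
                        big[j] = big[j - 1];
                    big[i] = p;
                    break;
                }
            }
        }

        struct proc *victim = NULL;
        for (int i = 0; i < 5; i++)
            if (big[i] && (!victim || big[i]->swtime > victim->swtime))
                victim = big[i];
        return victim;
    }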
* Re: on load control / process swapping 2001-05-08 20:52 ` Kirk McKusick @ 2001-05-09 0:18 ` Matt Dillon 2001-05-09 2:07 ` Peter Jeremy 2001-05-12 14:28 ` Rik van Riel 0 siblings, 2 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-09 0:18 UTC (permalink / raw) To: Kirk McKusick; +Cc: Rik van Riel, arch, linux-mm, sfkaplan I looked at the code fairly carefully last night... it doesn't swap out running processes and it also does not appear to swap out processes blocked in a page-fault (on I/O). Now, of course we can't swap a process out right then (it might be holding locks), but I think it would be beneficial to be able to mark the process as 'requesting a swapout on return to user mode' or something like that. At the moment what gets picked for swapping is hit-or-miss due to the wait states. :As to the size issue, we used to be biased towards the processes :with large resident set sizes in kicking things out. In general, :swapping out small things does not buy you much memory and it The VM system does enforce the 'memoryuse' resource limit when the memory load gets heavy. But once the load goes beyond that the VM system doesn't appear to care how big the process is. :... :biggest processes. Also note that this is a last ditch algorithm :used only after there are no more idle processes available to :kick out. Our decision that we had had to kick out running :processes was: (1) no idle processes available to swap, (2) load :average over one (if there is just one process, kicking it out :does not solve the problem :-), (3) paging rate above a specified :threshhold over the entire previous 30 seconds (e.g., been bad :for a long time and not getting better in the short term), and :(4) paging rate to/from swap area using more than half the :available disk bandwidth (if your filesystems are on the same :disk as you swap areas, you can get a false sense of success :because all your process stop paging while they are blocked :waiting for their file data. : : Kirk I don't think we want to kick out running processes. Thrashing by definition means that many of the processes are stuck in disk-wait, usually from a VM fault, and not running. The other effect of thrashing is, of course, the the cpu idle time goes way up due to all the process stalls. A process that is actually able to run under these circumstances probably has a small run-time footprint (at least for whatever operation it is currently doing), so it should definitely be allowed to continue to run. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-09 0:18 ` Matt Dillon @ 2001-05-09 2:07 ` Peter Jeremy 2001-05-09 19:41 ` Matt Dillon 2001-05-12 14:28 ` Rik van Riel 1 sibling, 1 reply; 39+ messages in thread From: Peter Jeremy @ 2001-05-09 2:07 UTC (permalink / raw) To: Matt Dillon; +Cc: Kirk McKusick, Rik van Riel, arch, linux-mm, sfkaplan On 2001-May-08 17:18:16 -0700, Matt Dillon <dillon@earth.backplane.com> wrote: > I don't think we want to kick out running processes. Thrashing > by definition means that many of the processes are stuck in > disk-wait, usually from a VM fault, and not running. The other > effect of thrashing is, of course, the the cpu idle time goes way > up due to all the process stalls. A process that is actually able > to run under these circumstances probably has a small run-time footprint > (at least for whatever operation it is currently doing), so it should > definitely be allowed to continue to run. I don't think this follows. A program that does something like: { extern char memory[BIG_NUMBER]; int i; for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE) memory[i]++; } will thrash nicely (assuming BIG_NUMBER is large compared to the currently available physical memory). Occasionally, it will be runnable - at which stage it has a footprint of only two pages, but after executing a couple of instructions, it'll have another page fault. Old pages will remain resident for some time before they age enough to be paged out. If the VM system is stressed, swapping this process out completely would seem to be a win. Whilst this code is artificial, a process managing a very large hash table will have similar behaviour. Given that most (all?) recent CPU's have cheap hi-resolution clocks, would it be worthwhile for the VM system to maintain a per-process page fault rate? (average clock cycles before a process faults). If you ignore spikes due to process initialisation etc, a process that faults very quickly after being given the CPU wants a working set size that is larger than the VM system currently allows. The fault rate would seem to be proportional to the ratio between the wanted WSS and allowed RSS. This would seem to be a useful parameter to help decide which process to swap out - in an ideal world the VM subsystem would swap processes to keep the WSS of all in-core processes at about the size of non-kernel RAM. Peter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
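The bookkeeping proposed above is cheap; a sketch of what it might look like (illustrative; the cycle-counter helper is only declared and the averaging weight is invented):

    #include <stdint.h>

    struct fault_rate {
        uint64_t last_fault_tsc;   /* cycle count at the previous fault      */
        uint64_t avg_run_cycles;   /* EWMA of cycles run between faults      */
    };

    uint64_t read_cycle_counter(void);     /* rdtsc or equivalent, not shown */

    /* Called from the page fault path: a persistently small average means
     * the process faults almost as soon as it gets the CPU, i.e. its wanted
     * working set is larger than the RSS it is being allowed. */
    static void note_fault(struct fault_rate *fr)
    {
        uint64_t now = read_cycle_counter();
        uint64_t ran = now - fr->last_fault_tsc;

        fr->last_fault_tsc = now;
        fr->avg_run_cycles = (fr->avg_run_cycles * 7 + ran) / 8;
    }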
* Re: on load control / process swapping 2001-05-09 2:07 ` Peter Jeremy @ 2001-05-09 19:41 ` Matt Dillon 0 siblings, 0 replies; 39+ messages in thread From: Matt Dillon @ 2001-05-09 19:41 UTC (permalink / raw) To: Peter Jeremy; +Cc: Kirk McKusick, Rik van Riel, arch, linux-mm, sfkaplan :I don't think this follows. A program that does something like: :{ : extern char memory[BIG_NUMBER]; : int i; : : for (i = 0; i < BIG_NUMBER; i += PAGE_SIZE) : memory[i]++; :} :will thrash nicely (assuming BIG_NUMBER is large compared to the :currently available physical memory). Occasionally, it will be :runnable - at which stage it has a footprint of only two pages, but Why only two pages? It looks to me like the footprint is BIG_NUMBER bytes. :after executing a couple of instructions, it'll have another page :fault. Old pages will remain resident for some time before they age :enough to be paged out. If the VM system is stressed, swapping this :process out completely would seem to be a win. Not exactly. Page aging works both ways. Just accessing a page once does not give it priority over everything else in the page queues. :... :you ignore spikes due to process initialisation etc, a process that :faults very quickly after being given the CPU wants a working set size :that is larger than the VM system currently allows. The fault rate :would seem to be proportional to the ratio between the wanted WSS and :allowed RSS. This would seem to be a useful parameter to help decide :which process to swap out - in an ideal world the VM subsystem would :swap processes to keep the WSS of all in-core processes at about the :size of non-kernel RAM. : :Peter Fault rate isn't useful -- maybe faults that require large disk seeks would be useful, but just counting the faults themselves is not useful. -Matt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: on load control / process swapping 2001-05-09 0:18 ` Matt Dillon 2001-05-09 2:07 ` Peter Jeremy @ 2001-05-12 14:28 ` Rik van Riel 1 sibling, 0 replies; 39+ messages in thread From: Rik van Riel @ 2001-05-12 14:28 UTC (permalink / raw) To: Matt Dillon; +Cc: Kirk McKusick, arch, linux-mm, sfkaplan On Tue, 8 May 2001, Matt Dillon wrote: > :I know that FreeBSD will swap out sleeping processes, but will it > :ever swap out running processes? The old BSD VM system would do so > :(we called it hard swapping). It is possible to get a set of running > :processes that simply do not all fit in memory, and the only way > :for them to make forward progress is to cycle them through memory. > > I looked at the code fairly carefully last night... it doesn't > swap out running processes and it also does not appear to swap > out processes blocked in a page-fault (on I/O). Now, of course > we can't swap a process out right then (it might be holding locks), > but I think it would be beneficial to be able to mark the process > as 'requesting a swapout on return to user mode' or something > like that. In the (still very rough) swapping code for Linux I simply do this as "swapout on next pagefault". The idea behind that is: 1) it's easy, at a page fault we know we can suspend the process 2) if we're thrashing, we want every process to make as much progress as possible before it's suspended (swapped out), letting the process run until the next page fault means we will never suspend a process while it's still able to make progress 3) thrashing should be a rare situation, so you don't want to complicate fast-path code like "return to userspace"; instead we make sure to have as little impact on the rest of the kernel code as possible regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 39+ messages in thread
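In fault-handler terms the approach above is just a flag check at a point where suspension is already safe (an illustrative sketch of the idea, not the actual patch):

    struct task {
        int swapout_pending;        /* set by the load control code */
    };

    void suspend_and_swap_out(struct task *t);     /* sleeps until resumed */
    int  handle_mm_fault(struct task *t, unsigned long address);

    int page_fault(struct task *t, unsigned long address)
    {
        /* At a page fault the task holds no locks we care about and cannot
         * make further progress anyway, so a pending swap-out can be
         * honoured here without touching any fast path. */
        if (t->swapout_pending)
            suspend_and_swap_out(t);
        return handle_mm_fault(t, address);
    }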
* Re: on load control / process swapping 2001-05-07 21:16 Rik van Riel 2001-05-07 22:50 ` Matt Dillon @ 2001-05-08 12:25 ` Scott F. Kaplan 1 sibling, 0 replies; 39+ messages in thread From: Scott F. Kaplan @ 2001-05-08 12:25 UTC (permalink / raw) To: linux-mm -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Okay, in responding to this topic, I will issue a warning: I'm looking at this from an academic point of view, and probably won't give as much attention to what is reasonable to engineer as some people might like. That said, I think I might have some useful thoughts...y'all can be the judge of that. On Mon, 7 May 2001, Rik van Riel wrote: > In short, the process suspension / wake up code only does > load control in the sense that system load is reduced, but > absolutely no effort is made to ensure that individual > programs can run without thrashing. This, of course, kind of > defeats the purpose of doing load control in the first place. First, I agree -- To suspend a process without any calculation that will indicate that the suspension will reduce the page fault rate is to operate blindly. Performing such a calculation, though, requires some information about the locality characteristics of each process, based on recent reference behavior. What would be really nice is some indication as to how much additional space would reduce paging for each of the processes that will remain active. For some, a little extra space won't help much, and for others, a little extra space is just what it needs for a significant reduction. Determining which processes are which, and just how much "a little extra" needs to be, seems important in this context. Second, a nit pick: We're using the term "thrashing" in so many ways that it would be nice to standardize on something so that we understand one another. As I understand it, the textbook definition of thrashing is the point at which CPU utilization falls because all active processes are I/O bound. That is, thrashing is a system-wide characteristic, and not applicable to individual processes. That's why some people have pointed out that "thrashing" and "heavy paging" aren't the same thing. A single process can cause heavy paging while the CPU is still fully loaded with the work of other processes. So, given the paragraph above, are you talking a single process that may still be paging heavily, in spite of the additional free space created by process suspension? (Like I said, it was a nit pick.) I'm assuming that's what you mean. > Any solution will have to address the following points: > > 1) allow the resident processes to stay resident long > enough to make progess Seems reasonable. > 2) make sure the resident processes aren't thrashing, > that is, don't let new processes back in memory if > none of the currently resident processes is "ready" > to be suspended What does it mean to be ready to be suspended? I'm confused by this one. > 3) have a mechanism to detect thrashing in a VM > subsystem which isn't rate-limited (hard?) What's your definition of "thrashing" here? If it's the system-wide version, detection of this situation doesn't seem to be too difficult: When all processes are stalled on page faults, and that situation obtains over time recently, then the system is thrashing. Detecting whether or not a single process is thrashing (paging hopelessly) is a different matter. 
Detecting whether or not a single process is thrashing (paging
hopelessly) is a different matter. You could deactivate this process
(or some other, in the hope of helping this one), but it could be the
case that a reallocation of space would stop this process from paging
so heavily while not substantially increasing the paging rate of any
other process.

> and, for extra brownie points:
>
> 4) fairness, small processes can be paged in and out
>    faster, so we can suspend&resume them faster; this
>    has the side effect of leaving the proverbial root
>    shell more usable

I think this point should have greater significance. The very issue
at hand is that fairness and throughput are at odds when there is
contention for memory. The central question (I think) is, "Given
paging sufficiently detrimental to progress, *how* unfair should the
system be in order to restore progress and increase throughput?"
Note that if we want increased throughput, we can easily come up with
a scheme that almost completely throws fairness to the wind, and
we'll get great reductions in total paging and increases in process
throughput. For a time-sharing system, though, there should probably
be a limit to the unfairness.

There has never been a really good solution to this kind of problem,
and there seem to be two important sides to it:

1) Given a level of fairness that you want to maintain, how can you
   keep the paging as low as possible?

2) Given the unfairness you're willing to use, how can you select
   eligible processes intelligently so as to maximize the reduction
   in total paging?

Question 1 is related, and an important problem, but not part of the
issue here. Question 2 seems to be the central question, and a hard
one. I have trouble believing that any solution to Question 2 will
make sense if it does not refer directly to the reference behavior of
both the suspended process and the remaining active processes. I also
have trouble with any solution to Question 2 that doesn't take into
account the cost associated with the deactivation and reactivation
steps. When a process is reactivated, it's going to cause substantial
paging activity, and so it should not be done too frequently. If
you're going to be unfair, then leave the deactivated process out
long enough that the cost of paging it back in will be a small
fraction of the total time spent on the deactivation/reactivation
activities.

I hope these are useful thoughts. Despite all of my complaining here,
I think this problem has been insufficiently addressed for a long
time. The Working Set policy counted on it, but there was never a
study that showed a good strategy for deactivation/reactivation, in
spite of the fact that different choices could significantly affect
the results. I'd like very much to see a solution to this particular
problem.

Scott
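The amortization rule at the end of Scott's message (keep a deactivated process out long enough that reloading it costs only a small fraction of the whole deactivate/reactivate cycle) can be put into rough numbers. Every figure below is made up purely for illustration: the resident-set size, the per-fault service time, and the 5% overhead target are assumptions, not measurements.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed, illustrative numbers only. */
        double resident_set_pages = 2000.0;  /* pages to fault back in      */
        double fault_service_ms   = 8.0;     /* avg disk service per fault  */
        double overhead_fraction  = 0.05;    /* tolerate 5% reload overhead */

        double reload_cost_ms = resident_set_pages * fault_service_ms;

        /* overhead_fraction = reload / (suspension + reload)
         * =>  suspension = reload * (1 - f) / f                     */
        double min_suspension_ms =
            reload_cost_ms * (1.0 - overhead_fraction) / overhead_fraction;

        printf("reload cost        : %.0f ms\n", reload_cost_ms);
        printf("minimum suspension : %.0f ms (~%.0f s)\n",
               min_suspension_ms, min_suspension_ms / 1000.0);
        return 0;
    }

With these invented numbers, a process whose resident set takes about 16 seconds to fault back in should stay suspended for roughly five minutes before the reload cost drops to 5% of the cycle, which is one way to see why deactivation decisions should not flip back and forth quickly.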
Thread overview: 39+ messages
2001-05-16 15:17 on load control / process swapping Charles Randall
2001-05-16 17:14 ` Matt Dillon
2001-05-16 17:41 ` Rik van Riel
2001-05-16 17:54 ` Matt Dillon
2001-05-16 19:59 ` Rik van Riel
2001-05-16 20:41 ` Matt Dillon
2001-05-18 5:58 ` Terry Lambert
2001-05-18 6:20 ` Matt Dillon
2001-05-18 10:00 ` Andrew Reilly
2001-05-18 13:49 ` Jonathan Morton
2001-05-19 2:18 ` Rik van Riel
2001-05-19 2:56 ` Jonathan Morton
2001-05-16 17:57 ` Alfred Perlstein
2001-05-16 18:01 ` Matt Dillon
2001-05-16 18:10 ` Alfred Perlstein
[not found] <OF5A705983.9566DA96-ON86256A50.00630512@hou.us.ray.com>
2001-05-18 20:13 ` Jonathan Morton
-- strict thread matches above, loose matches on Subject: below --
2001-05-07 21:16 Rik van Riel
2001-05-07 22:50 ` Matt Dillon
2001-05-07 23:35 ` Rik van Riel
2001-05-08 0:56 ` Matt Dillon
2001-05-12 14:23 ` Rik van Riel
2001-05-12 17:21 ` Matt Dillon
2001-05-12 21:17 ` Rik van Riel
2001-05-12 23:58 ` Matt Dillon
2001-05-13 17:22 ` Rik van Riel
2001-05-15 6:38 ` Terry Lambert
2001-05-15 13:39 ` Cy Schubert - ITSD Open Systems Group
2001-05-15 15:31 ` Rik van Riel
2001-05-15 17:24 ` Matt Dillon
2001-05-15 23:55 ` Roger Larsson
2001-05-16 0:16 ` Matt Dillon
2001-05-16 8:23 ` Terry Lambert
2001-05-16 17:26 ` Matt Dillon
2001-05-08 20:52 ` Kirk McKusick
2001-05-09 0:18 ` Matt Dillon
2001-05-09 2:07 ` Peter Jeremy
2001-05-09 19:41 ` Matt Dillon
2001-05-12 14:28 ` Rik van Riel
2001-05-08 12:25 ` Scott F. Kaplan