* Two naive questions and a suggestion
From: jfm2 @ 1998-11-19 0:20 UTC
To: linux-mm

1) Is there any text describing memory management in 2.1? (Forgive me
if I missed an obvious URL.)

2) Are there plans for implementing the swapping of whole processes a
la BSD?

Suggestion: given that the requirements for a workstation (quick
response) are different from those for a server (high throughput), it
could make sense to let the user either select the VM policy through
/proc, or have a form of loadable VM manager. Or select it at compile
time.

-- 
Jean Francois Martinez

-- 
This is a majordomo managed list. To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org
* Re: Two naive questions and a suggestion
From: Rik van Riel @ 1998-11-19 20:05 UTC
To: jfm2; +Cc: linux-mm

On 19 Nov 1998 jfm2@club-internet.fr wrote:

> 1) Is there any text describing memory management in 2.1? (Forgive me
> if I missed an obvious URL.)

Not yet; I really should be working on that (the code
seems to have stabilized now)...

> 2) Are there plans for implementing the swapping of whole processes a
> la BSD?

Yes, there are plans. The plans are quite detailed too, but
I think I haven't put them up on my home page yet.

> Suggestion: given that the requirements for a workstation (quick
> response) are different from those for a server (high throughput), it
> could make sense to let the user either select the VM policy through
> /proc, or have a form of loadable VM manager. Or select it at
> compile time.

There are quite a lot of things you can tune in /proc. I don't know
if you have read the documentation, but if you start trying things
you'll be amazed how much you can change the system's behaviour with
the existing controls.

Btw, since you are so enthusiastic about documentation,
would you be willing to help me write it?

cheers,

Rik
-- 
slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.       H.H.vanRiel@phys.uu.nl  |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-20 1:25 UTC
To: H.H.vanRiel; +Cc: linux-mm

> > 2) Are there plans for implementing the swapping of whole processes a
> > la BSD?
>
> Yes, there are plans. The plans are quite detailed too, but
> I think I haven't put them up on my home page yet.

This will close the gap between Linux and the *BSDs at high loads. It
will also close the mouths of some BSD people who talk loudly about
the areas where BSD is superior and carefully forget SMP, ELF or
modules, to name just a few areas where Linux got there first.

> There are quite a lot of things you can tune in /proc. I don't know
> if you have read the documentation, but if you start trying things
> you'll be amazed how much you can change the system's behaviour with
> the existing controls.

I have read a bit about them, but sometimes changing the algorithm is
the right thing to do.

> Btw, since you are so enthusiastic about documentation,
> would you be willing to help me write it?

I could try to help you, but it will be limited help. I already work
on a "Linux for normal people" project, and I also wanted to write an
article about optimizing a Linux box. The goal is to smash the myth
about kernel compiling.

Why? Because in 95 my brother-in-law needed a computer for his thesis
in Spanish literature. I remembered kernel compiling and I led him to
Apple Expo. That day one thing was clear: Linux will never reach world
domination as long as literature professors cannot use it, and as long
as kernel compiling is necessary or even recommended, Linux will be
off limits for literature professors.

So I scanned the source code in 2.0.34 and found the insignificant
differences between code compiled for Pentiums and for 386s. Then I
compiled the Byte benchmark twice, once with the compile flags used
for the 386 kernels and once with the ones for Pentiums and PPros.
The difference in speed was under 2%, both on a real Pentium and on a
K6. So much for "it will allow you to tune to the processor".

About memory savings: first of all, in 98 distributors shipping
crippled kernels should be shot; modular 2.0 has been around for over
two years. Modularity has also reduced the memory savings you get
from recompiling the kernel (if the distributor did a good job),
while machines have grown: over 1.5 Megs saved on an 8 Meg box was
significant (1.2.13 in 95); 500K on a 32 Meg box is a trifle (2.0 in
98). This is not entirely true: you can write pathological programs
where a single page means the difference between blinding speed and
hours of swapping. Also, the significant number is the increase in
the memory you lack: a 500K deficit becoming 1 Meg. Consider disk
bandwidth too: being 16 Megs short on a 32 Meg box is much worse than
being 4 Megs short on an 8 Meg box, because you need much more time
to push 16 Megs to the disk. On the other hand, processes will spend
more time analyzing a big array on a big box than a small array on a
small box (processor speed being equal), and this works in favour of
the 32 Meg box that is 16 Megs short, for normal, non-pathological
programs.

Finally there is the question of probability: 500K is under 2% on a
32 Meg box, so when programs need more memory than you have, there is
a good chance they miss the mark by 20 or 30% and rarely land
squarely in that 500K zone.

This needs refining, but indulge me: I am writing at 2am. A (not to
be published) conclusion is: "Kernel compiling is a thing performed
only by idiots and kernel hackers". I am not a kernel hacker and I
have performed over two hundred of them. :-)

Perhaps we could help one another with our docs/articles.

-- 
Jean Francois Martinez

Project Independence: Linux for the Masses
http://www.independence.seul.org
* Re: Two naive questions and a suggestion
From: Eric W. Biederman @ 1998-11-20 15:31 UTC
To: jfm2; +Cc: H.H.vanRiel, linux-mm

>>>>> "jfm2" == jfm2 <jfm2@club-internet.fr> writes:

jfm2> A (not to be published) conclusion is: "Kernel compiling is a thing
jfm2> performed only by idiots and kernel hackers". I am not a kernel
jfm2> hacker and I have performed over two hundred of them. :-)

No. As far as functionality goes, I don't trust a Linux box that
doesn't have its standard hardware drivers (comm port, floppy disk,
etc.) compiled in. A modular kernel seems to work well for protocol
layers, however.

An important advantage of Linux is what you can do when something
isn't working automatically.

With Windows you have 2 possibilities:
1) Something works automatically.
2) Something doesn't work.

With Linux you have 3 possibilities:
1) Something works automatically. (We need more in this category.)
2) Something can, with research and looking around, be made to work.
   (The ability to compile a kernel is an advantage here.)
3) Something doesn't work. (Linux has much less in this category than
   any other OS.)

The memory management system has tuning parameters, but generally
resorting to them is dropping down to case 2. In most cases, with 2.0
and probably also with 2.2, the memory management system should be a
case of "it works automatically".

Eric
* Re: Two naive questions and a suggestion
From: Stephen C. Tweedie @ 1998-11-23 18:08 UTC
To: jfm2; +Cc: linux-mm, Stephen Tweedie

Hi,

On 19 Nov 1998 00:20:37 -0000, jfm2@club-internet.fr said:

> 1) Is there any text describing memory management in 2.1? (Forgive me
> if I missed an obvious URL.)

The source code. :)

> 2) Are there plans for implementing the swapping of whole processes a
> la BSD?

Not exactly, but there are substantial plans for other related changes.
In particular, most of the benefits of BSD-style swapping can be
achieved through swapping of page tables, dynamic RSS limits and
streaming swapout, all of which are on the slate for 2.3.

--Stephen
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-23 20:45 UTC
To: sct; +Cc: linux-mm
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-23 21:59 UTC
To: sct; +Cc: linux-mm

> > 1) Is there any text describing memory management in 2.1? (Forgive me
> > if I missed an obvious URL.)
>
> The source code. :)

I knew about it. :) And this is not a URL. :)

> > 2) Are there plans for implementing the swapping of whole processes a
> > la BSD?
>
> Not exactly, but there are substantial plans for other related changes.
> In particular, most of the benefits of BSD-style swapping can be
> achieved through swapping of page tables, dynamic RSS limits and
> streaming swapout, all of which are on the slate for 2.3.

The problem is: will you be able to manage the following situation?

Two processes running in an 8 Meg box. Both will page fault every ms
if you give them 4 Megs (they are scanning large arrays, so no
locality), and a page fault takes 20 ms to handle. That means only 5%
of the CPU time is used; the remainder is spent waiting for pages to
be brought in from disk or pushing pages of the other process out of
memory. And both of these processes would run like hell (no page
faults) given 6 Megs of memory.

The only solution I see is to stop one of them (short of adding
memory :) and let the other one make some progress. That is swapping.
Of course swapping can be undesirable on workstations, and that is
the reason I suggested user control over the MM policy, be it by
recompiling, by /proc or by module insertion.

In 96 I asked for that same feature, gave the same example (same
numbers :-), and Alan Cox agreed but told me Linux was not used under
heavy loads. That means we are in a catch-22 situation: Linux is not
used for heavy loads because it does not handle them well, and the
necessary features are not implemented because it is not used in such
situations.

And now that we are at it: in 2.0 I found a daemon can be killed by
the system if it runs out of VM. The problem is: it was a normal user
process that had allocated most of it, and in addition that daemon
could be important enough that it is better to kill anything else, so
it would be useful to give some privilege to root processes here.

I think this ends my Christmas wish list. :)

-- 
Jean Francois Martinez

Project Independence: Linux for the Masses
http://www.independence.seul.org
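[Editor's note: the arithmetic in the thrashing scenario above can be made
explicit. A minimal sketch; the 1 ms compute / 20 ms fault-service figures
are the numbers from the message, not measurements:]

```python
def cpu_utilization(compute_ms_per_fault: float, fault_service_ms: float) -> float:
    """Fraction of wall-clock time spent doing useful work when every
    compute_ms_per_fault ms of CPU work triggers a page fault that takes
    fault_service_ms ms to service (disk wait, eviction of the other
    process's pages)."""
    return compute_ms_per_fault / (compute_ms_per_fault + fault_service_ms)

# 1 ms of work per 20 ms fault: 1 / (1 + 20), i.e. roughly the 5% claimed.
print(round(cpu_utilization(1, 20) * 100, 1))  # -> 4.8
```

With 6 Megs per process there are no faults at all, so utilization jumps
from ~5% to ~100%: the cliff the message is describing.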
* Re: Two naive questions and a suggestion
From: Vladimir Dergachev @ 1998-11-24 1:21 UTC
To: jfm2; +Cc: sct, linux-mm

On 23 Nov 1998 jfm2@club-internet.fr wrote:

> Two processes running in an 8 Meg box. Both will page fault every ms
> if you give them 4 Megs (they are scanning large arrays, so no
> locality), and a page fault takes 20 ms to handle. [...]
>
> The only solution I see is to stop one of them (short of adding
> memory :) and let the other one make some progress. That is swapping.

What about this solution: write a small program that monitors (via
/proc) for programs that swap a lot. If this happens, give them "wide
slice" time, i.e. send SIGSTOP to everybody and alternate
SIGCONT/SIGSTOP at intervals of 0.5-1 sec. This would give you the
concurrency you want and will also (mostly) eliminate the swapping.

Since you need this heavy load anyway, I don't think your programs
will complain about long delays; for all they know, this time is
taken up by swapping. And since this is a userspace program, you can
fine tune it as much as you want, all the way to an OOM killer that
pops up a nice box on the terminal and asks what you want to do.

Vladimir Dergachev
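[Editor's note: the coarse "wide slice" scheduler Vladimir describes could
be sketched in userspace roughly as follows. This is a sketch under
assumptions, not his actual program: the selection of thrashing PIDs via
/proc is elided, and the PID list is taken as given.]

```python
import itertools
import os
import signal
import time

def wide_slice(pids, slice_sec=0.5, max_slices=None):
    """Round-robin 'wide time slice' scheduling: at any moment exactly one
    of the given processes is runnable and the rest are SIGSTOPped, so
    their working sets do not fight over physical memory at once."""
    for i, running in enumerate(itertools.cycle(pids)):
        if max_slices is not None and i >= max_slices:
            break
        for pid in pids:
            if pid != running:
                os.kill(pid, signal.SIGSTOP)   # park everyone else
        os.kill(running, signal.SIGCONT)       # let one process make progress
        time.sleep(slice_sec)
    for pid in pids:                           # leave everything runnable on exit
        os.kill(pid, signal.SIGCONT)
```

A real monitor would pick its victims by watching fault counters in
/proc/<pid>/stat rather than taking a fixed PID list. (The original
message says SIGRESUME; the POSIX signal that resumes a stopped process
is SIGCONT.)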
* Re: Two naive questions and a suggestion
From: Stephen C. Tweedie @ 1998-11-24 11:17 UTC
To: jfm2; +Cc: sct, linux-mm

Hi,

On 23 Nov 1998 21:59:33 -0000, jfm2@club-internet.fr said:

> The problem is: will you be able to manage the following situation?
>
> Two processes running in an 8 Meg box. Both will page fault every ms
> if you give them 4 Megs (they are scanning large arrays, so no
> locality), and a page fault takes 20 ms to handle. [...] And both of
> these processes would run like hell (no page faults) given 6 Megs of
> memory.

These days, most people agree that in this situation your box is simply
misconfigured for the load. :) Seriously, requirements have changed
enormously since swapping was first implemented.

> The only solution I see is to stop one of them (short of adding
> memory :) and let the other one make some progress. That is swapping.

No, it is not. That is scheduling. Swapping is a very precise term,
used for a mechanism by which we suspend a process and stream all of
its internal state to disk, including page tables and so on. There's
no reason why we can't do a temporary scheduling trick to deal with
this in Linux: it's still not true swapping.

> In 96 I asked for that same feature, gave the same example (same
> numbers :-), and Alan Cox agreed but told me Linux was not used under
> heavy loads. That means we are in a catch-22 situation: Linux is not
> used for heavy loads because it does not handle them well, and the
> necessary features are not implemented because it is not used in
> such situations.

Linux is used under very heavy load, actually.

> And now that we are at it: in 2.0 I found a daemon can be killed by
> the system if it runs out of VM.

Same on any BSD. Once virtual memory is full, any new memory
allocations must fail. It doesn't matter whether that allocation comes
from a user process or a daemon: if there is no more virtual memory,
the process will get a NULL back from malloc. If a daemon dies as a
result of that, the death will happen on any Unix system.

> The problem is: it was a normal user process that had allocated most
> of it, and in addition that daemon could be important enough that it
> is better to kill anything else, so it would be useful to give some
> privilege to root processes here.

No. It's not an issue of the operating system killing processes. It is
an issue of the OS failing a request for new memory, and a process
exit()ing as a result of that failed malloc. The process is voluntarily
exiting, as far as the kernel is concerned.

--Stephen
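[Editor's note: Stephen's distinction — the kernel only fails the
allocation; dying is the program's own choice — can be shown in
miniature. This is an illustrative sketch, not code from the thread;
Python's analogue of a NULL return from malloc is MemoryError, and
`exhausted` is a made-up stand-in allocator for a box that is out of VM.]

```python
def allocate(nbytes, alloc=bytearray):
    """Return a buffer, or None when the allocator reports exhaustion,
    mimicking a C program that checks malloc()'s return value."""
    try:
        return alloc(nbytes)
    except MemoryError:
        return None

def fragile_daemon(nbytes, alloc=bytearray):
    """A daemon that exits on a failed allocation: from the kernel's
    point of view this is a voluntary exit, not an OS kill."""
    buf = allocate(nbytes, alloc)
    if buf is None:
        raise SystemExit(1)
    return buf

def exhausted(_nbytes):
    # stand-in allocator: every request fails, as when VM is full
    raise MemoryError

assert allocate(16) is not None            # normal case
assert allocate(16, exhausted) is None     # robust daemon degrades gracefully
```

A "guaranteed" daemon in jfm2's sense would take the second branch and
shed load; the fragile one turns the failed allocation into its own death.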
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-24 21:44 UTC
To: sct; +Cc: linux-mm

> > The only solution I see is to stop one of them (short of adding
> > memory :) and let the other one make some progress. That is swapping.
>
> No, it is not. That is scheduling. Swapping is a very precise term,
> used for a mechanism by which we suspend a process and stream all of
> its internal state to disk, including page tables and so on. There's
> no reason why we can't do a temporary scheduling trick to deal with
> this in Linux: it's still not true swapping.

Agreed; the important feature is stopping one of the processes when
critically short of memory. Swapping is only a trick for getting more
bandwidth at the expense of pushing a greater amount of process space
in and out of memory, so there is no proof it is faster than letting
other processes steal memory page by page from the now-stopped
process.

> > In 96 I asked for that same feature, gave the same example (same
> > numbers :-), and Alan Cox agreed but told me Linux was not used
> > under heavy loads.
>
> Linux is used under very heavy load, actually.

BSD and Solaris partisans are still boasting about how much better
those systems are at heavy loads. I agree boasting tends to outlive
the situation that originated it.

> > And now that we are at it: in 2.0 I found a daemon can be killed by
> > the system if it runs out of VM.
>
> Same on any BSD. Once virtual memory is full, any new memory
> allocations must fail. It doesn't matter whether that allocation
> comes from a user process or a daemon: if there is no more virtual
> memory, the process will get a NULL back from malloc. If a daemon
> dies as a result of that, the death will happen on any Unix system.

Say the web or database server is deemed important enough that it
should not be killed just because some dimwit is playing with the
GIMP at the console and the GIMP has allocated 80 Megs.

More realistically, it can happen that the X server is killed (-9)
due to the misbehaviour of a user program, and you are trapped with a
useless console. Very difficult to recover from, especially if you
consider that inetd could have been killed too, so no telnetting in.
You can also find half of your daemons are gone: no mail, no
printing, no nothing.

> No. It's not an issue of the operating system killing processes. It
> is an issue of the OS failing a request for new memory, and a
> process exit()ing as a result of that failed malloc. The process is
> voluntarily exiting, as far as the kernel is concerned.

In situations like those above, I would like Linux to support a
concept of guaranteed processes: if VM is exhausted by one of them,
try to get memory by killing non-guaranteed processes, and only kill
the original one if all remaining survivors are guaranteed ones. It
would be better for mission-critical tasks.

-- 
Jean Francois Martinez

Project Independence: Linux for the Masses
http://www.independence.seul.org
* Re: Two naive questions and a suggestion
From: Rik van Riel @ 1998-11-25 6:41 UTC
To: jfm2; +Cc: sct, linux-mm

On 24 Nov 1998 jfm2@club-internet.fr wrote:

> Agreed; the important feature is stopping one of the processes when
> critically short of memory. Swapping is only a trick for getting
> more bandwidth at the expense of pushing a greater amount of process
> space in and out of memory, so there is no proof it is faster than
> letting other processes steal memory page by page from the
> now-stopped process.

When the mythical swapin readahead is merged, we can gain some
ungodly amount of speed almost for free. I don't know if we'll ever
implement the scheduling tricks...

I do have a few ideas for the scheduling stuff though. With RSS
limits (we can safely implement those when the swap cache trick is
implemented) and the keeping of a few statistics, we will be able to
implement the swapping tricks. Without swapin readahead, we'll be
unable to implement them properly, however. :(

> > > And now that we are at it: in 2.0 I found a daemon can be killed
> > > by the system if it runs out of VM.
> >
> > Same on any BSD.
>
> Say the web or database server is deemed important enough that it
> should not be killed just because some dimwit is playing with the
> GIMP at the console and the GIMP has allocated 80 Megs.

It sounds remarkably like you want my Out Of Memory killer patch.
This patch tries to remove the randomness in killing a process when
you're OOM by carefully selecting a process based on a lot of
different factors (size, age, CPU used, suid, root, IOPL, etc.). It
needs to be cleaned up, ported to 2.1.129 and improved a little bit
though... After that it should be ready for inclusion in the kernel.

cheers,

Rik
-- 
slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.       H.H.vanRiel@phys.uu.nl  |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
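[Editor's note: the kind of victim selection Rik describes — weighing
size, age, CPU use and privilege instead of killing at random — can be
sketched as a scoring function. The factors come from his list; the
particular weights and formula here are illustrative, not taken from the
actual patch.]

```python
def badness(vm_kb, runtime_sec, cpu_sec, is_root, has_iopl):
    """Higher score = better OOM-kill candidate. Big, young, CPU-cheap,
    unprivileged processes score highest; root and hardware-touching
    (IOPL) processes are protected by dividing their score down."""
    score = float(vm_kb)                 # memory freed by killing it
    score /= (1 + runtime_sec) ** 0.5    # long-running work is more valuable
    score /= (1 + cpu_sec) ** 0.25       # invested CPU time is more valuable
    if is_root:
        score /= 4                       # privileged daemons are protected
    if has_iopl:
        score /= 4                       # may be talking to hardware directly
    return score

# A fresh 80 Meg image editor outranks a tiny long-lived root daemon,
# which is exactly the GIMP-vs-inetd situation from the thread.
gimp  = badness(80_000, runtime_sec=60,     cpu_sec=5,   is_root=False, has_iopl=False)
inetd = badness(2_000,  runtime_sec=86_400, cpu_sec=100, is_root=True,  has_iopl=False)
assert gimp > inetd
```

The OOM killer then simply kills the process with the highest score,
which makes the choice deterministic instead of random.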
* Re: Two naive questions and a suggestion
From: Stephen C. Tweedie @ 1998-11-25 12:27 UTC
To: Rik van Riel; +Cc: jfm2, sct, linux-mm

Hi,

On Wed, 25 Nov 1998 07:41:41 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> When the mythical swapin readahead is merged, we can gain some
> ungodly amount of speed almost for free. I don't know if we'll ever
> implement the scheduling tricks...

Agreed: usage patterns these days are very different. We simply don't
expect to run massive parallel processes whose combined working sets
exceed physical memory these days. Making sure that we don't thrash
to death is still an important point, but we can achieve that by
guaranteeing processes a minimum RSS quota (so that only those
processes exceeding that quota compete for the remaining physical
memory).

> I do have a few ideas for the scheduling stuff though. With RSS
> limits (we can safely implement those when the swap cache trick is
> implemented) and the keeping of a few statistics, we will be able to
> implement the swapping tricks.

Rik, get real: when will you work out how the VM works? We can safely
implement RSS limits *today*, and have been able to since 2.1.89.
<grin> It's just a matter of doing a vmscan on the current process
whenever it exceeds its own RSS limit. The mechanism is all there.

> Without swapin readahead, we'll be unable to implement them
> properly, however. :(

No, we don't need readahead (although the swap cache itself already
includes all of the necessary mechanism: rw_swap_page(READ, nowait)
will do it). The only extra functionality we might want is extra
control over when we write swap-cached pages: right now, all dirty
pages need to be in the RSS, and we write them to disk when we evict
them to the swap cache. Thus, only clean pages can be in the swap
cache. If we want to support processes with a dirty working set
larger than their RSS, we'd need to extend this.

--Stephen
* Re: Two naive questions and a suggestion
From: Rik van Riel @ 1998-11-25 13:08 UTC
To: Stephen C. Tweedie; +Cc: jfm2, linux-mm

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:

> Rik, get real: when will you work out how the VM works? We can
> safely implement RSS limits *today*, and have been able to since
> 2.1.89. <grin> It's just a matter of doing a vmscan on the current
> process whenever it exceeds its own RSS limit. The mechanism is all
> there.

If we tried to implement RSS limits now, it would mean that the large
task(s) we limited would be continuously thrashing and would keep the
I/O subsystem busy; this impacts the rest of the system a lot.

With the new scheme, we can implement the RSS limit, but the truly
busily used pages would simply stay inside the swap cache, freeing up
I/O bandwidth (at the cost of some memory) for the rest of the
system.

I think that with the new scheme the balancing will be so much better
that we can implement RSS limits without a negative impact on the
rest of the system. With the current VM system, RSS limits would
probably hamper the performance the rest of the system gets.

We might want to perform the scheduling tricks for over-RSS processes,
however. Without swap readahead I really don't see any way we could
run them without holding back the rest of the system too much...

cheers,

Rik
-- 
slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.       H.H.vanRiel@phys.uu.nl  |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
* Re: Two naive questions and a suggestion 1998-11-25 13:08 ` Rik van Riel @ 1998-11-25 14:46 ` Stephen C. Tweedie 1998-11-25 16:47 ` Rik van Riel 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-25 14:46 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, linux-mm Hi, On Wed, 25 Nov 1998 14:08:47 +0100 (CET), Rik van Riel <H.H.vanRiel@phys.uu.nl> said: > On Wed, 25 Nov 1998, Stephen C. Tweedie wrote: >> Rick, get real: when will you work out how the VM works? We can >> safely implement RSS limits *today*, and have been able to since >> 2.1.89. <grin> > If we tried to implement RSS limits now, it would mean that > the large task(s) we limited would be continuously thrashing > and keep the I/O subsystem busy -- this impacts the rest of > the system a lot. WRONG. We can very very easily unlink pages from a process's pte (hence reducing the process's RSS) without removing that page from memory. It's trivial. We do it all the time. We can do it both for memory-mapped files and for anonymous pages. In the latest 2.1.130 prepatch, this is in fact the *preferred* way of swapping. This mechanism is fundamental to the way we maintain page sharing of swapped COW pages. The only thing we cannot do is unlink dirty pages (for swap, that means pages which have been modified since we last paged the swap back into memory). We have to write them back before we unlink. That does not mean that we have to throw the data away: as long as the copy on disk is uptodate, we can have as much of a process's address space as we want in the page cache or swap cache without it being mapped in the process' address space and without it counting as task RSS. Today, such an RSS limit would NOT thrash the IO: it would just cause minor page faults as we relink the cached page back into the page tables. All of that functionality exists today. Rik, you should probably try to work out how try_to_swap_out() actually works one of these days. 
You'll find it does a lot of neat stuff you seem to be unaware of!

We are really a lot closer to having a proper unified page handling mechanism than you think. The handling of dirty pages is pretty much the only missing part of the mechanism right now. Even that is not necessarily a bad thing: there are good performance reasons why we might want the swap cache to contain only clean pages: for example, it makes it easier to guarantee that those pages can be reclaimed for another use at short notice.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 14:46 ` Stephen C. Tweedie @ 1998-11-25 16:47 ` Rik van Riel 1998-11-25 21:02 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-25 16:47 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:
> On Wed, 25 Nov 1998 14:08:47 +0100 (CET), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
>
> > If we tried to implement RSS limits now, it would mean that
> > the large task(s) we limited would be continuously thrashing
> > and keep the I/O subsystem busy -- this impacts the rest of
> > the system a lot.
>
> WRONG. We can very very easily unlink pages from a process's pte
> (hence reducing the process's RSS) without removing that page from
> memory. It's trivial. We do it all the time. Rik, you should
> probably try to work out how try_to_swap_out() actually works one of
> these days.

I just looked in mm/vmscan.c of kernel version 2.1.129, and lines 173, 191 and 205 feature a prominent:

	free_page_and_swap_cache(page);

> We are really a lot closer to having a proper unified page handling
> mechanism than you think. The handling of dirty pages is pretty
> much the only missing part of the mechanism right now.

I know how close we are. I think I posted an assessment of what to do and what to leave yesterday :)) The most essential things can probably be coded in a day or two, if we want to.

Oh, one question. Can we attach a swap page to the swap cache while there's no program using it? This way we can implement a very primitive swapin readahead right now, improving the algorithm as we go along...

> Even that is not necessarily a bad thing: there are good performance
> reasons why we might want the swap cache to contain only clean
> pages: for example, it makes it easier to guarantee that those
> pages can be reclaimed for another use at short notice.

IMHO it would be a big loss to have dirty pages in the swap cache.
Writing out swap pages is cheap since we do proper I/O clustering; not writing them out immediately would result in them being written out in whatever order shrink_mmap() comes across them, which is suboptimal for when we later want to read the pages back.

Besides, having a large/huge clean swap cache means that we can very easily free up memory when we need to; this is essential for NFS buffers, networking stuff, etc. If we keep a quota of 20% of memory in buffers and unmapped cache, we can also do away with a buffer for the 8 and 16kB areas. We can always find some contiguous area in the swap/page cache that we can free...

cheers,

Rik
* Re: Two naive questions and a suggestion 1998-11-25 16:47 ` Rik van Riel @ 1998-11-25 21:02 ` Stephen C. Tweedie 1998-11-25 21:21 ` Rik van Riel 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-25 21:02 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, Linux MM

Hi,

On Wed, 25 Nov 1998 17:47:18 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

>> WRONG. We can very very easily unlink pages from a process's pte
>> (hence reducing the process's RSS) without removing that page from
>> memory. It's trivial. We do it all the time. Rik, you should
>> probably try to work out how try_to_swap_out() actually works one of
>> these days.

> I just looked in mm/vmscan.c of kernel version 2.1.129, and
> line 173, 191 and 205 feature a prominent:
> free_page_and_swap_cache(page);

It is not there in 2.1.130-pre3, however. :) That misses the point, though. The point is that it is trivial to remove these mappings without freeing the swap cache, and the code you point to confirms this: vmscan actually has to go to _extra_ trouble to free the underlying cache if that is wanted (the shared-page case is the same, hence the unuse_page() call at the end of try_to_swap_out(), also removed in 2.1.130-3). The default action of free_page() alone removes the mapping but not the cache entry, so the functionality of leaving the cache present is already there.

> Oh, one question. Can we attach a swap page to the swap cache
> while there's no program using it? This way we can implement
> a very primitive swapin readahead right now, improving the
> algorithm as we go along...

Yes, rw_swap_page(READ, nowait) does exactly that: it primes the swap cache asynchronously but does not map the page anywhere. It should be completely safe right now: the normal swap read is just a special case of this.

> IMHO it would be a big loss to have dirty pages in the swap
> cache. Writing out swap pages is cheap since we do proper
> I/O clustering ...
> Besides, having a large/huge clean swap cache means that we
> can very easily free up memory when we need to, this is
> essential for NFS buffers, networking stuff, etc.

Yep, absolutely: agreed on both counts. This is exactly how 2.1.130-3 works!

> If we keep a quota of 20% of memory in buffers and unmapped
> cache, we can also do away with a buffer for the 8 and 16kB
> area's. We can always find some contiguous area in swap/page
> cache that we can free...

That will kill performance if you have a large simulation which has a legitimate need to keep 90% of physical memory full of anonymous pages. I'd rather do without that 20% magic limit if we can. The only special limit we really need is to make sure that kswapd stays far enough ahead of interrupt-time memory load that the free list doesn't empty.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 21:02 ` Stephen C. Tweedie @ 1998-11-25 21:21 ` Rik van Riel 1998-11-25 22:29 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-25 21:21 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:
> <H.H.vanRiel@phys.uu.nl> said:
>
> It is not there in 2.1.130-pre3, however. :) That misses the point,
> though. The point is that it is trivial to remove these mappings
> without freeing the swap cache, and the code you point to confirms this:

OK, point taken. {:-)

> > Oh, one question. Can we attach a swap page to the swap cache
> > while there's no program using it? This way we can implement
> > a very primitive swapin readahead right now, improving the
> > algorithm as we go along...
>
> Yes, rw_swap_page(READ, nowait) does exactly that: it primes the
> swap cache asynchronously but does not map it anywhere. It should
> be completely safe right now: the normal swap read is just a special
> case of this.

Then I think it's time to do swapin readahead on the entire SWAP_CLUSTER (or just from the point where we faulted) on a dumb-and-dumber basis, awaiting a good readahead scheme. Of course it will need to be sysctl tuneable :)

The reason I propose this dumb scheme is that we can read one SWAP_CLUSTER_MAX-sized chunk in one sweep without having to move the disk's head... Plus Linus might actually accept a change like this :)

> > If we keep a quota of 20% of memory in buffers and unmapped
> > cache, we can also do away with a buffer for the 8 and 16kB
> > area's. We can always find some contiguous area in swap/page
> > cache that we can free...
>
> That will kill performance if you have a large simulation which has a
> legitimate need to keep 90% of physical memory full of anonymous pages.
> I'd rather do without that 20% magic limit if we can.
> The only special limit we really need is to make sure that kswapd
> keeps far enough in advance of interrupt memory load that the free
> list doesn't empty.

OK, then we should let the kernel calculate the limit itself based on the number of soft faults, swapout pressure, memory pressure and process priority.

We can also use stats like this to temporarily suspend very large processes when we've got multiple processes with (p->vm_mm->rss + p->dec_flt) > RSS_THRASH_LIMIT, where p->dec_flt is a floating average and the RSS limit is calculated dynamically as well... I know this could be a slightly expensive trick, but we can easily make that sysctl tuneable as well.

Rik
* Re: Two naive questions and a suggestion 1998-11-25 21:21 ` Rik van Riel @ 1998-11-25 22:29 ` Stephen C. Tweedie 1998-11-26 7:30 ` Rik van Riel 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-25 22:29 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, Linux MM

Hi,

On Wed, 25 Nov 1998 22:21:43 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> Then I think it's time to do swapin readahead on the
> entire SWAP_CLUSTER (or just from the point where we
> faulted) on a dumb-and-dumber basis, awaiting a good
> readahead scheme. Of course it will need to be sysctl
> tuneable :)

Yep, although I'm not sure that reading a whole SWAP_CLUSTER would be a good idea. Contrary to popular belief, disks are still quite slow at sequential data transfer. Non-sequential I/O is obviously enormously slower still, but doing readahead on a whole SWAP_CLUSTER (128k) is definitely _not_ free. It will increase the VM latency enormously if we start reading in a lot of unnecessary data. On the other hand, swap readahead is sufficiently trivial to code that experimenting with good values is not hard. Normal pagein already does a one-block readahead, and doing this in swap would be pretty easy.

The biggest problem with swap readahead is that there is very little guarantee that the next page in any one swap partition is related to the current page: the way we select pages for swapout makes it quite likely that bits of different processes may intermix, and swap partitions can also get fragmented over time. To really benefit from swap readahead, we would also want improved swap clustering which tried to keep a logical association between adjacent physical pages, in the same way that the filesystem does. Right now, the swap clustering is great for output performance but doesn't necessarily lead to disk layouts which are good for swapping.
> Plus Linus might actually accept a change like this :)

If it is tunable, then it is so easy that he might well, yes.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 22:29 ` Stephen C. Tweedie @ 1998-11-26 7:30 ` Rik van Riel 1998-11-26 12:48 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-26 7:30 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:
> On Wed, 25 Nov 1998 22:21:43 +0100 (CET), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
>
> > Then I think it's time to do swapin readahead on the
> > entire SWAP_CLUSTER
>
> Yep, although I'm not sure that reading a whole SWAP_CLUSTER would
> be a good idea. Contrary to popular belief, disks are still quite
> slow at sequential data transfer.

I have a better idea for a default limit:

	swap_stream.max = num_physpages >> 9;
	if (swap_stream.max > SWAP_CLUSTER_MAX)
		swap_stream.max = SWAP_CLUSTER_MAX;
	swap_stream.enabled = 0;

> Non-sequential IO is obviously enormously slower still, but doing
> readahead on a whole SWAP_CLUSTER (128k) is definitely _not_ free.
> It will increase the VM latency enormously if we start reading in a
> lot of unnecessary data.

We could simply increase the readahead if we were more than 50% successful (i.e. 80% of swap requests can be satisfied from the swap cache) and decrease it if we drop below 40% (or less than 50% of swap requests can be serviced from the swap cache).

One thing that helps us enormously is the way kswapd pages out stuff. If pages (within a process) have the same kind of usage pattern and are near each other, they will be swapped out together. Now since they have the same usage pattern, it is likely that they are needed together as well. Especially without page aging we are likely to store adjacent pages next to each other in swap.
Later on (when the simple code has been proven to work and Linus doesn't pay attention) we can introduce a really intelligent swapin readahead mechanism that will make Linux rock :) It's just that we need something simple now because Linus wants the kernel to stay relatively unchanged at the moment...

cheers,

Rik
* Re: Two naive questions and a suggestion 1998-11-26 7:30 ` Rik van Riel @ 1998-11-26 12:48 ` Stephen C. Tweedie 0 siblings, 0 replies; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-26 12:48 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, Linux MM

Hi,

On Thu, 26 Nov 1998 08:30:20 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> We could simply increase the readahead if we were more
> than 50% succesful (ie. 80% of swap requests can be
> satisfied from the swap cache) and decrease it if we
> drop below 40% (or less than 50% of swap requests can
> be serviced from the swap cache).

Yes --- do a patch, show us some benchmarks! We could make a big difference with this.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 6:41 ` Rik van Riel 1998-11-25 12:27 ` Stephen C. Tweedie @ 1998-11-25 20:01 ` jfm2 1998-11-26 7:16 ` Rik van Riel 1 sibling, 1 reply; 29+ messages in thread From: jfm2 @ 1998-11-25 20:01 UTC (permalink / raw) To: H.H.vanRiel; +Cc: jfm2, sct, linux-mm

> Without swapin readahead, we'll be unable to implement them
> properly however :(

> > > > And now we are at it: in 2.0 I found a daemon can be killed by the
> > > > system if it runs out of VM.
> > >
> > > Same on any BSD.
> >
> > Say the Web or database server can be deemed important enough for it
> > not to be killed just because some dimwit is playing with the GIMP
> > at the console and the GIMP has allocated 80 Megs.
>
> It sounds remarkably like you want my Out Of Memory killer
> patch. This patch tries to remove the randomness in killing
> a process when you're OOM by carefully selecting a process
> based on a lot of different factors (size, age, CPU used,
> suid, root, IOPL, etc).
>
> It needs to be cleaned up, ported to 2.1.129 and improved
> a little bit though... After that it should be ready for
> inclusion in the kernel.

Your scheme is (IMHO) far too complicated and (IMHO) falls short. The problem is that the kernel has no way to know which process in the box is really important. For instance you can have a database server running as a normal user that is far more important than the X server (setuid root), whose only real goal is to provide a user-friendly UI for administering the database.

Why not simply allow a root-owned process to declare itself (and the program it will exec into) as "guaranteed"?
Only a human can know what is important and what is unimportant in a box, so it should be a human who, by way of starting a program through a "guaranteer", has the final word on what should be protected.

Allow an option for having this privilege extended to descendants of the process, since some database programs start special daemons for other tasks and will not run without them. Or take a box used as a mail server running qmail: qmail starts sub-servers, each one for a different task.

Of course this is only a suggestion for a mechanism, but the important thing is allowing a human to have the final word.

-- Jean Francois Martinez Project Independence: Linux for the Masses http://www.independence.seul.org
* Re: Two naive questions and a suggestion 1998-11-25 20:01 ` jfm2 @ 1998-11-26 7:16 ` Rik van Riel 1998-11-26 19:59 ` jfm2 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-26 7:16 UTC (permalink / raw) To: jfm2; +Cc: Stephen C. Tweedie, Linux MM

On 25 Nov 1998 jfm2@club-internet.fr wrote:
> > It sounds remarkably like you want my Out Of Memory killer
> > patch. This patch tries to remove the randomness in killing
> > a process when you're OOM by carefully selecting a process
> > based on a lot of different factors (size, age, CPU used,
> > suid, root, IOPL, etc).
>
> Your scheme is (IMHO) far too complicated and (IMHO) falls short.
> The problem is that the kernel has no way to know what is the really
> important process in the box.

In my (and other people's) experience, an educated guess is better than a random kill. Furthermore it is not possible to get out of the OOM situation without killing one or more processes, so we want to limit:
- the number of processes we kill (reducing the chance of killing something important)
- the CPU time 'lost' when we kill something (so we don't have to run that simulation for two weeks again)
- the risk of killing something important and stable; we try to avoid this by giving fewer hitpoints to older processes (which presumably are stable and would take a long time to recreate the state they are in now)
- the amount of work lost -- killing new processes that haven't used much CPU is a way of doing this
- the probability of the machine hanging -- don't kill IOPL programs, and limit the points for old daemons and root/suid stuff

Granted, we can never make a perfect guess. It will be a lot better than a more or less random kill, however.

The large simulation that's taking 70% of your RAM and has run for 2 weeks is the most likely victim under our current scheme, but with my killer code its priority will be far lower than that of a newly-started and exploded GIMP or Netscape...
> Why not simply allow a root-owned process declare itself (and the
> program it will exec into) as "guaranteed"?

If the guaranteed program explodes it will kill the machine. Even for single-purpose machines this will be bad, since it will increase the downtime with a reboot-and-fsck cycle instead of just a program restart.

> Or a box used as a mail server using qmail: qmail starts sub-servers
> each one for a different task.

The children are younger and will be killed first. Starting the master server from init will make sure that it is restarted in the case of a real emergency or fluke.

> Of course this is only a suggestion for a mechanism but the important
> is allowing a human to have the final word.

What? You have a person sitting around keeping an eye on your mailserver 24x7? Usually the most important servers are tucked away in a closet and crash at 03:40 AM when the sysadmin is in bed 20 miles away... The kernel is there to prevent Murphy from taking over :)

cheers,

Rik
* Re: Two naive questions and a suggestion 1998-11-26 7:16 ` Rik van Riel @ 1998-11-26 19:59 ` jfm2 1998-11-27 17:45 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: jfm2 @ 1998-11-26 19:59 UTC (permalink / raw) To: H.H.vanRiel; +Cc: jfm2, sct, linux-mm

> On 25 Nov 1998 jfm2@club-internet.fr wrote:
>
> > > It sounds remarkably like you want my Out Of Memory killer
> > > patch. This patch tries to remove the randomness in killing
> > > a process when you're OOM by carefully selecting a process
> > > based on a lot of different factors (size, age, CPU used,
> > > suid, root, IOPL, etc).
> >
> > Your scheme is (IMHO) far too complicated and (IMHO) falls short.
> > The problem is that the kernel has no way to know what is the really
> > important process in the box.
>
> In my (and other people's) experience, an educated guess is
> better than a random kill. Furthermore it is not possible to
> get out of the OOM situation without killing one or more
> processes, so we want to limit:
> - the number of processes we kill (reducing the chance of
>   killing something important)
> - the CPU time 'lost' when we kill something (so we don't
>   have to run that simulation for two weeks again)
> - the risk of killing something important and stable, we
>   try to avoid this by giving less hitpoints to older
>   processes (which presumably are stable and take a long
>   time to 'recreate' the state in which they are now)
> - the amount of work lost -- killing new processes that
>   haven't used much CPU is a way of doing this
> - the probability of the machine hanging -- don't kill
>   IOPL programs and limit the points for old daemons
>   and root/suid stuff
>
> Granted, we can never make a perfect guess. It will be a
> lot better than a more or less random kill, however.
> The large simulation that's taking 70% of your RAM and
> has run for 2 weeks is the most likely victim under our
> current scheme, but with my killer code its priority
> will be far lower than that of a newly-started and exploded
> GIMP or Netscape...

My idea was:
- VM exhausted and the allocating process is a normal process: kill that process.
- VM exhausted and the allocating process is a guaranteed one: kill a non-guaranteed process.
- VM exhausted, the allocating process is guaranteed, but the only remaining processes are guaranteed ones: kill the allocating process.

Of course init is guaranteed.

> > Why not simply allow a root-owned process declare itself (and the
> > program it will exec into) as "guaranteed"?
>
> If the guaranteed program explodes it will kill the machine.
> Even for single-purpose machines this will be bad since it
> will increase the downtime with a reboot&fsck cycle instead
> of just a program restart.

No; see above. The guaranteed program would be killed once the "unimportant" processes have been killed. The goal is not to grant impunity to guaranteed programs, but to protect an important program against possible misbehaviour of other programs: think of a misbehaving process that has allocated all the VM except one page, and then our database server tries to allocate two more pages.

> > Or a box used as a mail server using qmail: qmail starts sub-servers
> > each one for a different task.
>
> The children are younger and will be killed first. Starting
> the master server from init will make sure that it is
> restarted in the case of a real emergency or fluke.
>
> > Of course this is only a suugestion for a mechanism but the important
> > is allowing a human to have the final word.
>
> What? You have a person sitting around keeping an eye on
> your mailserver 24x7? Usually the most important servers
> are tucked away in a closet and crash at 03:40 AM when
> the sysadmin is in bed 20 miles away...

No.
The sysadmin uses emacs at normal hours to edit a file listing the important processes. Now it is up to you to find a scheme by which the sysadmin's wishes are communicated to the kernel. :-)
* Re: Two naive questions and a suggestion 1998-11-26 19:59 ` jfm2 @ 1998-11-27 17:45 ` Stephen C. Tweedie 1998-11-27 21:14 ` jfm2 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-27 17:45 UTC (permalink / raw) To: jfm2; +Cc: H.H.vanRiel, sct, linux-mm

Hi,

On 26 Nov 1998 19:59:42 -0000, jfm2@club-internet.fr said:

> My idea was:
> - VM exhausted and the allocating process is a normal process:
>   kill that process.
> - VM exhausted and the allocating process is a guaranteed one:
>   kill a non-guaranteed process.
> - VM exhausted, the allocating process is guaranteed, but the only
>   remaining processes are guaranteed ones: kill the allocating process.

But the _whole_ problem is that we do not necessarily go around killing processes. We just fail requests for new allocations. In that case we still have not run out of memory yet, but a daemon may have died. It is simply not possible to guarantee all of the future memory allocations which a process might make!

--Stephen
* Re: Two naive questions and a suggestion 1998-11-27 17:45 ` Stephen C. Tweedie @ 1998-11-27 21:14 ` jfm2 0 siblings, 0 replies; 29+ messages in thread From: jfm2 @ 1998-11-27 21:14 UTC (permalink / raw) To: sct; +Cc: jfm2, H.H.vanRiel, linux-mm

> Date: Fri, 27 Nov 1998 17:45:55 GMT
> From: "Stephen C. Tweedie" <sct@redhat.com>
> Cc: H.H.vanRiel@phys.uu.nl, sct@redhat.com, linux-mm@kvack.org
>
> Hi,
>
> On 26 Nov 1998 19:59:42 -0000, jfm2@club-internet.fr said:
>
> > My idea was:
> > - VM exhausted and the allocating process is a normal process:
> >   kill that process.
> > - VM exhausted and the allocating process is a guaranteed one:
> >   kill a non-guaranteed process.
> > - VM exhausted, the allocating process is guaranteed, but the only
> >   remaining processes are guaranteed ones: kill the allocating process.
>
> But the _whole_ problem is that we do not necessarily go around
> killing processes. We just fail requests for new allocations. In
> that case we still have not run out of memory yet, but a daemon may
> have died. It is simply not possible to guarantee all of the future
> memory allocations which a process might make!

The word "guaranteed" was an unfortunate one. "Protected" would have been better. As a user I feel there are processes more equal than others, and I find it unfortunate when one of them is killed while trying to grow its stack (SIGKILL, so no recovering) because it was unable to do so due to the misbehaviour of an unimportant process. I think they should be protected, and it is the sysadmin, not a heuristic, who should define what is important and what is not in a box. We cannot guarantee the success of a memory allocation, but we can make mission-critical software more robust.

But if you think the idea is bad we can kill this thread.
* Re: Two naive questions and a suggestion 1998-11-24 21:44 ` jfm2 1998-11-25 6:41 ` Rik van Riel @ 1998-11-25 14:48 ` Eric W. Biederman 1998-11-25 20:29 ` jfm2 1998-11-25 16:31 ` ralf 2 siblings, 1 reply; 29+ messages in thread From: Eric W. Biederman @ 1998-11-25 14:48 UTC (permalink / raw) To: jfm2; +Cc: sct, linux-mm

>>>>> "jfm2" == jfm2 <jfm2@club-internet.fr> writes:

jfm2> Say the Web or database server can be deemed important enough for it
jfm2> not to be killed just because some dimwit is playing with the GIMP
jfm2> at the console and the GIMP has allocated 80 Megs.

jfm2> More realistically, it can happen that the X server is killed
jfm2> (-9) due to the misbehaviour of a user program and you get
jfm2> trapped with a useless console. Very difficult to recover, especially
jfm2> if you consider inetd could have been killed too, so no telnetting.

jfm2> You can also find half of your daemons are gone. That is no mail, no
jfm2> printing, no nothing.

init is never killed. It won't and can't be killed. init should be configured to restart all of your important daemons if they go down. Currently most Unix systems (I don't think it's Linux-specific) are misconfigured so that they don't automatically restart their important daemons when those go down.

jfm2> In situations like those above I would like Linux to support a concept
jfm2> like guaranteed processes: if VM is exhausted by one of them, then try
jfm2> to get memory by killing non-guaranteed processes, and only kill the
jfm2> original one if all remaining survivors are guaranteed ones.
jfm2> It would be better for mission-critical tasks.

Some. But it would be simpler and much healthier for tasks that can be down for a little while to have init restart the processes after they go down. That allows for other cases where an important system daemon goes down, is more robust, and doesn't require kernel changes.

Eric
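The init-based restart Eric describes is driven by "respawn" entries in /etc/inittab under SysV init. The entry ids, runlevels and daemon paths below are examples only; the one real constraint is that a respawned program must stay in the foreground (not daemonize), or init will see it "exit" immediately and respawn-loop:

```
# /etc/inittab fragment (SysV init) -- illustrative entries only.
# Format: id:runlevels:action:command
# "respawn" restarts the command whenever it exits.
db:2345:respawn:/usr/local/bin/dbserver -nodaemon
lp:2345:respawn:/usr/sbin/lpd -F
```

After editing, `telinit q` (or `kill -HUP 1`) makes init re-read the file without a reboot.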
* Re: Two naive questions and a suggestion 1998-11-25 14:48 ` Eric W. Biederman @ 1998-11-25 20:29 ` jfm2 0 siblings, 0 replies; 29+ messages in thread From: jfm2 @ 1998-11-25 20:29 UTC (permalink / raw) To: ebiederm+eric; +Cc: jfm2, sct, linux-mm

> >>>>> "jfm2" == jfm2 <jfm2@club-internet.fr> writes:
>
> jfm2> Say the Web or database server can be deemed important enough for it
> jfm2> not being killed just because some dim witt is playing with the GIMP
> jfm2> at the console and the GIMP has allocated 80 Megs.
>
> jfm2> More reallistically, it can happen that the X server is killed
> jfm2> (-9) due to the misbeahviour of a user program and you get
> jfm2> trapped with a useless console. Very diificult to recover. Specially
> jfm2> if you consider inetd could have been killed too, so no telnetting.
>
> jfm2> You can also find half of your daemons, are gone. That is no mail, no
> jfm2> printing, no nothing.
>
> initd is never killed. Won't & can't be killed.
> initd should be configured to restart all of your important daemons if
> they go down.

This does not solve the problem. To begin with, after an unclean shutdown a database server spends time rolling back uncommitted transactions and possibly writing some committed ones to the database from its journals. Users could prefer a database that doesn't go down in the first place.

Second: the 80 Megs GIMP is still there, so when init restarts the database, the database tries to allocate memory and crashes again.

Third: a process can crash because it is misconfigured or a file is corrupted. And crash again if you restart it. It is not init's job to do things like try five times and then use a pager interface to send a message to the admin in case there is a sixth crash.

It could be argued that "guaranteed" processes are not a good idea, but using init is not the way to address the problem.
--
Jean Francois Martinez
Project Independence: Linux for the Masses
http://www.independence.seul.org

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Two naive questions and a suggestion
  1998-11-24 21:44 ` jfm2
  1998-11-25  6:41 ` Rik van Riel
  1998-11-25 14:48 ` Eric W. Biederman
@ 1998-11-25 16:31   ` ralf
  1998-11-26 12:18   ` Rik van Riel
  2 siblings, 1 reply; 29+ messages in thread
From: ralf @ 1998-11-25 16:31 UTC (permalink / raw)
  To: jfm2, sct; +Cc: linux-mm

On Tue, Nov 24, 1998 at 09:44:32PM -0000, jfm2@club-internet.fr wrote:

> In situations like those above I would like Linux to support a concept
> like guaranteed processes: if VM is exhausted by one of them, then try
> to get memory by killing non-guaranteed processes, and only kill the
> original one if all remaining survivors are guaranteed ones.
> It would be better for mission-critical tasks.

A long time ago I suggested making it configurable whether or not a process
gets memory which might be overcommitted. That leaves malloc(x) == NULL to
deal with, and that's a userland problem anyway.

  Ralf

^ permalink raw reply [flat|nested] 29+ messages in thread
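[Editorial note: "malloc(x) == NULL is a userland problem" means that under strict, non-overcommitted accounting an allocation can fail up front, so the application needs a policy for that case. A minimal C sketch; `alloc_with_fallback` is a hypothetical helper, not anything from the thread:]

```c
#include <assert.h>
#include <stdlib.h>

/* With overcommit disabled for this process, malloc() can really
 * return NULL instead of the kernel killing something later, so
 * the caller must handle failure.  One possible policy: try the
 * preferred size, then degrade to a smaller buffer before giving up. */
static void *alloc_with_fallback(size_t want, size_t fallback, size_t *got)
{
    void *p = malloc(want);
    if (p != NULL) {
        *got = want;
        return p;
    }
    p = malloc(fallback);           /* degrade gracefully */
    *got = (p != NULL) ? fallback : 0;
    return p;
}
```

The point is that the kernel-side policy (reserve real backing store at allocation time) only works if userland actually checks the return value; a program that assumes malloc() never fails gets no benefit from the guarantee.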
* Re: Two naive questions and a suggestion
  1998-11-25 16:31 ` ralf
@ 1998-11-26 12:18   ` Rik van Riel
  0 siblings, 0 replies; 29+ messages in thread
From: Rik van Riel @ 1998-11-26 12:18 UTC (permalink / raw)
  To: ralf; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998 ralf@uni-koblenz.de wrote:
> On Tue, Nov 24, 1998 at 09:44:32PM -0000, jfm2@club-internet.fr wrote:
>
> > In situations like those above I would like Linux to support a concept
> > like guaranteed processes: if VM is exhausted by one of them, then try
> > to get memory by killing non-guaranteed processes, and only kill the
> > original one if all remaining survivors are guaranteed ones.
> > It would be better for mission-critical tasks.
>
> A long time ago I suggested making it configurable whether or not a
> process gets memory which might be overcommitted. That leaves
> malloc(x) == NULL to deal with, and that's a userland problem anyway.

Then what would you do when your 250MB non-overcommitting program needs
to do a fork() in order to call /usr/bin/lpr? Install an extra 250MB of
swap? I don't think so :) These are the situations where sane people
want overcommit.

regards,

Rik -- who actually has 250MB of extra swap...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

^ permalink raw reply [flat|nested] 29+ messages in thread
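[Editorial note: the fork()-before-exec objection is that between fork() and exec() the child nominally owns a copy of the parent's entire address space; with strict no-overcommit accounting the kernel would have to reserve the parent's full 250MB again, even though exec() discards the copy an instant later. With overcommit, the copy-on-write pages are never charged. A minimal C sketch; `run_child` is a hypothetical helper standing in for the "call /usr/bin/lpr" step:]

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork()+exec(): under strict accounting the child's copy-on-write
 * image of the (possibly huge) parent must be fully reserved here,
 * even though execlp() replaces it a moment later.  Overcommit makes
 * this transient duplication free. */
static int run_child(const char *cmd)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                       /* fork failed, nothing reserved */
    if (pid == 0) {
        execlp(cmd, cmd, (char *)NULL);  /* child: replace the image */
        _exit(127);                      /* only reached if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Historically this transient-duplication cost is exactly what vfork() was invented to sidestep: the child borrows the parent's address space until it calls exec().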
end of thread, other threads:[~1998-11-27 2:08 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-11-19  0:20 Two naive questions and a suggestion jfm2
1998-11-19 20:05 ` Rik van Riel
1998-11-20  1:25 ` jfm2
1998-11-20 15:31 ` Eric W. Biederman
1998-11-23 18:08 ` Stephen C. Tweedie
1998-11-23 20:45 ` jfm2
1998-11-23 21:59 ` jfm2
1998-11-24  1:21 ` Vladimir Dergachev
1998-11-24 11:17 ` Stephen C. Tweedie
1998-11-24 21:44 ` jfm2
1998-11-25  6:41 ` Rik van Riel
1998-11-25 12:27 ` Stephen C. Tweedie
1998-11-25 13:08 ` Rik van Riel
1998-11-25 14:46 ` Stephen C. Tweedie
1998-11-25 16:47 ` Rik van Riel
1998-11-25 21:02 ` Stephen C. Tweedie
1998-11-25 21:21 ` Rik van Riel
1998-11-25 22:29 ` Stephen C. Tweedie
1998-11-26  7:30 ` Rik van Riel
1998-11-26 12:48 ` Stephen C. Tweedie
1998-11-25 20:01 ` jfm2
1998-11-26  7:16 ` Rik van Riel
1998-11-26 19:59 ` jfm2
1998-11-27 17:45 ` Stephen C. Tweedie
1998-11-27 21:14 ` jfm2
1998-11-25 14:48 ` Eric W. Biederman
1998-11-25 20:29 ` jfm2
1998-11-25 16:31 ` ralf
1998-11-26 12:18 ` Rik van Riel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox