* Two naive questions and a suggestion
From: jfm2 @ 1998-11-19 0:20 UTC
To: linux-mm

1) Is there any text describing memory management in 2.1? (Forgive me
if I missed an obvious URL.)

2) Are there plans for implementing the swapping of whole processes a
la BSD?

Suggestion: given that the requirements for a workstation (quick
response) are different from those for a server (high throughput), it
could make sense to let the user either select the VM policy through
/proc, or have a form of loadable VM manager. Or select it at compile
time.

-- 
Jean Francois Martinez

-- 
This is a majordomo managed list. To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org
* Re: Two naive questions and a suggestion
From: Rik van Riel @ 1998-11-19 20:05 UTC
To: jfm2; +Cc: linux-mm

On 19 Nov 1998 jfm2@club-internet.fr wrote:

> 1) Is there any text describing memory management in 2.1? (Forgive me
> if I missed an obvious URL.)

Not yet; I really should be working on that (the code
seems to have stabilized now)...

> 2) Are there plans for implementing the swapping of whole processes a
> la BSD?

Yes, there are plans. The plans are quite detailed too, but
I think I haven't put them up on my home page yet.

> Suggestion: given that the requirements for a workstation (quick
> response) are different from those for a server (high throughput), it
> could make sense to let the user either select the VM policy through
> /proc, or have a form of loadable VM manager. Or select it at
> compile time.

There are quite a lot of things you can tune in /proc. I don't know
if you have read the documentation, but if you start trying things
you'll be amazed how much you can change the system's behaviour with
the existing controls.

Btw, since you are so enthusiastic about documentation,
would you be willing to help me write it?

cheers,

Rik
-- 
slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.       H.H.vanRiel@phys.uu.nl  |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-20 1:25 UTC
To: H.H.vanRiel; +Cc: linux-mm

> > 2) Are there plans for implementing the swapping of whole processes a
> > la BSD?
>
> Yes, there are plans. The plans are quite detailed too, but
> I think I haven't put them up on my home page yet.

This will close the gap between Linux and the *BSDs at high loads. It
will also close the mouths of some BSD people who talk loudly about
the areas where BSD is superior and carefully forget SMP, ELF or
modules, to name just a few areas where Linux got there first.

> There are quite a lot of things you can tune in /proc. I don't know
> if you have read the documentation, but if you start trying things
> you'll be amazed how much you can change the system's behaviour with
> the existing controls.

I have read a bit about them, but sometimes changing the algorithm is
the right thing to do.

> Btw, since you are so enthusiastic about documentation,
> would you be willing to help me write it?

I could try to help you, but it will be limited help. I already work
on a "Linux for normal people" project, and I also wanted to write an
article about optimizing a Linux box. The goal is to smash the myth
about kernel compiling.

Why? Because in 95 my brother-in-law needed a computer for his thesis
in Spanish literature. I remembered kernel compiling and I led him to
Apple Expo. That day one thing was clear: Linux will never reach world
domination as long as literature professors cannot use it, and as long
as kernel compiling is necessary or even recommended, Linux will be
off limits for literature professors.

So I scanned the source code in 2.0.34 and found the insignificant
differences between code compiled for Pentiums and for 386s. Then I
compiled the Byte benchmark twice, once with the compile flags used
for the 386 kernels and once with the ones for Pentiums and PPros.
The difference in speed was under 2%, both on a real Pentium and on a
K6. So much for "it will allow you to tune to the processor".

About memory savings: first of all, in 98 distributors shipping
crippled kernels should be shot; modular 2.0 has been around for over
two years. Modularity has also reduced the memory savings you get
from recompiling the kernel (if the distributor did a good job),
while machines have grown: over 1.5 Megs saved on an 8 Meg box was
significant (1.2.13 in 95); 500K on a 32 Meg box is a trifle (2.0 in
98). This is not entirely true: you can write pathological programs
where a single page means the difference between blinding speed and
hours of swapping. Also, the significant number is the increase in
the memory you lack: a 500K deficit becoming 1 Meg. Consider disk
bandwidth too: being 16 Megs short on a 32 Meg box is much worse than
being 4 Megs short on an 8 Meg box, because you need much more time
to push 16 Megs to the disk. On the other hand, processes will spend
more time analyzing a big array on a big box than a small array on a
small box (processor speed being equal), and this works in favour of
the 32 Meg box that is 16 Megs short, for normal, non-pathological
programs.

Finally there is the question of probability: 500K is under 2% on a
32 Meg box, so when programs need more memory than you have, there is
a good chance they miss the mark by 20 or 30% and rarely land
squarely in that 500K zone.

This needs refining, but indulge me: I am writing at 2am. A (not to
be published) conclusion is: "Kernel compiling is a thing performed
only by idiots and kernel hackers". I am not a kernel hacker and I
have performed over two hundred of them. :-)

Perhaps we could help one another with our docs/articles.

-- 
Jean Francois Martinez

Project Independence: Linux for the Masses
http://www.independence.seul.org
* Re: Two naive questions and a suggestion
From: Eric W. Biederman @ 1998-11-20 15:31 UTC
To: jfm2; +Cc: H.H.vanRiel, linux-mm

>>>>> "jfm2" == jfm2 <jfm2@club-internet.fr> writes:

jfm2> A (not to be published) conclusion is: "Kernel compiling is a thing
jfm2> performed only by idiots and kernel hackers". I am not a kernel
jfm2> hacker and I have performed over two hundred of them. :-)

No. As far as functionality goes, I don't trust a Linux box that
doesn't have its standard hardware drivers (comm port, floppy disk,
etc.) compiled in. A modular kernel seems to work well for protocol
layers, however.

An important advantage of Linux is what you can do when something
isn't working automatically.

With Windows you have 2 possibilities:
1) Something works automatically.
2) Something doesn't work.

With Linux you have 3 possibilities:
1) Something works automatically. (We need more in this category.)
2) Something can, with research and looking around, be made to work.
   (The ability to compile a kernel is an advantage here.)
3) Something doesn't work. (Linux has much less in this category than
   any other OS.)

The memory management system has tuning parameters, but generally
resorting to them is dropping down to case 2. In most cases, with 2.0
and probably also with 2.2, the memory management system should be a
case of "it works automatically".

Eric
* Re: Two naive questions and a suggestion
From: Stephen C. Tweedie @ 1998-11-23 18:08 UTC
To: jfm2; +Cc: linux-mm, Stephen Tweedie

Hi,

On 19 Nov 1998 00:20:37 -0000, jfm2@club-internet.fr said:

> 1) Is there any text describing memory management in 2.1? (Forgive me
> if I missed an obvious URL.)

The source code. :)

> 2) Are there plans for implementing the swapping of whole processes a
> la BSD?

Not exactly, but there are substantial plans for other related changes.
In particular, most of the benefits of BSD-style swapping can be
achieved through swapping of page tables, dynamic RSS limits and
streaming swapout, all of which are on the slate for 2.3.

--Stephen
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-23 20:45 UTC
To: sct; +Cc: linux-mm
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-23 21:59 UTC
To: sct; +Cc: linux-mm

> > 1) Is there any text describing memory management in 2.1? (Forgive me
> > if I missed an obvious URL.)
>
> The source code. :)

I knew about it. :) And this is not a URL. :)

> > 2) Are there plans for implementing the swapping of whole processes a
> > la BSD?
>
> Not exactly, but there are substantial plans for other related changes.
> In particular, most of the benefits of BSD-style swapping can be
> achieved through swapping of page tables, dynamic RSS limits and
> streaming swapout, all of which are on the slate for 2.3.

The problem is: will you be able to manage the following situation?

Two processes running in an 8 Meg box. Both will page fault every ms
if you give them 4 Megs (they are scanning large arrays, so no
locality), and a page fault takes 20 ms to handle. That means only 5%
of the CPU time is used; the remainder is spent waiting for pages to
be brought in from disk or pushing pages of the other process out of
memory. And both of these processes would run like hell (no page
faults) given 6 Megs of memory.

The only solution I see is to stop one of them (short of adding
memory :) and let the other one make some progress. That is swapping.
Of course swapping can be undesirable on workstations, and that is
the reason I suggested user control over the MM policy, be it by
recompiling, by /proc or by module insertion.

In 96 I asked for that same feature, gave the same example (same
numbers :-), and Alan Cox agreed but told me Linux was not used under
heavy loads. That means we are in a catch-22 situation: Linux is not
used for heavy loads because it does not handle them well, and the
necessary features are not implemented because it is not used in such
situations.

And now that we are at it: in 2.0 I found a daemon can be killed by
the system if it runs out of VM. The problem is: it was a normal user
process that had allocated most of it, and in addition that daemon
could be important enough that it is better to kill anything else, so
it would be useful to give some privilege to root processes here.

I think this ends my Christmas wish list. :)

-- 
Jean Francois Martinez

Project Independence: Linux for the Masses
http://www.independence.seul.org
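[Editor's note: the arithmetic in the thrashing scenario above can be made
explicit. A minimal sketch; the 1 ms compute / 20 ms fault-service figures
are the numbers from the message, not measurements:]

```python
def cpu_utilization(compute_ms_per_fault: float, fault_service_ms: float) -> float:
    """Fraction of wall-clock time spent doing useful work when every
    compute_ms_per_fault ms of CPU work triggers a page fault that takes
    fault_service_ms ms to service (disk wait, eviction of the other
    process's pages)."""
    return compute_ms_per_fault / (compute_ms_per_fault + fault_service_ms)

# 1 ms of work per 20 ms fault: 1 / (1 + 20), i.e. roughly the 5% claimed.
print(round(cpu_utilization(1, 20) * 100, 1))  # -> 4.8
```

With 6 Megs per process there are no faults at all, so utilization jumps
from ~5% to ~100%: the cliff the message is describing.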
* Re: Two naive questions and a suggestion
From: Vladimir Dergachev @ 1998-11-24 1:21 UTC
To: jfm2; +Cc: sct, linux-mm

On 23 Nov 1998 jfm2@club-internet.fr wrote:

> Two processes running in an 8 Meg box. Both will page fault every ms
> if you give them 4 Megs (they are scanning large arrays, so no
> locality), and a page fault takes 20 ms to handle. [...]
>
> The only solution I see is to stop one of them (short of adding
> memory :) and let the other one make some progress. That is swapping.

What about this solution: write a small program that monitors (via
/proc) for programs that swap a lot. If this happens, give them "wide
slice" time, i.e. send SIGSTOP to everybody and alternate
SIGCONT/SIGSTOP at intervals of 0.5-1 sec. This would give you the
concurrency you want and will also (mostly) eliminate the swapping.

Since you need this heavy load anyway, I don't think your programs
will complain about long delays; for all they know, this time is
taken up by swapping. And since this is a userspace program, you can
fine tune it as much as you want, all the way to an OOM killer that
pops up a nice box on the terminal and asks what you want to do.

Vladimir Dergachev
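[Editor's note: the coarse "wide slice" scheduler Vladimir describes could
be sketched in userspace roughly as follows. This is a sketch under
assumptions, not his actual program: the selection of thrashing PIDs via
/proc is elided, and the PID list is taken as given.]

```python
import itertools
import os
import signal
import time

def wide_slice(pids, slice_sec=0.5, max_slices=None):
    """Round-robin 'wide time slice' scheduling: at any moment exactly one
    of the given processes is runnable and the rest are SIGSTOPped, so
    their working sets do not fight over physical memory at once."""
    for i, running in enumerate(itertools.cycle(pids)):
        if max_slices is not None and i >= max_slices:
            break
        for pid in pids:
            if pid != running:
                os.kill(pid, signal.SIGSTOP)   # park everyone else
        os.kill(running, signal.SIGCONT)       # let one process make progress
        time.sleep(slice_sec)
    for pid in pids:                           # leave everything runnable on exit
        os.kill(pid, signal.SIGCONT)
```

A real monitor would pick its victims by watching fault counters in
/proc/<pid>/stat rather than taking a fixed PID list. (The original
message says SIGRESUME; the POSIX signal that resumes a stopped process
is SIGCONT.)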
* Re: Two naive questions and a suggestion
From: Stephen C. Tweedie @ 1998-11-24 11:17 UTC
To: jfm2; +Cc: sct, linux-mm

Hi,

On 23 Nov 1998 21:59:33 -0000, jfm2@club-internet.fr said:

> The problem is: will you be able to manage the following situation?
>
> Two processes running in an 8 Meg box. Both will page fault every ms
> if you give them 4 Megs (they are scanning large arrays, so no
> locality), and a page fault takes 20 ms to handle. [...] And both of
> these processes would run like hell (no page faults) given 6 Megs of
> memory.

These days, most people agree that in this situation your box is simply
misconfigured for the load. :) Seriously, requirements have changed
enormously since swapping was first implemented.

> The only solution I see is to stop one of them (short of adding
> memory :) and let the other one make some progress. That is swapping.

No, it is not. That is scheduling. Swapping is a very precise term,
used for a mechanism by which we suspend a process and stream all of
its internal state to disk, including page tables and so on. There's
no reason why we can't do a temporary scheduling trick to deal with
this in Linux: it's still not true swapping.

> In 96 I asked for that same feature, gave the same example (same
> numbers :-), and Alan Cox agreed but told me Linux was not used under
> heavy loads. That means we are in a catch-22 situation: Linux is not
> used for heavy loads because it does not handle them well, and the
> necessary features are not implemented because it is not used in
> such situations.

Linux is used under very heavy load, actually.

> And now that we are at it: in 2.0 I found a daemon can be killed by
> the system if it runs out of VM.

Same on any BSD. Once virtual memory is full, any new memory
allocations must fail. It doesn't matter whether that allocation comes
from a user process or a daemon: if there is no more virtual memory,
the process will get a NULL back from malloc. If a daemon dies as a
result of that, the death will happen on any Unix system.

> The problem is: it was a normal user process that had allocated most
> of it, and in addition that daemon could be important enough that it
> is better to kill anything else, so it would be useful to give some
> privilege to root processes here.

No. It's not an issue of the operating system killing processes. It is
an issue of the OS failing a request for new memory, and a process
exit()ing as a result of that failed malloc. The process is voluntarily
exiting, as far as the kernel is concerned.

--Stephen
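[Editor's note: Stephen's distinction — the kernel only fails the
allocation; dying is the program's own choice — can be shown in
miniature. This is an illustrative sketch, not code from the thread;
Python's analogue of a NULL return from malloc is MemoryError, and
`exhausted` is a made-up stand-in allocator for a box that is out of VM.]

```python
def allocate(nbytes, alloc=bytearray):
    """Return a buffer, or None when the allocator reports exhaustion,
    mimicking a C program that checks malloc()'s return value."""
    try:
        return alloc(nbytes)
    except MemoryError:
        return None

def fragile_daemon(nbytes, alloc=bytearray):
    """A daemon that exits on a failed allocation: from the kernel's
    point of view this is a voluntary exit, not an OS kill."""
    buf = allocate(nbytes, alloc)
    if buf is None:
        raise SystemExit(1)
    return buf

def exhausted(_nbytes):
    # stand-in allocator: every request fails, as when VM is full
    raise MemoryError

assert allocate(16) is not None            # normal case
assert allocate(16, exhausted) is None     # robust daemon degrades gracefully
```

A "guaranteed" daemon in jfm2's sense would take the second branch and
shed load; the fragile one turns the failed allocation into its own death.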
* Re: Two naive questions and a suggestion
From: jfm2 @ 1998-11-24 21:44 UTC
To: sct; +Cc: linux-mm

> > The only solution I see is to stop one of them (short of adding
> > memory :) and let the other one make some progress. That is swapping.
>
> No, it is not. That is scheduling. Swapping is a very precise term,
> used for a mechanism by which we suspend a process and stream all of
> its internal state to disk, including page tables and so on. There's
> no reason why we can't do a temporary scheduling trick to deal with
> this in Linux: it's still not true swapping.

Agreed; the important feature is stopping one of the processes when
critically short of memory. Swapping is only a trick for getting more
bandwidth at the expense of pushing a greater amount of process space
in and out of memory, so there is no proof it is faster than letting
other processes steal memory page by page from the now-stopped
process.

> > In 96 I asked for that same feature, gave the same example (same
> > numbers :-), and Alan Cox agreed but told me Linux was not used
> > under heavy loads.
>
> Linux is used under very heavy load, actually.

BSD and Solaris partisans are still boasting about how much better
those systems are at heavy loads. I agree boasting tends to outlive
the situation that originated it.

> > And now that we are at it: in 2.0 I found a daemon can be killed by
> > the system if it runs out of VM.
>
> Same on any BSD. Once virtual memory is full, any new memory
> allocations must fail. It doesn't matter whether that allocation
> comes from a user process or a daemon: if there is no more virtual
> memory, the process will get a NULL back from malloc. If a daemon
> dies as a result of that, the death will happen on any Unix system.

Say the web or database server is deemed important enough that it
should not be killed just because some dimwit is playing with the
GIMP at the console and the GIMP has allocated 80 Megs.

More realistically, it can happen that the X server is killed (-9)
due to the misbehaviour of a user program, and you are trapped with a
useless console. Very difficult to recover from, especially if you
consider that inetd could have been killed too, so no telnetting in.
You can also find half of your daemons are gone: no mail, no
printing, no nothing.

> No. It's not an issue of the operating system killing processes. It
> is an issue of the OS failing a request for new memory, and a
> process exit()ing as a result of that failed malloc. The process is
> voluntarily exiting, as far as the kernel is concerned.

In situations like those above, I would like Linux to support a
concept of guaranteed processes: if VM is exhausted by one of them,
try to get memory by killing non-guaranteed processes, and only kill
the original one if all remaining survivors are guaranteed ones. It
would be better for mission-critical tasks.

-- 
Jean Francois Martinez

Project Independence: Linux for the Masses
http://www.independence.seul.org
* Re: Two naive questions and a suggestion
From: Rik van Riel @ 1998-11-25 6:41 UTC
To: jfm2; +Cc: sct, linux-mm

On 24 Nov 1998 jfm2@club-internet.fr wrote:

> Agreed; the important feature is stopping one of the processes when
> critically short of memory. Swapping is only a trick for getting
> more bandwidth at the expense of pushing a greater amount of process
> space in and out of memory, so there is no proof it is faster than
> letting other processes steal memory page by page from the
> now-stopped process.

When the mythical swapin readahead is merged, we can gain some
ungodly amount of speed almost for free. I don't know if we'll ever
implement the scheduling tricks...

I do have a few ideas for the scheduling stuff though. With RSS
limits (we can safely implement those when the swap cache trick is
implemented) and the keeping of a few statistics, we will be able to
implement the swapping tricks. Without swapin readahead, we'll be
unable to implement them properly, however. :(

> > > And now that we are at it: in 2.0 I found a daemon can be killed
> > > by the system if it runs out of VM.
> >
> > Same on any BSD.
>
> Say the web or database server is deemed important enough that it
> should not be killed just because some dimwit is playing with the
> GIMP at the console and the GIMP has allocated 80 Megs.

It sounds remarkably like you want my Out Of Memory killer patch.
This patch tries to remove the randomness in killing a process when
you're OOM by carefully selecting a process based on a lot of
different factors (size, age, CPU used, suid, root, IOPL, etc.). It
needs to be cleaned up, ported to 2.1.129 and improved a little bit
though... After that it should be ready for inclusion in the kernel.

cheers,

Rik
-- 
slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.       H.H.vanRiel@phys.uu.nl  |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
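[Editor's note: the kind of victim selection Rik describes — weighing
size, age, CPU use and privilege instead of killing at random — can be
sketched as a scoring function. The factors come from his list; the
particular weights and formula here are illustrative, not taken from the
actual patch.]

```python
def badness(vm_kb, runtime_sec, cpu_sec, is_root, has_iopl):
    """Higher score = better OOM-kill candidate. Big, young, CPU-cheap,
    unprivileged processes score highest; root and hardware-touching
    (IOPL) processes are protected by dividing their score down."""
    score = float(vm_kb)                 # memory freed by killing it
    score /= (1 + runtime_sec) ** 0.5    # long-running work is more valuable
    score /= (1 + cpu_sec) ** 0.25       # invested CPU time is more valuable
    if is_root:
        score /= 4                       # privileged daemons are protected
    if has_iopl:
        score /= 4                       # may be talking to hardware directly
    return score

# A fresh 80 Meg image editor outranks a tiny long-lived root daemon,
# which is exactly the GIMP-vs-inetd situation from the thread.
gimp  = badness(80_000, runtime_sec=60,     cpu_sec=5,   is_root=False, has_iopl=False)
inetd = badness(2_000,  runtime_sec=86_400, cpu_sec=100, is_root=True,  has_iopl=False)
assert gimp > inetd
```

The OOM killer then simply kills the process with the highest score,
which makes the choice deterministic instead of random.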
* Re: Two naive questions and a suggestion
From: Stephen C. Tweedie @ 1998-11-25 12:27 UTC
To: Rik van Riel; +Cc: jfm2, sct, linux-mm

Hi,

On Wed, 25 Nov 1998 07:41:41 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> When the mythical swapin readahead is merged, we can gain some
> ungodly amount of speed almost for free. I don't know if we'll ever
> implement the scheduling tricks...

Agreed: usage patterns these days are very different. We simply don't
expect to run massive parallel processes whose combined working sets
exceed physical memory these days. Making sure that we don't thrash
to death is still an important point, but we can achieve that by
guaranteeing processes a minimum RSS quota (so that only those
processes exceeding that quota compete for the remaining physical
memory).

> I do have a few ideas for the scheduling stuff though. With RSS
> limits (we can safely implement those when the swap cache trick is
> implemented) and the keeping of a few statistics, we will be able to
> implement the swapping tricks.

Rik, get real: when will you work out how the VM works? We can safely
implement RSS limits *today*, and have been able to since 2.1.89.
<grin> It's just a matter of doing a vmscan on the current process
whenever it exceeds its own RSS limit. The mechanism is all there.

> Without swapin readahead, we'll be unable to implement them
> properly, however. :(

No, we don't need readahead (although the swap cache itself already
includes all of the necessary mechanism: rw_swap_page(READ, nowait)
will do it). The only extra functionality we might want is extra
control over when we write swap-cached pages: right now, all dirty
pages need to be in the RSS, and we write them to disk when we evict
them to the swap cache. Thus, only clean pages can be in the swap
cache. If we want to support processes with a dirty working set
larger than their RSS, we'd need to extend this.

--Stephen
* Re: Two naive questions and a suggestion
From: Rik van Riel @ 1998-11-25 13:08 UTC
To: Stephen C. Tweedie; +Cc: jfm2, linux-mm

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:

> Rik, get real: when will you work out how the VM works? We can
> safely implement RSS limits *today*, and have been able to since
> 2.1.89. <grin> It's just a matter of doing a vmscan on the current
> process whenever it exceeds its own RSS limit. The mechanism is all
> there.

If we tried to implement RSS limits now, it would mean that the large
task(s) we limited would be continuously thrashing and would keep the
I/O subsystem busy; this impacts the rest of the system a lot.

With the new scheme, we can implement the RSS limit, but the truly
busily used pages would simply stay inside the swap cache, freeing up
I/O bandwidth (at the cost of some memory) for the rest of the
system.

I think that with the new scheme the balancing will be so much better
that we can implement RSS limits without a negative impact on the
rest of the system. With the current VM system, RSS limits would
probably hamper the performance the rest of the system gets.

We might want to perform the scheduling tricks for over-RSS processes,
however. Without swap readahead I really don't see any way we could
run them without holding back the rest of the system too much...

cheers,

Rik
-- 
slowly getting used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.       H.H.vanRiel@phys.uu.nl  |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+
* Re: Two naive questions and a suggestion 1998-11-25 13:08 ` Rik van Riel @ 1998-11-25 14:46 ` Stephen C. Tweedie 1998-11-25 16:47 ` Rik van Riel 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-25 14:46 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, linux-mm Hi, On Wed, 25 Nov 1998 14:08:47 +0100 (CET), Rik van Riel <H.H.vanRiel@phys.uu.nl> said: > On Wed, 25 Nov 1998, Stephen C. Tweedie wrote: >> Rick, get real: when will you work out how the VM works? We can >> safely implement RSS limits *today*, and have been able to since >> 2.1.89. <grin> > If we tried to implement RSS limits now, it would mean that > the large task(s) we limited would be continuously thrashing > and keep the I/O subsystem busy -- this impacts the rest of > the system a lot. WRONG. We can very very easily unlink pages from a process's pte (hence reducing the process's RSS) without removing that page from memory. It's trivial. We do it all the time. We can do it both for memory-mapped files and for anonymous pages. In the latest 2.1.130 prepatch, this is in fact the *preferred* way of swapping. This mechanism is fundamental to the way we maintain page sharing of swapped COW pages. The only thing we cannot do is unlink dirty pages (for swap, that means pages which have been modified since we last paged the swap back into memory). We have to write them back before we unlink. That does not mean that we have to throw the data away: as long as the copy on disk is uptodate, we can have as much of a process's address space as we want in the page cache or swap cache without it being mapped in the process' address space and without it counting as task RSS. Today, such an RSS limit would NOT thrash the IO: it would just cause minor page faults as we relink the cached page back into the page tables. All of that functionality exists today. Rik, you should probably try to work out how try_to_swap_out() actually works one of these days. 
You'll find it does a lot of neat stuff you seem to be unaware of!

We are really a lot closer to having a proper unified page handling mechanism than you think. The handling of dirty pages is pretty much the only missing part of the mechanism right now. Even that is not necessarily a bad thing: there are good performance reasons why we might want the swap cache to contain only clean pages: for example, it makes it easier to guarantee that those pages can be reclaimed for another use at short notice.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 14:46 ` Stephen C. Tweedie @ 1998-11-25 16:47 ` Rik van Riel 1998-11-25 21:02 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-25 16:47 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:
> On Wed, 25 Nov 1998 14:08:47 +0100 (CET), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
>
> > If we tried to implement RSS limits now, it would mean that
> > the large task(s) we limited would be continuously thrashing
> > and keep the I/O subsystem busy -- this impacts the rest of
> > the system a lot.
>
> WRONG. We can very very easily unlink pages from a process's pte
> (hence reducing the process's RSS) without removing that page from
> memory. It's trivial. We do it all the time. Rik, you should
> probably try to work out how try_to_swap_out() actually works one of
> these days.

I just looked in mm/vmscan.c of kernel version 2.1.129, and lines 173, 191 and 205 feature a prominent:

	free_page_and_swap_cache(page);

> We are really a lot closer to having a proper unified page handling
> mechanism than you think. The handling of dirty pages is pretty
> much the only missing part of the mechanism right now.

I know how close we are. I think I posted an assessment of what to do and what to leave yesterday :)) The most essential things can probably be coded in a day or two, if we want to.

Oh, one question. Can we attach a swap page to the swap cache while there's no program using it? This way we can implement a very primitive swapin readahead right now, improving the algorithm as we go along...

> Even that is not necessarily a bad thing: there are good performance
> reasons why we might want the swap cache to contain only clean
> pages: for example, it makes it easier to guarantee that those
> pages can be reclaimed for another use at short notice.

IMHO it would be a big loss to have dirty pages in the swap cache.
Writing out swap pages is cheap since we do proper I/O clustering; not writing them out immediately would result in them being written out in whatever order shrink_mmap() comes across them, which is suboptimal for when we later want to read the pages back.

Besides, having a large/huge clean swap cache means that we can very easily free up memory when we need to; this is essential for NFS buffers, networking stuff, etc. If we keep a quota of 20% of memory in buffers and unmapped cache, we can also do away with a buffer for the 8 and 16kB areas. We can always find some contiguous area in the swap/page cache that we can free...

cheers,

Rik
* Re: Two naive questions and a suggestion 1998-11-25 16:47 ` Rik van Riel @ 1998-11-25 21:02 ` Stephen C. Tweedie 1998-11-25 21:21 ` Rik van Riel 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-25 21:02 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, Linux MM

Hi,

On Wed, 25 Nov 1998 17:47:18 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

>> WRONG. We can very very easily unlink pages from a process's pte
>> (hence reducing the process's RSS) without removing that page from
>> memory. It's trivial. We do it all the time. Rik, you should
>> probably try to work out how try_to_swap_out() actually works one of
>> these days.

> I just looked in mm/vmscan.c of kernel version 2.1.129, and
> line 173, 191 and 205 feature a prominent:
> free_page_and_swap_cache(page);

It is not there in 2.1.130-pre3, however. :) That misses the point, though. The point is that it is trivial to remove these mappings without freeing the swap cache, and the code you point to confirms this: vmscan actually has to go to _extra_ trouble to free the underlying cache if that is wanted (the shared-page case is the same, hence the unuse_page() call at the end of try_to_swap_out(), also removed in 2.1.130-3). The default action of free_page() alone removes the mapping but not the cache entry, so the functionality of leaving the cache present is already there.

> Oh, one question. Can we attach a swap page to the swap cache
> while there's no program using it? This way we can implement
> a very primitive swapin readahead right now, improving the
> algorithm as we go along...

Yes, rw_swap_page(READ, nowait) does exactly that: it primes the swap cache asynchronously but does not map the page anywhere. It should be completely safe right now: the normal swap read is just a special case of this.

> IMHO it would be a big loss to have dirty pages in the swap
> cache. Writing out swap pages is cheap since we do proper
> I/O clustering ...
> Besides, having a large/huge clean swap cache means that we
> can very easily free up memory when we need to, this is
> essential for NFS buffers, networking stuff, etc.

Yep, absolutely: agreed on both counts. This is exactly how 2.1.130-3 works!

> If we keep a quota of 20% of memory in buffers and unmapped
> cache, we can also do away with a buffer for the 8 and 16kB
> area's. We can always find some contiguous area in swap/page
> cache that we can free...

That will kill performance if you have a large simulation which has a legitimate need to keep 90% of physical memory full of anonymous pages. I'd rather do without that 20% magic limit if we can. The only special limit we really need is to make sure that kswapd stays far enough ahead of interrupt-time memory load that the free list doesn't empty.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 21:02 ` Stephen C. Tweedie @ 1998-11-25 21:21 ` Rik van Riel 1998-11-25 22:29 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-25 21:21 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:
> <H.H.vanRiel@phys.uu.nl> said:
>
> It is not there in 2.1.130-pre3, however. :) That misses the point,
> though. The point is that it is trivial to remove these mappings
> without freeing the swap cache, and the code you point to confirms this:

OK, point taken. {:-)

> > Oh, one question. Can we attach a swap page to the swap cache
> > while there's no program using it? This way we can implement
> > a very primitive swapin readahead right now, improving the
> > algorithm as we go along...
>
> Yes, rw_swap_page(READ, nowait) does exactly that: it primes the
> swap cache asynchronously but does not map it anywhere. It should
> be completely safe right now: the normal swap read is just a special
> case of this.

Then I think it's time to do swapin readahead on the entire SWAP_CLUSTER (or just from the point where we faulted) on a dumb-and-dumber basis, awaiting a good readahead scheme. Of course it will need to be sysctl tuneable :)

The reason I propose this dumb scheme is that we can read one SWAP_CLUSTER_MAX-sized chunk in one sweep without having to move the disk's head... Plus Linus might actually accept a change like this :)

> > If we keep a quota of 20% of memory in buffers and unmapped
> > cache, we can also do away with a buffer for the 8 and 16kB
> > area's. We can always find some contiguous area in swap/page
> > cache that we can free...
>
> That will kill performance if you have a large simulation which has a
> legitimate need to keep 90% of physical memory full of anonymous pages.
> I'd rather do without that 20% magic limit if we can.
> The only special limit we really need is to make sure that kswapd
> keeps far enough in advance of interrupt memory load that the free
> list doesn't empty.

OK, then we should let the kernel calculate the limit itself based on the number of soft faults, swapout pressure, memory pressure and process priority.

We can also use stats like this to temporarily suspend very large processes when we've got multiple processes with (p->vm_mm->rss + p->dec_flt) > RSS_THRASH_LIMIT, where p->dec_flt is a floating average and the RSS limit is calculated dynamically as well... I know this could be a slightly expensive trick, but we can easily make that sysctl tuneable as well.

Rik
* Re: Two naive questions and a suggestion 1998-11-25 21:21 ` Rik van Riel @ 1998-11-25 22:29 ` Stephen C. Tweedie 1998-11-26 7:30 ` Rik van Riel 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-25 22:29 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, Linux MM

Hi,

On Wed, 25 Nov 1998 22:21:43 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> Then I think it's time to do swapin readahead on the
> entire SWAP_CLUSTER (or just from the point where we
> faulted) on a dumb-and-dumber basis, awaiting a good
> readahead scheme. Of course it will need to be sysctl
> tuneable :)

Yep, although I'm not sure that reading a whole SWAP_CLUSTER would be a good idea. Contrary to popular belief, disks are still quite slow at sequential data transfer. Non-sequential I/O is obviously enormously slower still, but doing readahead on a whole SWAP_CLUSTER (128k) is definitely _not_ free. It will increase the VM latency enormously if we start reading in a lot of unnecessary data. On the other hand, swap readahead is sufficiently trivial to code that experimenting with good values is not hard. Normal pagein already does a one-block readahead, and doing this in swap would be pretty easy.

The biggest problem with swap readahead is that there is very little guarantee that the next page in any one swap partition is related to the current page: the way we select pages for swapout makes it quite likely that bits of different processes may intermix, and swap partitions can also get fragmented over time. To really benefit from swap readahead, we would also want improved swap clustering which tried to keep a logical association between adjacent physical pages, in the same way that the filesystem does. Right now, the swap clustering is great for output performance but doesn't necessarily lead to disk layouts which are good for swapping.
> Plus Linus might actually accept a change like this :)

If it is tunable, then it is so easy that he might well, yes.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 22:29 ` Stephen C. Tweedie @ 1998-11-26 7:30 ` Rik van Riel 1998-11-26 12:48 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-26 7:30 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998, Stephen C. Tweedie wrote:
> On Wed, 25 Nov 1998 22:21:43 +0100 (CET), Rik van Riel
> <H.H.vanRiel@phys.uu.nl> said:
>
> > Then I think it's time to do swapin readahead on the
> > entire SWAP_CLUSTER
>
> Yep, although I'm not sure that reading a whole SWAP_CLUSTER would
> be a good idea. Contrary to popular belief, disks are still quite
> slow at sequential data transfer.

I have a better idea for a default limit:

	swap_stream.max = num_physpages >> 9;
	if (swap_stream.max > SWAP_CLUSTER_MAX)
		swap_stream.max = SWAP_CLUSTER_MAX;
	swap_stream.enabled = 0;

> Non-sequential IO is obviously enormously slower still, but doing
> readahead on a whole SWAP_CLUSTER (128k) is definitely _not_ free.
> It will increase the VM latency enormously if we start reading in a
> lot of unnecessary data.

We could simply increase the readahead if we were more than 50% successful (i.e. 80% of swap requests can be satisfied from the swap cache) and decrease it if we drop below 40% (or less than 50% of swap requests can be serviced from the swap cache).

One thing that helps us enormously is the way kswapd pages out stuff. If pages (within a process) have the same kind of usage pattern and are near each other, they will be swapped out together. Now since they have the same usage pattern, it is likely that they are needed together as well. Especially without page aging we are likely to store adjacent pages next to each other in swap.
Later on (when the simple code has been proven to work and Linus doesn't pay attention) we can introduce a really intelligent swapin readahead mechanism that will make Linux rock :) It's just that we need something simple now because Linus wants the kernel to stay relatively unchanged at the moment...

cheers,

Rik
* Re: Two naive questions and a suggestion 1998-11-26 7:30 ` Rik van Riel @ 1998-11-26 12:48 ` Stephen C. Tweedie 0 siblings, 0 replies; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-26 12:48 UTC (permalink / raw) To: Rik van Riel; +Cc: Stephen C. Tweedie, jfm2, Linux MM

Hi,

On Thu, 26 Nov 1998 08:30:20 +0100 (CET), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> We could simply increase the readahead if we were more
> than 50% succesful (ie. 80% of swap requests can be
> satisfied from the swap cache) and decrease it if we
> drop below 40% (or less than 50% of swap requests can
> be serviced from the swap cache).

Yes --- do a patch, show us some benchmarks! We could make a big difference with this.

--Stephen
* Re: Two naive questions and a suggestion 1998-11-25 6:41 ` Rik van Riel 1998-11-25 12:27 ` Stephen C. Tweedie @ 1998-11-25 20:01 ` jfm2 1998-11-26 7:16 ` Rik van Riel 1 sibling, 1 reply; 29+ messages in thread From: jfm2 @ 1998-11-25 20:01 UTC (permalink / raw) To: H.H.vanRiel; +Cc: jfm2, sct, linux-mm

> Without swapin readahead, we'll be unable to implement them
> properly however :(

> > > > And now we are at it: in 2.0 I found a daemon can be killed by the
> > > > system if it runs out of VM.
> > >
> > > Same on any BSD.
> >
> > Say the Web or database server can be deemed important enough for it
> > not to be killed just because some dimwit is playing with the GIMP
> > at the console and the GIMP has allocated 80 Megs.
>
> It sounds remarkably like you want my Out Of Memory killer
> patch. This patch tries to remove the randomness in killing
> a process when you're OOM by carefully selecting a process
> based on a lot of different factors (size, age, CPU used,
> suid, root, IOPL, etc).
>
> It needs to be cleaned up, ported to 2.1.129 and improved
> a little bit though... After that it should be ready for
> inclusion in the kernel.

Your scheme is (IMHO) far too complicated and (IMHO) falls short. The problem is that the kernel has no way to know which process in the box is really important. For instance you can have a database server running as a normal user that is far more important than the X server (setuid root), whose only real goal is to provide a user-friendly UI for administering the database.

Why not simply allow a root-owned process to declare itself (and the program it will exec into) as "guaranteed"?
Only a human can know what is important and what is unimportant in a box, so it should be a human who, by way of starting a program through a "guaranteer", has the final word on what should be protected.

Allow an option for having this privilege extended to descendants of the process, since some database programs start special daemons for other tasks and will not run without them. Or take a box used as a mail server running qmail: qmail starts sub-servers, each one for a different task.

Of course this is only a suggestion for a mechanism, but the important thing is allowing a human to have the final word.

-- Jean Francois Martinez Project Independence: Linux for the Masses http://www.independence.seul.org
* Re: Two naive questions and a suggestion 1998-11-25 20:01 ` jfm2 @ 1998-11-26 7:16 ` Rik van Riel 1998-11-26 19:59 ` jfm2 0 siblings, 1 reply; 29+ messages in thread From: Rik van Riel @ 1998-11-26 7:16 UTC (permalink / raw) To: jfm2; +Cc: Stephen C. Tweedie, Linux MM

On 25 Nov 1998 jfm2@club-internet.fr wrote:
> > It sounds remarkably like you want my Out Of Memory killer
> > patch. This patch tries to remove the randomness in killing
> > a process when you're OOM by carefully selecting a process
> > based on a lot of different factors (size, age, CPU used,
> > suid, root, IOPL, etc).
>
> Your scheme is (IMHO) far too complicated and (IMHO) falls short.
> The problem is that the kernel has no way to know what is the really
> important process in the box.

In my (and other people's) experience, an educated guess is better than a random kill. Furthermore it is not possible to get out of the OOM situation without killing one or more processes, so we want to limit:
- the number of processes we kill (reducing the chance of killing something important)
- the CPU time 'lost' when we kill something (so we don't have to run that simulation for two weeks again)
- the risk of killing something important and stable; we try to avoid this by giving fewer hitpoints to older processes (which presumably are stable and would take a long time to recreate the state they are in now)
- the amount of work lost -- killing new processes that haven't used much CPU is a way of doing this
- the probability of the machine hanging -- don't kill IOPL programs, and limit the points for old daemons and root/suid stuff

Granted, we can never make a perfect guess. It will be a lot better than a more or less random kill, however.

The large simulation that's taking 70% of your RAM and has run for 2 weeks is the most likely victim under our current scheme, but with my killer code its priority will be far lower than that of a newly-started and exploded GIMP or Netscape...
> Why not simply allow a root-owned process declare itself (and the
> program it will exec into) as "guaranteed"?

If the guaranteed program explodes it will kill the machine. Even for single-purpose machines this will be bad, since it will increase the downtime with a reboot-and-fsck cycle instead of just a program restart.

> Or a box used as a mail server using qmail: qmail starts sub-servers
> each one for a different task.

The children are younger and will be killed first. Starting the master server from init will make sure that it is restarted in the case of a real emergency or fluke.

> Of course this is only a suggestion for a mechanism but the important
> is allowing a human to have the final word.

What? You have a person sitting around keeping an eye on your mailserver 24x7? Usually the most important servers are tucked away in a closet and crash at 03:40 AM when the sysadmin is in bed 20 miles away... The kernel is there to prevent Murphy from taking over :)

cheers,

Rik
* Re: Two naive questions and a suggestion 1998-11-26 7:16 ` Rik van Riel @ 1998-11-26 19:59 ` jfm2 1998-11-27 17:45 ` Stephen C. Tweedie 0 siblings, 1 reply; 29+ messages in thread From: jfm2 @ 1998-11-26 19:59 UTC (permalink / raw) To: H.H.vanRiel; +Cc: jfm2, sct, linux-mm

> On 25 Nov 1998 jfm2@club-internet.fr wrote:
>
> > > It sounds remarkably like you want my Out Of Memory killer
> > > patch. This patch tries to remove the randomness in killing
> > > a process when you're OOM by carefully selecting a process
> > > based on a lot of different factors (size, age, CPU used,
> > > suid, root, IOPL, etc).
> >
> > Your scheme is (IMHO) far too complicated and (IMHO) falls short.
> > The problem is that the kernel has no way to know what is the really
> > important process in the box.
>
> In my (and other people's) experience, an educated guess is
> better than a random kill. Furthermore it is not possible to
> get out of the OOM situation without killing one or more
> processes, so we want to limit:
> - the number of processes we kill (reducing the chance of
>   killing something important)
> - the CPU time 'lost' when we kill something (so we don't
>   have to run that simulation for two weeks again)
> - the risk of killing something important and stable, we
>   try to avoid this by giving less hitpoints to older
>   processes (which presumably are stable and take a long
>   time to 'recreate' the state in which they are now)
> - the amount of work lost -- killing new processes that
>   haven't used much CPU is a way of doing this
> - the probability of the machine hanging -- don't kill
>   IOPL programs and limit the points for old daemons
>   and root/suid stuff
>
> Granted, we can never make a perfect guess. It will be a
> lot better than a more or less random kill, however.
> The large simulation that's taking 70% of your RAM and
> has run for 2 weeks is the most likely victim under our
> current scheme, but with my killer code its priority
> will be far lower than that of a newly-started and exploded
> GIMP or Netscape...

My idea was:
- VM exhausted and the allocating process is a normal process: kill that process.
- VM exhausted and the allocating process is a guaranteed one: kill a non-guaranteed process.
- VM exhausted, the allocating process is guaranteed, but the only remaining processes are guaranteed ones: kill the allocating process.

Of course init is guaranteed.

> > Why not simply allow a root-owned process declare itself (and the
> > program it will exec into) as "guaranteed"?
>
> If the guaranteed program explodes it will kill the machine.
> Even for single-purpose machines this will be bad since it
> will increase the downtime with a reboot&fsck cycle instead
> of just a program restart.

No; see above. The guaranteed program would be killed once the "unimportant" processes have been killed. The goal is not to grant impunity to guaranteed programs, but to protect an important program against possible misbehaviour of other programs: think of a misbehaving process that has allocated all the VM except one page, and then our database server tries to allocate two more pages.

> > Or a box used as a mail server using qmail: qmail starts sub-servers
> > each one for a different task.
>
> The children are younger and will be killed first. Starting
> the master server from init will make sure that it is
> restarted in the case of a real emergency or fluke.
>
> > Of course this is only a suugestion for a mechanism but the important
> > is allowing a human to have the final word.
>
> What? You have a person sitting around keeping an eye on
> your mailserver 24x7? Usually the most important servers
> are tucked away in a closet and crash at 03:40 AM when
> the sysadmin is in bed 20 miles away...

No.
The sysadmin uses emacs at normal hours to edit a file listing the important processes. Now it is up to you to find a scheme by which the sysadmin's wishes are communicated to the kernel. :-)
* Re: Two naive questions and a suggestion 1998-11-26 19:59 ` jfm2 @ 1998-11-27 17:45 ` Stephen C. Tweedie 1998-11-27 21:14 ` jfm2 0 siblings, 1 reply; 29+ messages in thread From: Stephen C. Tweedie @ 1998-11-27 17:45 UTC (permalink / raw) To: jfm2; +Cc: H.H.vanRiel, sct, linux-mm

Hi,

On 26 Nov 1998 19:59:42 -0000, jfm2@club-internet.fr said:

> My idea was:
> - VM exhausted and the allocating process is a normal process:
>   kill that process.
> - VM exhausted and the allocating process is a guaranteed one:
>   kill a non-guaranteed process.
> - VM exhausted, the allocating process is guaranteed, but the only
>   remaining processes are guaranteed ones: kill the allocating process.

But the _whole_ problem is that we do not necessarily go around killing processes. We just fail requests for new allocations. In that case we still have not run out of memory yet, but a daemon may have died. It is simply not possible to guarantee all of the future memory allocations which a process might make!

--Stephen
* Re: Two naive questions and a suggestion 1998-11-27 17:45 ` Stephen C. Tweedie @ 1998-11-27 21:14 ` jfm2 0 siblings, 0 replies; 29+ messages in thread From: jfm2 @ 1998-11-27 21:14 UTC (permalink / raw) To: sct; +Cc: jfm2, H.H.vanRiel, linux-mm

> Date: Fri, 27 Nov 1998 17:45:55 GMT
> From: "Stephen C. Tweedie" <sct@redhat.com>
> Cc: H.H.vanRiel@phys.uu.nl, sct@redhat.com, linux-mm@kvack.org
>
> Hi,
>
> On 26 Nov 1998 19:59:42 -0000, jfm2@club-internet.fr said:
>
> > My idea was:
> > - VM exhausted and the allocating process is a normal process:
> >   kill that process.
> > - VM exhausted and the allocating process is a guaranteed one:
> >   kill a non-guaranteed process.
> > - VM exhausted, the allocating process is guaranteed, but the only
> >   remaining processes are guaranteed ones: kill the allocating process.
>
> But the _whole_ problem is that we do not necessarily go around
> killing processes. We just fail requests for new allocations. In
> that case we still have not run out of memory yet, but a daemon may
> have died. It is simply not possible to guarantee all of the future
> memory allocations which a process might make!

The word "guaranteed" was an unfortunate one. "Protected" would have been better. As a user I feel there are processes more equal than others, and I find it unfortunate when one of them is killed while trying to grow its stack (SIGKILL, so no recovering) because it was unable to do so due to the misbehaviour of an unimportant process. I think they should be protected, and it is the sysadmin, not a heuristic, who should define what is important and what is not in a box. We cannot guarantee the success of a memory allocation, but we can make mission-critical software more robust.

But if you think the idea is bad we can kill this thread.
* Re: Two naive questions and a suggestion 1998-11-24 21:44 ` jfm2 1998-11-25 6:41 ` Rik van Riel @ 1998-11-25 14:48 ` Eric W. Biederman 1998-11-25 20:29 ` jfm2 1998-11-25 16:31 ` ralf 2 siblings, 1 reply; 29+ messages in thread From: Eric W. Biederman @ 1998-11-25 14:48 UTC (permalink / raw) To: jfm2; +Cc: sct, linux-mm

>>>>> "jfm2" == jfm2 <jfm2@club-internet.fr> writes:

jfm2> Say the Web or database server can be deemed important enough for it
jfm2> not to be killed just because some dimwit is playing with the GIMP
jfm2> at the console and the GIMP has allocated 80 Megs.

jfm2> More realistically, it can happen that the X server is killed
jfm2> (-9) due to the misbehaviour of a user program and you get
jfm2> trapped with a useless console. Very difficult to recover, especially
jfm2> if you consider inetd could have been killed too, so no telnetting.

jfm2> You can also find half of your daemons are gone. That is no mail, no
jfm2> printing, no nothing.

init is never killed. It won't and can't be killed. init should be configured to restart all of your important daemons if they go down. Currently most Unix systems (I don't think it's Linux-specific) are misconfigured so that they don't automatically restart their important daemons when those go down.

jfm2> In situations like those above I would like Linux to support a concept
jfm2> like guaranteed processes: if VM is exhausted by one of them, then try
jfm2> to get memory by killing non-guaranteed processes, and only kill the
jfm2> original one if all remaining survivors are guaranteed ones.
jfm2> It would be better for mission-critical tasks.

Some. But it would be simpler and much healthier for tasks that can be down for a little while to have init restart the processes after they go down. That allows for other cases where an important system daemon goes down, is more robust, and doesn't require kernel changes.

Eric
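The init-based restart Eric describes is driven by "respawn" entries in /etc/inittab under SysV init. The entry ids, runlevels and daemon paths below are examples only; the one real constraint is that a respawned program must stay in the foreground (not daemonize), or init will see it "exit" immediately and respawn-loop:

```
# /etc/inittab fragment (SysV init) -- illustrative entries only.
# Format: id:runlevels:action:command
# "respawn" restarts the command whenever it exits.
db:2345:respawn:/usr/local/bin/dbserver -nodaemon
lp:2345:respawn:/usr/sbin/lpd -F
```

After editing, `telinit q` (or `kill -HUP 1`) makes init re-read the file without a reboot.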
* Re: Two naive questions and a suggestion 1998-11-25 14:48 ` Eric W. Biederman @ 1998-11-25 20:29 ` jfm2 0 siblings, 0 replies; 29+ messages in thread From: jfm2 @ 1998-11-25 20:29 UTC (permalink / raw) To: ebiederm+eric; +Cc: jfm2, sct, linux-mm

> >>>>> "jfm2" == jfm2 <jfm2@club-internet.fr> writes:
>
> jfm2> Say the Web or database server can be deemed important enough for it
> jfm2> not being killed just because some dim witt is playing with the GIMP
> jfm2> at the console and the GIMP has allocated 80 Megs.
>
> jfm2> More reallistically, it can happen that the X server is killed
> jfm2> (-9) due to the misbeahviour of a user program and you get
> jfm2> trapped with a useless console. Very diificult to recover. Specially
> jfm2> if you consider inetd could have been killed too, so no telnetting.
>
> jfm2> You can also find half of your daemons, are gone. That is no mail, no
> jfm2> printing, no nothing.
>
> initd is never killed. Won't & can't be killed.
> initd should be configured to restart all of your important daemons if
> they go down.

This does not solve the problem. To begin with, after an unclean shutdown a database server spends time rolling back uncommitted transactions and possibly writing some committed ones to the database from its journals. Users could prefer a database that doesn't go down in the first place.

Second: the 80 Megs GIMP is still there, so when init restarts the database, the database tries to allocate memory and crashes again.

Third: a process can crash because it is misconfigured or a file is corrupted. And crash again if you restart it. It is not init's job to do things like try five times and then use a pager interface to send a message to the admin in case there is a sixth crash.

It could be argued that "guaranteed" processes are not a good idea, but using init is not the way to address the problem.
--
Jean Francois Martinez
Project Independence: Linux for the Masses
http://www.independence.seul.org

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Two naive questions and a suggestion
  1998-11-24 21:44 ` jfm2
  1998-11-25  6:41 ` Rik van Riel
  1998-11-25 14:48 ` Eric W. Biederman
@ 1998-11-25 16:31   ` ralf
  1998-11-26 12:18   ` Rik van Riel
  2 siblings, 1 reply; 29+ messages in thread
From: ralf @ 1998-11-25 16:31 UTC (permalink / raw)
  To: jfm2, sct; +Cc: linux-mm

On Tue, Nov 24, 1998 at 09:44:32PM -0000, jfm2@club-internet.fr wrote:

> In situations like those above I would like Linux to support a concept
> like guaranteed processes: if VM is exhausted by one of them, then try
> to get memory by killing non-guaranteed processes, and only kill the
> original one if all remaining survivors are guaranteed ones.
> It would be better for mission-critical tasks.

A long time ago I suggested making it configurable whether or not a process
gets memory which might be overcommitted. That leaves malloc(x) == NULL to
deal with, and that's a userland problem anyway.

  Ralf

^ permalink raw reply [flat|nested] 29+ messages in thread
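[Editorial note: "malloc(x) == NULL is a userland problem" means that under strict, non-overcommitted accounting an allocation can fail up front, so the application needs a policy for that case. A minimal C sketch; `alloc_with_fallback` is a hypothetical helper, not anything from the thread:]

```c
#include <assert.h>
#include <stdlib.h>

/* With overcommit disabled for this process, malloc() can really
 * return NULL instead of the kernel killing something later, so
 * the caller must handle failure.  One possible policy: try the
 * preferred size, then degrade to a smaller buffer before giving up. */
static void *alloc_with_fallback(size_t want, size_t fallback, size_t *got)
{
    void *p = malloc(want);
    if (p != NULL) {
        *got = want;
        return p;
    }
    p = malloc(fallback);           /* degrade gracefully */
    *got = (p != NULL) ? fallback : 0;
    return p;
}
```

The point is that the kernel-side policy (reserve real backing store at allocation time) only works if userland actually checks the return value; a program that assumes malloc() never fails gets no benefit from the guarantee.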
* Re: Two naive questions and a suggestion
  1998-11-25 16:31 ` ralf
@ 1998-11-26 12:18   ` Rik van Riel
  0 siblings, 0 replies; 29+ messages in thread
From: Rik van Riel @ 1998-11-26 12:18 UTC (permalink / raw)
  To: ralf; +Cc: jfm2, Linux MM

On Wed, 25 Nov 1998 ralf@uni-koblenz.de wrote:
> On Tue, Nov 24, 1998 at 09:44:32PM -0000, jfm2@club-internet.fr wrote:
>
> > In situations like those above I would like Linux to support a concept
> > like guaranteed processes: if VM is exhausted by one of them, then try
> > to get memory by killing non-guaranteed processes, and only kill the
> > original one if all remaining survivors are guaranteed ones.
> > It would be better for mission-critical tasks.
>
> A long time ago I suggested making it configurable whether or not a
> process gets memory which might be overcommitted. That leaves
> malloc(x) == NULL to deal with, and that's a userland problem anyway.

Then what would you do when your 250MB non-overcommitting program needs
to do a fork() in order to call /usr/bin/lpr? Install an extra 250MB of
swap? I don't think so :) These are the situations where sane people
want overcommit.

regards,

Rik -- who actually has 250MB of extra swap...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

^ permalink raw reply [flat|nested] 29+ messages in thread
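[Editorial note: the fork()-before-exec objection is that between fork() and exec() the child nominally owns a copy of the parent's entire address space; with strict no-overcommit accounting the kernel would have to reserve the parent's full 250MB again, even though exec() discards the copy an instant later. With overcommit, the copy-on-write pages are never charged. A minimal C sketch; `run_child` is a hypothetical helper standing in for the "call /usr/bin/lpr" step:]

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork()+exec(): under strict accounting the child's copy-on-write
 * image of the (possibly huge) parent must be fully reserved here,
 * even though execlp() replaces it a moment later.  Overcommit makes
 * this transient duplication free. */
static int run_child(const char *cmd)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                       /* fork failed, nothing reserved */
    if (pid == 0) {
        execlp(cmd, cmd, (char *)NULL);  /* child: replace the image */
        _exit(127);                      /* only reached if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Historically this transient-duplication cost is exactly what vfork() was invented to sidestep: the child borrows the parent's address space until it calls exec().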
end of thread, other threads:[~1998-11-27 2:08 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-11-19  0:20 Two naive questions and a suggestion jfm2
1998-11-19 20:05 ` Rik van Riel
1998-11-20  1:25 ` jfm2
1998-11-20 15:31 ` Eric W. Biederman
1998-11-23 18:08 ` Stephen C. Tweedie
1998-11-23 20:45 ` jfm2
1998-11-23 21:59 ` jfm2
1998-11-24  1:21 ` Vladimir Dergachev
1998-11-24 11:17 ` Stephen C. Tweedie
1998-11-24 21:44 ` jfm2
1998-11-25  6:41 ` Rik van Riel
1998-11-25 12:27 ` Stephen C. Tweedie
1998-11-25 13:08 ` Rik van Riel
1998-11-25 14:46 ` Stephen C. Tweedie
1998-11-25 16:47 ` Rik van Riel
1998-11-25 21:02 ` Stephen C. Tweedie
1998-11-25 21:21 ` Rik van Riel
1998-11-25 22:29 ` Stephen C. Tweedie
1998-11-26  7:30 ` Rik van Riel
1998-11-26 12:48 ` Stephen C. Tweedie
1998-11-25 20:01 ` jfm2
1998-11-26  7:16 ` Rik van Riel
1998-11-26 19:59 ` jfm2
1998-11-27 17:45 ` Stephen C. Tweedie
1998-11-27 21:14 ` jfm2
1998-11-25 14:48 ` Eric W. Biederman
1998-11-25 20:29 ` jfm2
1998-11-25 16:31 ` ralf
1998-11-26 12:18 ` Rik van Riel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox