linux-mm.kvack.org archive mirror
* [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-28  8:58 KAMEZAWA Hiroyuki
  2009-10-28  9:15 ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28  8:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, hugh.dickins, aarcange, vedran.furac,
	kosaki.motohiro

I may add more tweaks on top of this, but this patch should be a good,
simple starting point. It is based on mmotm plus Kosaki's
http://marc.info/?l=linux-kernel&m=125669809305167&w=2

Test results from various environments are appreciated.

Regards.
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

It has been reported that the OOM killer kills Gnome/KDE first...
And yes, we can reproduce it easily.

Currently, the oom-killer uses mm->total_vm as its base value. But in recent
applications, there is a big gap between VM size and RSS size, because
  - Applications link against many dynamic libraries. (Gnome, KDE, etc...)
  - Applications may allocate large VM areas but use only a small part of them.
    (Java and multi-threaded applications have this tendency because
     of the default stack size.)

I think using mm->total_vm as the score for the oom-killer is not good.
For the same reason, memory overcommit cannot work as expected.
(In other words, if we depend on total_vm, using overcommit more
 aggressively is a good choice.)

This patch uses mm->anon_rss/file_rss as the base value for calculating badness.

The following shows the change in OOM score (badness) in an environment with
1.6G of memory plus memory eaters (500M & 1G).

Top 10 badness scores (the highest one is the first candidate to be killed):
Before
badness program
91228	gnome-settings-
94210	clock-applet
103202	mixer_applet2
106563	tomboy
112947	gnome-terminal
128944	mmap              <----------- 500M malloc
129332	nautilus
215476	bash              <----------- parent of 2 mallocs.
256944	mmap              <----------- 1G malloc
423586	gnome-session

After
badness program
1911	mixer_applet2
1955	clock-applet
1986	xinit
1989	gnome-session
2293	nautilus
2955	gnome-terminal
4113	tomboy
104163	mmap             <----------- 500M malloc.
168577	bash             <----------- parent of 2 mallocs
232375	mmap             <----------- 1G malloc

Seems good to me.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/oom_kill.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
 	 */
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
-		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
+		if (child->mm != mm && child->mm) {
+			unsigned long cpoints;
+			cpoints = get_mm_counter(child->mm, anon_rss);
+				  + get_mm_counter(child->mm, file_rss);
+			points += cpoints/2 + 1;
+		}
 		task_unlock(child);
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-28  8:58 [PATCH] oom_kill: use rss value instead of vm size for badness KAMEZAWA Hiroyuki
@ 2009-10-28  9:15 ` David Rientjes
  2009-10-28 11:04   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-10-28  9:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> It's reported that OOM-Killer kills Gnone/KDE at first...
> And yes, we can reproduce it easily.
> 
> Now, oom-killer uses mm->total_vm as its base value. But in recent
> applications, there are a big gap between VM size and RSS size.
> Because
>   - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
>   - Applications may alloc big VM area but use small part of them.
>     (Java, and multi-threaded applications has this tendency because
>      of default-size of stack.)
> 
> I think using mm->total_vm as score for oom-kill is not good.
> By the same reason, overcommit memory can't work as expected.
> (In other words, if we depends on total_vm, using overcommit more positive
>  is a good choice.)
> 
> This patch uses mm->anon_rss/file_rss as base value for calculating badness.
> 

How does this affect the ability of the user to tune the badness score of 
individual threads?  It seems like there will now only be two polarizing 
options: the equivalent of an oom_adj value of +15 or -17.  The score is now 
heavily dependent on the rss, which may be unclear at the time of oom and 
very dynamic.

I think a longer-term solution may rely more on the difference between 
get_mm_hiwater_rss() and get_mm_rss() instead, to know the difference 
between what is resident in RAM at the time of oom compared to what has 
been swapped.  Using this with get_mm_hiwater_vm() would produce a nice 
picture of the pattern of each task's memory consumption.

> Following is changes to OOM score(badness) on an environment with 1.6G memory
> plus memory-eater(500M & 1G).
> 
> Top 10 of badness score. (The highest one is the first candidate to be killed)
> Before
> badness program
> 91228	gnome-settings-
> 94210	clock-applet
> 103202	mixer_applet2
> 106563	tomboy
> 112947	gnome-terminal
> 128944	mmap              <----------- 500M malloc
> 129332	nautilus
> 215476	bash              <----------- parent of 2 mallocs.
> 256944	mmap              <----------- 1G malloc
> 423586	gnome-session
> 
> After
> badness 
> 1911	mixer_applet2
> 1955	clock-applet
> 1986	xinit
> 1989	gnome-session
> 2293	nautilus
> 2955	gnome-terminal
> 4113	tomboy
> 104163	mmap             <----------- 500M malloc.
> 168577	bash             <----------- parent of 2 mallocs
> 232375	mmap             <----------- 1G malloc
> 
> seems good for me. 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/oom_kill.c |   10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> Index: mm-test-kernel/mm/oom_kill.c
> ===================================================================
> --- mm-test-kernel.orig/mm/oom_kill.c
> +++ mm-test-kernel/mm/oom_kill.c
> @@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
>  	/*
>  	 * The memory size of the process is the basis for the badness.
>  	 */
> -	points = mm->total_vm;
> +	points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);
>  
>  	/*
>  	 * After this unlock we can no longer dereference local variable `mm'
> @@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
>  	 */
>  	list_for_each_entry(child, &p->children, sibling) {
>  		task_lock(child);
> -		if (child->mm != mm && child->mm)
> -			points += child->mm->total_vm/2 + 1;
> +		if (child->mm != mm && child->mm) {
> +			unsigned long cpoints;
> +			cpoints = get_mm_counter(child->mm, anon_rss);
> +				  + get_mm_counter(child->mm, file_rss);

That shouldn't compile.

> +			points += cpoints/2 + 1;
> +		}
>  		task_unlock(child);
>  	}
>  

This can all be simplified by just using get_mm_rss(mm) and 
get_mm_rss(child->mm).


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-28  9:15 ` David Rientjes
@ 2009-10-28 11:04   ` KAMEZAWA Hiroyuki
  2009-10-29  1:00     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 11:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

David Rientjes さんは書きました:
> On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> It's reported that OOM-Killer kills Gnone/KDE at first...
>> And yes, we can reproduce it easily.
>>
>> Now, oom-killer uses mm->total_vm as its base value. But in recent
>> applications, there are a big gap between VM size and RSS size.
>> Because
>>   - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
>>   - Applications may alloc big VM area but use small part of them.
>>     (Java, and multi-threaded applications has this tendency because
>>      of default-size of stack.)
>>
>> I think using mm->total_vm as score for oom-kill is not good.
>> By the same reason, overcommit memory can't work as expected.
>> (In other words, if we depends on total_vm, using overcommit more
>> positive
>>  is a good choice.)
>>
>> This patch uses mm->anon_rss/file_rss as base value for calculating
>> badness.
>>
>
> How does this affect the ability of the user to tune the badness score of
> individual threads?
Threads? Or processes?

> It seems like there will now only be two polarizing
> options: the equivalent of an oom_adj value of +15 or -17.  It is now
> heavily dependent on the rss which may be unclear at the time of oom and
> very dynamic.
>
Yes, and that "dynamic" is a good thing.

I think one of the problems with OOM handling now is that users say
"the oom-killer kills processes at random." And yes, that's correct:
mm->total_vm is not related to actual memory usage, so the oom-killer
appears to kill processes at random.

For example, as Vedran showed, even when a memory eater runs, processes are
killed _at random_.

After this patch, the biggest memory user will be the first candidate,
which is reasonable. Users will know "the process was killed because
it uses a lot of memory" (it no longer seems random), and they can then
consider whether or not to use oom_adj for the memory eater.



> I think a longer-term solution may rely more on the difference in
> get_mm_hiwater_rss() and get_mm_rss() instead to know the difference
> between what is resident in RAM at the time of oom compared to what has
> been swaped.  Using this with get_mm_hiwater_vm() would produce a nice
> picture for the pattern of each task's memory consumption.
>
Hmm, I don't want a complicated calculation (it makes oom_adj harder to
use), but yes, bare RSS may be too simple.
Anyway, as I showed, I'll add swap statistics regardless of this patch.
That may add a new hint. For example:
   if (vm_swap_full())
       points += mm->swap_usage;

>> Following is changes to OOM score(badness) on an environment with 1.6G
>> memory
>> plus memory-eater(500M & 1G).
>>
>> Top 10 of badness score. (The highest one is the first candidate to be
>> killed)
>> Before
>> badness program
>> 91228	gnome-settings-
>> 94210	clock-applet
>> 103202	mixer_applet2
>> 106563	tomboy
>> 112947	gnome-terminal
>> 128944	mmap              <----------- 500M malloc
>> 129332	nautilus
>> 215476	bash              <----------- parent of 2 mallocs.
>> 256944	mmap              <----------- 1G malloc
>> 423586	gnome-session
>>
>> After
>> badness
>> 1911	mixer_applet2
>> 1955	clock-applet
>> 1986	xinit
>> 1989	gnome-session
>> 2293	nautilus
>> 2955	gnome-terminal
>> 4113	tomboy
>> 104163	mmap             <----------- 500M malloc.
>> 168577	bash             <----------- parent of 2 mallocs
>> 232375	mmap             <----------- 1G malloc
>>
>> seems good for me.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  mm/oom_kill.c |   10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> Index: mm-test-kernel/mm/oom_kill.c
>> ===================================================================
>> --- mm-test-kernel.orig/mm/oom_kill.c
>> +++ mm-test-kernel/mm/oom_kill.c
>> @@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
>>  	/*
>>  	 * The memory size of the process is the basis for the badness.
>>  	 */
>> -	points = mm->total_vm;
>> +	points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);
>>
>>  	/*
>>  	 * After this unlock we can no longer dereference local variable `mm'
>> @@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
>>  	 */
>>  	list_for_each_entry(child, &p->children, sibling) {
>>  		task_lock(child);
>> -		if (child->mm != mm && child->mm)
>> -			points += child->mm->total_vm/2 + 1;
>> +		if (child->mm != mm && child->mm) {
>> +			unsigned long cpoints;
>> +			cpoints = get_mm_counter(child->mm, anon_rss);
>> +				  + get_mm_counter(child->mm, file_rss);
>
> That shouldn't compile.
Oh, yes...thanks.

>
>> +			points += cpoints/2 + 1;
>> +		}
>>  		task_unlock(child);
>>  	}
>>
>
> This can all be simplified by just using get_mm_rss(mm) and
> get_mm_rss(child->mm).
>
Will use that.

I'll wait until next week to post a new patch.
There is no need to rush.

Thanks,
-Kame




* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-28 11:04   ` KAMEZAWA Hiroyuki
@ 2009-10-29  1:00     ` KAMEZAWA Hiroyuki
  2009-10-29  2:31       ` Minchan Kim
  2009-10-29  8:31       ` David Rientjes
  0 siblings, 2 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-29  1:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

> I'll wait until the next week to post a new patch.
> We don't need rapid way.
> 
I wrote the above... but for my own peace of mind, here is the bug-fixed
version. Sorry for my carelessness. David, thank you for your review.
Regards,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

It has been reported that the OOM killer kills Gnome/KDE first...
And yes, we can reproduce it easily.

Currently, the oom-killer uses mm->total_vm as its base value. But in recent
applications, there is a big gap between VM size and RSS size, because
  - Applications link against many dynamic libraries. (Gnome, KDE, etc...)
  - Applications may allocate large VM areas but use only a small part of them.
    (Java and multi-threaded applications have this tendency because
     of the default stack size.)

I think using mm->total_vm as the score for the oom-killer is not good.
For the same reason, memory overcommit cannot work as expected.
(In other words, if we depend on total_vm, using overcommit more
 aggressively is a good choice.)

This patch uses mm->anon_rss/file_rss as the base value for calculating badness.

The following shows the change in OOM score (badness) in an environment with
1.6G of memory plus memory eaters (500M & 1G).

Top 10 badness scores (the highest one is the first candidate to be killed):
Before
badness program
91228	gnome-settings-
94210	clock-applet
103202	mixer_applet2
106563	tomboy
112947	gnome-terminal
128944	mmap              <----------- 500M malloc
129332	nautilus
215476	bash              <----------- parent of 2 mallocs.
256944	mmap              <----------- 1G malloc
423586	gnome-session

After
badness program
1911	mixer_applet2
1955	clock-applet
1986	xinit
1989	gnome-session
2293	nautilus
2955	gnome-terminal
4113	tomboy
104163	mmap             <----------- 500M malloc.
168577	bash             <----------- parent of 2 mallocs
232375	mmap             <----------- 1G malloc

Seems good to me. Maybe we can tweak this patch further,
but it is a good starting point.

Changelog: 2009/10/29
 - use get_mm_rss() instead of get_mm_counter()

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/oom_kill.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = get_mm_rss(mm);
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -117,7 +117,7 @@ unsigned long badness(struct task_struct
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
 		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
+			points += get_mm_rss(child->mm)/2 + 1;
 		task_unlock(child);
 	}
 



* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  1:00     ` KAMEZAWA Hiroyuki
@ 2009-10-29  2:31       ` Minchan Kim
  2009-10-29  8:31       ` David Rientjes
  1 sibling, 0 replies; 29+ messages in thread
From: Minchan Kim @ 2009-10-29  2:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, Oct 29, 2009 at 10:00 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> I'll wait until the next week to post a new patch.
>> We don't need rapid way.
>>
> I wrote above...but for my mental health, this is bug-fixed version.
> Sorry for my carelessness. David, thank you for your review.
> Regards,
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> It's reported that OOM-Killer kills Gnone/KDE at first...
> And yes, we can reproduce it easily.
>
> Now, oom-killer uses mm->total_vm as its base value. But in recent
> applications, there are a big gap between VM size and RSS size.
> Because
>  - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
>  - Applications may alloc big VM area but use small part of them.
>    (Java, and multi-threaded applications has this tendency because
>     of default-size of stack.)
>
> I think using mm->total_vm as score for oom-kill is not good.
> By the same reason, overcommit memory can't work as expected.
> (In other words, if we depends on total_vm, using overcommit more positive
>  is a good choice.)
>
> This patch uses mm->anon_rss/file_rss as base value for calculating badness.
>
> Following is changes to OOM score(badness) on an environment with 1.6G memory
> plus memory-eater(500M & 1G).
>
> Top 10 of badness score. (The highest one is the first candidate to be killed)
> Before
> badness program
> 91228   gnome-settings-
> 94210   clock-applet
> 103202  mixer_applet2
> 106563  tomboy
> 112947  gnome-terminal
> 128944  mmap              <----------- 500M malloc
> 129332  nautilus
> 215476  bash              <----------- parent of 2 mallocs.
> 256944  mmap              <----------- 1G malloc
> 423586  gnome-session
>
> After
> badness
> 1911    mixer_applet2
> 1955    clock-applet
> 1986    xinit
> 1989    gnome-session
> 2293    nautilus
> 2955    gnome-terminal
> 4113    tomboy
> 104163  mmap             <----------- 500M malloc.
> 168577  bash             <----------- parent of 2 mallocs
> 232375  mmap             <----------- 1G malloc
>
> seems good for me. Maybe we can tweak this patch more,
> but this one will be a good one as a start point.
>
> Changelog: 2009/10/29
>  - use get_mm_rss() instead of get_mm_counter()
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Let's start with this.

-- 
Kind regards,
Minchan Kim


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  1:00     ` KAMEZAWA Hiroyuki
  2009-10-29  2:31       ` Minchan Kim
@ 2009-10-29  8:31       ` David Rientjes
  2009-10-29  8:46         ` KAMEZAWA Hiroyuki
                           ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: David Rientjes @ 2009-10-29  8:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> It's reported that OOM-Killer kills Gnone/KDE at first...
> And yes, we can reproduce it easily.
> 
> Now, oom-killer uses mm->total_vm as its base value. But in recent
> applications, there are a big gap between VM size and RSS size.
> Because
>   - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
>   - Applications may alloc big VM area but use small part of them.
>     (Java, and multi-threaded applications has this tendency because
>      of default-size of stack.)
> 
> I think using mm->total_vm as score for oom-kill is not good.
> By the same reason, overcommit memory can't work as expected.
> (In other words, if we depends on total_vm, using overcommit more positive
>  is a good choice.)
> 
> This patch uses mm->anon_rss/file_rss as base value for calculating badness.
> 
> Following is changes to OOM score(badness) on an environment with 1.6G memory
> plus memory-eater(500M & 1G).
> 
> Top 10 of badness score. (The highest one is the first candidate to be killed)
> Before
> badness program
> 91228	gnome-settings-
> 94210	clock-applet
> 103202	mixer_applet2
> 106563	tomboy
> 112947	gnome-terminal
> 128944	mmap              <----------- 500M malloc
> 129332	nautilus
> 215476	bash              <----------- parent of 2 mallocs.
> 256944	mmap              <----------- 1G malloc
> 423586	gnome-session
> 
> After
> badness 
> 1911	mixer_applet2
> 1955	clock-applet
> 1986	xinit
> 1989	gnome-session
> 2293	nautilus
> 2955	gnome-terminal
> 4113	tomboy
> 104163	mmap             <----------- 500M malloc.
> 168577	bash             <----------- parent of 2 mallocs
> 232375	mmap             <----------- 1G malloc
> 
> seems good for me. Maybe we can tweak this patch more,
> but this one will be a good one as a start point.
> 

This appears to actually prefer X more than total_vm in Vedran's test 
case.  He cited http://pastebin.com/f3f9674a0 in 
http://marc.info/?l=linux-kernel&m=125678557002888.

There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks 
enabled.  It shows the difference between the top total_vm candidates vs. 
the top rss candidates.

total_vm
708945 test
195695 krunner
168881 plasma-desktop
130567 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator
118641 kded4

rss
707878 test
42201 Xorg
13300 icedove-bin
10209 ktorrent
9277 akregator
8878 plasma-desktop
7546 krunner
4532 mysqld

This patch would pick the memory-hogging task, "test", first every time, 
just like the current implementation does.  It would then prefer Xorg, 
icedove-bin, and ktorrent next as a starting point.

Admittedly, there are other heuristics that the oom killer uses to create 
a badness score.  But since this patch only changes the baseline from 
mm->total_vm to get_mm_rss(mm), its behavior in this test case does not 
match the patch description.

The vast majority of the other ooms have identical top 8 candidates:

total_vm
673222 test
195695 krunner
168881 plasma-desktop
130567 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator
121869 firefox-bin

rss
672271 test
42192 Xorg
30763 firefox-bin
13292 icedove-bin
10208 ktorrent
9260 akregator
8859 plasma-desktop
7528 krunner

firefox-bin seems much more preferred in this case than total_vm, but Xorg 
still ranks very high with this patch compared to the current 
implementation.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  8:31       ` David Rientjes
@ 2009-10-29  8:46         ` KAMEZAWA Hiroyuki
  2009-10-29  9:01           ` David Rientjes
  2009-11-01 13:29         ` KOSAKI Motohiro
  2009-11-25 12:44         ` Andrea Arcangeli
  2 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-29  8:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009 01:31:59 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > It's reported that OOM-Killer kills Gnone/KDE at first...
> > And yes, we can reproduce it easily.
> > 
> > Now, oom-killer uses mm->total_vm as its base value. But in recent
> > applications, there are a big gap between VM size and RSS size.
> > Because
> >   - Applications attaches much dynamic libraries. (Gnome, KDE, etc...)
> >   - Applications may alloc big VM area but use small part of them.
> >     (Java, and multi-threaded applications has this tendency because
> >      of default-size of stack.)
> > 
> > I think using mm->total_vm as score for oom-kill is not good.
> > By the same reason, overcommit memory can't work as expected.
> > (In other words, if we depends on total_vm, using overcommit more positive
> >  is a good choice.)
> > 
> > This patch uses mm->anon_rss/file_rss as base value for calculating badness.
> > 
> > Following is changes to OOM score(badness) on an environment with 1.6G memory
> > plus memory-eater(500M & 1G).
> > 
> > Top 10 of badness score. (The highest one is the first candidate to be killed)
> > Before
> > badness program
> > 91228	gnome-settings-
> > 94210	clock-applet
> > 103202	mixer_applet2
> > 106563	tomboy
> > 112947	gnome-terminal
> > 128944	mmap              <----------- 500M malloc
> > 129332	nautilus
> > 215476	bash              <----------- parent of 2 mallocs.
> > 256944	mmap              <----------- 1G malloc
> > 423586	gnome-session
> > 
> > After
> > badness 
> > 1911	mixer_applet2
> > 1955	clock-applet
> > 1986	xinit
> > 1989	gnome-session
> > 2293	nautilus
> > 2955	gnome-terminal
> > 4113	tomboy
> > 104163	mmap             <----------- 500M malloc.
> > 168577	bash             <----------- parent of 2 mallocs
> > 232375	mmap             <----------- 1G malloc
> > 
> > seems good for me. Maybe we can tweak this patch more,
> > but this one will be a good one as a start point.
> > 
> 
> This appears to actually prefer X more than total_vm in Vedran's test 
> case.  He cited http://pastebin.com/f3f9674a0 in 
> http://marc.info/?l=linux-kernel&m=125678557002888.
> 
> There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks 
> enabled.  It shows the difference between the top total_vm candidates vs. 
> the top rss candidates.
> 
> total_vm
> 708945 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 118641 kded4
> 
> rss
> 707878 test
> 42201 Xorg
> 13300 icedove-bin
> 10209 ktorrent
> 9277 akregator
> 8878 plasma-desktop
> 7546 krunner
> 4532 mysqld
> 
> This patch would pick the memory hogging task, "test", first everytime 
> just like the current implementation does.  It would then prefer Xorg, 
> icedove-bin, and ktorrent next as a starting point.
> 
> Admittedly, there are other heuristics that the oom killer uses to create 
> a badness score.  But since this patch is only changing the baseline from 
> mm->total_vm to get_mm_rss(mm), its behavior in this test case do not 
> match the patch description.
> 
Yes, that's why I wrote "as a starting point". There are many environments.
But I'm not sure why ntpd could be the first candidate...
The scores you showed don't include the children's scores, right?

I believe I'll have to remove "adding the children's scores to the parent".
I'm now considering how to implement a fork-bomb detector so that it can
be removed.

> The vast majority of the other ooms have identical top 8 candidates:
> 
> total_vm
> 673222 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 121869 firefox-bin
> 
> rss
> 672271 test
> 42192 Xorg
> 30763 firefox-bin
> 13292 icedove-bin
> 10208 ktorrent
> 9260 akregator
> 8859 plasma-desktop
> 7528 krunner
> 
> firefox-bin seems much more preferred in this case than total_vm, but Xorg 
> still ranks very high with this patch compared to the current 
> implementation.
> 
Yeah, I'm now considering dropping file_rss from the calculation, for
a few reasons:

  - file caches still in memory at OOM time tend to be hard to reclaim.
  - file caches tend to be shared.
  - if the file caches are from shmem, we will never be able to drop them
    when there is no swap or swap is full.

Maybe we'll get a better result.

Regards,
-Kame





* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  8:46         ` KAMEZAWA Hiroyuki
@ 2009-10-29  9:01           ` David Rientjes
  2009-10-29  9:16             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-10-29  9:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > This appears to actually prefer X more than total_vm in Vedran's test 
> > case.  He cited http://pastebin.com/f3f9674a0 in 
> > http://marc.info/?l=linux-kernel&m=125678557002888.
> > 
> > There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks 
> > enabled.  It shows the difference between the top total_vm candidates vs. 
> > the top rss candidates.
> > 
> > total_vm
> > 708945 test
> > 195695 krunner
> > 168881 plasma-desktop
> > 130567 ktorrent
> > 127081 knotify4
> > 125881 icedove-bin
> > 123036 akregator
> > 118641 kded4
> > 
> > rss
> > 707878 test
> > 42201 Xorg
> > 13300 icedove-bin
> > 10209 ktorrent
> > 9277 akregator
> > 8878 plasma-desktop
> > 7546 krunner
> > 4532 mysqld
> > 
> > This patch would pick the memory hogging task, "test", first everytime 
> > just like the current implementation does.  It would then prefer Xorg, 
> > icedove-bin, and ktorrent next as a starting point.
> > 
> > Admittedly, there are other heuristics that the oom killer uses to create 
> > a badness score.  But since this patch is only changing the baseline from 
> > mm->total_vm to get_mm_rss(mm), its behavior in this test case do not 
> > match the patch description.
> > 
> yes, then I wrote "as start point". There are many environments.

And this environment has a particularly bad result.

> But I'm not sure why ntpd can be the first candidate...
> The scores you shown doesn't include children's score, right ?
> 

Right, it's just the get_mm_rss(mm) for each thread shown in the oom dump, 
the same value you've used as the new baseline.  The actual badness scores 
could easily be calculated by cat'ing /proc/*/oom_score prior to oom, but 
this data was meant to illustrate the preference given to rss compared to 
total_vm in a heuristic sense.

> I believe I'll have to remove "adding child's score to parents".
> I'm now considering how to implement fork-bomb detector for removing it.
> 

Agreed, I'm looking forward to your proposal.

> ya, I'm now considering to drop file_rss from calculation.
> 
> some reasons.
> 
>   - file caches remaining in memory at OOM tend to have some trouble to remove it.
>   - file caches tend to be shared.
>   - if file caches are from shmem, we never be able to drop them if no swap/swapfull.
> 
> Maybe we'll have better result.
> 

That sounds more appropriate.

I'm surprised you still don't see value in using the peak VM and RSS 
sizes, though, as part of your formula, as they would indicate the proportion 
of memory resident in RAM at the time of oom.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  9:01           ` David Rientjes
@ 2009-10-29  9:16             ` KAMEZAWA Hiroyuki
  2009-10-29  9:44               ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-29  9:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009 02:01:49 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > yes, then I wrote "as start point". There are many environments.
> 
> And this environment has a particularly bad result.

In my understanding, the 2nd and 3rd candidates are not important. If both
total_vm and RSS catch the same process as the 1st candidate, it's OK.
(i.e. if it is killed, the OOM situation will go away.)


> > ya, I'm now considering to drop file_rss from calculation.
> > 
> > some reasons.
> > 
> >   - file caches remaining in memory at OOM tend to have some trouble to remove it.
> >   - file caches tend to be shared.
> >   - if file caches are from shmem, we never be able to drop them if no swap/swapfull.
> > 
> > Maybe we'll have better result.
> > 
> 
> That sounds more appropriate.
> 
> I'm surprised you still don't see a value in using the peak VM and RSS 
> sizes, though, as part of your formula as it would indicate the proportion 
> of memory resident in RAM at the time of oom.
> 
I'll use swap_usage instead of peak VM size as a bonus.

  anon_rss + swap_usage/2 ? or something like that.

My first goal is not to kill an innocent process at random.
If the memory eater is killed, that's reasonable.

In my view:

  - "Killing a process because of OOM" is something bad, but not avoidable.

  - We don't need a complicated/too-clever calculation for choosing a process to kill.
    "The worst one is the memory eater!" is easy for users and admins to understand.

  - We have oom_adj now. A user can customize it if he runs an _important_ memory eater.

  - But a fork-bomb doesn't look like a memory eater if we look at each process
    individually. That needs some care.

  Then,
  - I'd like to drop file_rss.
  - I'd like to take swap_usage into account.
  - I'd like to remove the cpu_time bonus; the runtime bonus is much more important.
  - I'd like to remove the penalty from children. To do that, a fork-bomb
    detector is necessary.
  - The nice bonus is bad. (We have oom_adj instead of this.) It would be
    if (task_nice(p) < 0)
	points /= 2;
    But we already have the "root user" bonus. We can remove this line.

After the above, a much simpler, easier-to-understand selection will result.

Thanks,
-Kame


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  9:16             ` KAMEZAWA Hiroyuki
@ 2009-10-29  9:44               ` David Rientjes
  2009-10-29 23:41                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-10-29  9:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > > yes, then I wrote "as start point". There are many environments.
> > 
> > And this environment has a particularly bad result.
> 
> In my understanding, 2nd, 3rd candidates are not important. If both of
> total_vm and RSS catches the same process as 1st candidate, it's ok.
> (i.e. If killed, oom situation will go away.)
> 

The ordering would matter on a machine with smaller capacity or if Vedran 
was using mem=, theoretically at the size of its current capacity minus 
the amount of anonymous memory being mlocked by the "test" program.  When 
the oom occurs (and notice it's not triggered by "test" each time), it 
would have killed Xorg in what would otherwise be the same conditions.

> > I'm surprised you still don't see a value in using the peak VM and RSS 
> > sizes, though, as part of your formula as it would indicate the proportion 
> > of memory resident in RAM at the time of oom.
> > 
> I'll use swap_usage instead of peak VM size as bonus.
> 
>   anon_rss + swap_usage/2 ? or some.
> 
> My first purpose is not to kill not-guilty process at random.
> If memory eater is killed, it's reasnoable.
> 

We again arrive at the distinction I made earlier where there're two 
approaches: kill a task that is consuming the majority of resident RAM, or 
kill a thread that is using much more memory than expected such as a 
memory leaker.  I know that you've argued that the kernel can never know 
the latter, which I agree, but it does have the benefit of allowing the 
user to have more input and determine when an actual task is using much 
more RAM than expected; the anon_rss and swap_usage in your formula are 
highly dynamic, so you'd have to expect the user to dynamically alter 
oom_adj to specify a preference in the case of a memory leaker.

> In my consideration
> 
>   - "Killing a process because of OOM" is something bad, but not avoidable.
> 
>   - We don't need to do compliated/too-wise calculation for killing a process.
>     "The worst one is memory-eater!" is easy to understand to users and admins.
> 

Is this a proposal to remove the remainder of the heuristics as well such 
as considering superuser tasks and those with longer uptimes?  I'd agree 
with removing most of it other than the oom_adj and current->mems_allowed 
intersection penalty.  We're probably going to need to rewrite the badness 
heuristic from scratch instead of simply changing the baseline.

>   - We have oom_adj, now. User can customize it if he run _important_ memory eater.
> 

If he runs an important memory eater, he can always polarize it by 
disabling oom killing completely for that task.  However, oom_adj is also 
used to identify memory leakers when the amount of memory that it uses is 
roughly known.  Most people don't know how much memory their applications 
use, but there are systems where users have tuned oom_adj specifically 
based on comparative /proc/pid/oom_score results.  Simply using anon_rss 
and swap_usage will make that vary much more than previously.

>   - But fork-bomb doesn't seem memory eater if we see each process.
>     We need some cares.
> 

The forkbomb can be addressed in multiple ways, the simplest of which 
is counting the number of children and their runtime.  It'd 
probably be even better to isolate the forkbomb case away from the badness 
score and simply kill the parent by returning ULONG_MAX when it's 
recognized.

>   Then,
>   - I'd like to drop file_rss.
>   - I'd like to take swap_usage into acccount.
>   - I'd like to remove cpu_time bonus. runtime bonus is much more important.
>   - I'd like to remove penalty from children. To do that, fork-bomb detector
>     is necessary.
>   - nice bonus is bad. (We have oom_adj instead of this.) It should be
>     if (task_nice(p) < 0)
> 	points /= 2;
>     But we have "root user" bonus already. We can remove this line.
> 
> After above, much more simple selection, easy-to-understand,  will be done.
> 

Agreed, I think we'll need to rewrite most of the heuristic from scratch.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  9:44               ` David Rientjes
@ 2009-10-29 23:41                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-29 23:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009 02:44:45 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> >   Then,
> >   - I'd like to drop file_rss.
> >   - I'd like to take swap_usage into acccount.
> >   - I'd like to remove cpu_time bonus. runtime bonus is much more important.
> >   - I'd like to remove penalty from children. To do that, fork-bomb detector
> >     is necessary.
> >   - nice bonus is bad. (We have oom_adj instead of this.) It should be
> >     if (task_nice(p) < 0)
> > 	points /= 2;
> >     But we have "root user" bonus already. We can remove this line.
> > 
> > After above, much more simple selection, easy-to-understand,  will be done.
> > 
> 
> Agreed, I think we'll need to rewrite most of the heuristic from scratch.

I'd like to post a total redesign of the oom-killer next week.
Please wait.

Thanks,
-Kame


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  8:31       ` David Rientjes
  2009-10-29  8:46         ` KAMEZAWA Hiroyuki
@ 2009-11-01 13:29         ` KOSAKI Motohiro
  2009-11-02 10:42           ` David Rientjes
  2009-11-25 12:44         ` Andrea Arcangeli
  2 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2009-11-01 13:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, vedran.furac

> This patch would pick the memory hogging task, "test", first everytime
> just like the current implementation does.  It would then prefer Xorg,
> icedove-bin, and ktorrent next as a starting point.
>
> Admittedly, there are other heuristics that the oom killer uses to create
> a badness score.  But since this patch is only changing the baseline from
> mm->total_vm to get_mm_rss(mm), its behavior in this test case do not
> match the patch description.
>
> The vast majority of the other ooms have identical top 8 candidates:
>
> total_vm
> 673222 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 121869 firefox-bin
>
> rss
> 672271 test
> 42192 Xorg
> 30763 firefox-bin
> 13292 icedove-bin
> 10208 ktorrent
> 9260 akregator
> 8859 plasma-desktop
> 7528 krunner
>
> firefox-bin seems much more preferred in this case than total_vm, but Xorg
> still ranks very high with this patch compared to the current
> implementation.

Hi David,

I'm very interested in your observation; thanks for the good testing.
So, I'd like to clarify your point a bit.

The following is the badness list on my desktop environment (x86_64, 6GB mem).
It shows Xorg with a pretty small badness score. Do you know why such a
difference happens?


score    pid        comm
==============================
56382   3241    run-mozilla.sh
23345   3289    run-mozilla.sh
21461   3050    gnome-do
20079   2867    gnome-session
14016   3258    firefox
9212    3306    firefox
8468    3115    gnome-do
6902    3325    emacs
6783    3212    tomboy
4865    2968    python
4861    2948    nautilus
4221    1       init
(snip about 100line)
548     2590    Xorg


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-01 13:29         ` KOSAKI Motohiro
@ 2009-11-02 10:42           ` David Rientjes
  2009-11-02 12:35             ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-11-02 10:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, vedran.furac

On Sun, 1 Nov 2009, KOSAKI Motohiro wrote:

> > total_vm
> > 673222 test
> > 195695 krunner
> > 168881 plasma-desktop
> > 130567 ktorrent
> > 127081 knotify4
> > 125881 icedove-bin
> > 123036 akregator
> > 121869 firefox-bin
> >
> > rss
> > 672271 test
> > 42192 Xorg
> > 30763 firefox-bin
> > 13292 icedove-bin
> > 10208 ktorrent
> > 9260 akregator
> > 8859 plasma-desktop
> > 7528 krunner
> >
> > firefox-bin seems much more preferred in this case than total_vm, but Xorg
> > still ranks very high with this patch compared to the current
> > implementation.
> 
> Hi David,
> 
> I'm very interesting your pointing out. thanks good testing.
> So, I'd like to clarify your point a bit.
> 
> following are badness list on my desktop environment (x86_64 6GB mem).
> it show Xorg have pretty small badness score. Do you know why such
> different happen?
> 

I don't know specifically what's different on your machine from Vedran's; 
my data is simply a collection of the /proc/sys/vm/oom_dump_tasks output 
from Vedran's oom log.

I guess we could add a call to badness() for the oom_dump_tasks tasklist 
dump to get a clearer picture so we know the score for each thread group 
leader.  Anything else would be speculation at this point, though.

> score    pid        comm
> ==============================
> 56382   3241    run-mozilla.sh
> 23345   3289    run-mozilla.sh
> 21461   3050    gnome-do
> 20079   2867    gnome-session
> 14016   3258    firefox
> 9212    3306    firefox
> 8468    3115    gnome-do
> 6902    3325    emacs
> 6783    3212    tomboy
> 4865    2968    python
> 4861    2948    nautilus
> 4221    1       init
> (snip about 100line)
> 548     2590    Xorg
> 

Are these scores with your rss patch or without?  If it's without the 
patch, this is understandable since Xorg didn't appear highly in Vedran's 
log either.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-02 10:42           ` David Rientjes
@ 2009-11-02 12:35             ` KOSAKI Motohiro
  2009-11-02 19:55               ` Vedran Furač
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2009-11-02 12:35 UTC (permalink / raw)
  To: David Rientjes, vedran.furac
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli

>> Hi David,
>>
>> I'm very interesting your pointing out. thanks good testing.
>> So, I'd like to clarify your point a bit.
>>
>> following are badness list on my desktop environment (x86_64 6GB mem).
>> it show Xorg have pretty small badness score. Do you know why such
>> different happen?
>
> I don't know specifically what's different on your machine than Vedran's,
> my data is simply a collection of the /proc/sys/vm/oom_dump_tasks output
> from Vedran's oom log.
>
> I guess we could add a call to badness() for the oom_dump_tasks tasklist
> dump to get a clearer picture so we know the score for each thread group
> leader.  Anything else would be speculation at this point, though.
>
>> score    pid        comm
>> ==============================
>> 56382   3241    run-mozilla.sh
>> 23345   3289    run-mozilla.sh
>> 21461   3050    gnome-do
>> 20079   2867    gnome-session
>> 14016   3258    firefox
>> 9212    3306    firefox
>> 8468    3115    gnome-do
>> 6902    3325    emacs
>> 6783    3212    tomboy
>> 4865    2968    python
>> 4861    2948    nautilus
>> 4221    1       init
>> (snip about 100line)
>> 548     2590    Xorg
>>
>
> Are these scores with your rss patch or without?  If it's without the
> patch, this is understandable since Xorg didn't appear highly in Vedran's
> log either.

Oh, I'm sorry. I measured with the rss patch.
So I don't yet understand what gives Xorg a bad score.

Hmm...
Vedran, can you please post the result of the following command?

# cat /proc/`pidof Xorg`/smaps


I hope to understand the issue clearly before modifying any code.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-02 12:35             ` KOSAKI Motohiro
@ 2009-11-02 19:55               ` Vedran Furač
  2009-11-03 23:09                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: Vedran Furač @ 2009-11-02 19:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, Andrea Arcangeli

KOSAKI Motohiro wrote:

> Oh, I'm sorry. I mesured with rss patch.
> Then, I haven't understand what makes Xorg bad score.
> 
> Hmm...
> Vedran,  Can you please post following command result?
> 
> # cat /proc/`pidof Xorg`/smaps
> 
> I hope to undestand the issue clearly before modify any code.


No problem:

http://pastebin.com/d66972025 (long)

Xorg is from debian unstable.

Regards,

Vedran


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-02 19:55               ` Vedran Furač
@ 2009-11-03 23:09                 ` KOSAKI Motohiro
  2009-11-07 19:16                   ` Vedran Furač
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2009-11-03 23:09 UTC (permalink / raw)
  To: vedran.furac
  Cc: kosaki.motohiro, David Rientjes, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli

> KOSAKI Motohiro wrote:
> 
> > Oh, I'm sorry. I mesured with rss patch.
> > Then, I haven't understand what makes Xorg bad score.
> > 
> > Hmm...
> > Vedran,  Can you please post following command result?
> > 
> > # cat /proc/`pidof Xorg`/smaps
> > 
> > I hope to undestand the issue clearly before modify any code.
> 
> No problem:
> 
> http://pastebin.com/d66972025 (long)
> 
> Xorg is from debian unstable.

Hmm...

Your Xorg has a pretty large heap. I'm not sure why that happens.
(An ATI video card issue?)
Unfortunately, it shows up as a normal large heap to the kernel, so
I doubt the kernel can distinguish X from other processes. Probably
oom_adj is the most reasonable option....

-------------------------------------------------
[heap]
Size:             433812 kB
Rss:              433304 kB
Pss:              433304 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:       280 kB
Private_Dirty:    433024 kB
Referenced:       415656 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB





* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-03 23:09                 ` KOSAKI Motohiro
@ 2009-11-07 19:16                   ` Vedran Furač
  0 siblings, 0 replies; 29+ messages in thread
From: Vedran Furač @ 2009-11-07 19:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, Andrea Arcangeli

KOSAKI Motohiro wrote:

> Your Xorg have pretty large heap. I'm not sure why it happen. (ATI
> video card issue?)

It is ATI (fglrx), but I don't know whether it is a driver issue or not. I
have a lot of apps running, firefox with a high number of tabs, and so on.
It probably adds up.

> Unfortunatelly, It is showed as normal large heap from kernel. then, 
> I doubt kernel can distinguish X from other process. Probably oom-adj
> is most reasonable option....
>
> -------------------------------------------------
> [heap]
> Size:             433812 kB
> Rss:              433304 kB
> Pss:              433304 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:       280 kB
> Private_Dirty:    433024 kB
> Referenced:       415656 kB
> Swap:                  0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB




* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  8:31       ` David Rientjes
  2009-10-29  8:46         ` KAMEZAWA Hiroyuki
  2009-11-01 13:29         ` KOSAKI Motohiro
@ 2009-11-25 12:44         ` Andrea Arcangeli
  2009-11-25 21:39           ` David Rientjes
  2009-11-26  0:10           ` Vedran Furač
  2 siblings, 2 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2009-11-25 12:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, vedran.furac, KOSAKI Motohiro

Hello,

This is a lengthy discussion on something I think is quite obviously
better, and which I already tried to change a couple of years back
(rss instead of total_vm).

On Thu, Oct 29, 2009 at 01:31:59AM -0700, David Rientjes wrote:
> total_vm
> 708945 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 118641 kded4
> 
> rss
> 707878 test
> 42201 Xorg
> 13300 icedove-bin
> 10209 ktorrent
> 9277 akregator
> 8878 plasma-desktop
> 7546 krunner
> 4532 mysqld
> 
> This patch would pick the memory hogging task, "test", first everytime 

That is by far the only thing that matters. There's plenty of logic in
the oom killer to remove races with tasks that have TIF_MEMDIE set, and to
ensure we don't fall back to a second task until the first task has had
time to release all its memory back to the system.

> just like the current implementation does.  It would then prefer Xorg, 

You're focusing on the noise and not looking at the only thing that
matters.

The noise level with rss went down to 50000; the ordering of what's below
50000 doesn't matter. The only thing that matters is the _delta_
between the "noise-level innocent apps" and the "exploit".

The delta clearly increases from 708945 - max(noise) to
707878 - max(noise), which translates to an increase in precision from
513250 to 665677. That shows how much rss improves the accuracy of the
detection (i.e. the distance between the exploit and the first
innocent app). The lower the noise level starts, the less likely
the innocent apps are to be killed.

There's simply no way to reach perfection; some innocent apps will
always have high total_vm or rss levels, but this at least removes
lots of innocent apps from the equation. That X looks no more
innocent than before is because its rss really is quite big; that is not
an error, and luckily it is much smaller than the hog itself. Surely there
are ways to force X to load huge bitmaps into its address space too
(regardless of total_vm or rss), but again: no perfection, just better
with rss, even in this testcase.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-25 12:44         ` Andrea Arcangeli
@ 2009-11-25 21:39           ` David Rientjes
  2009-11-27 18:26             ` Andrea Arcangeli
  2009-11-26  0:10           ` Vedran Furač
  1 sibling, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-11-25 21:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, vedran.furac, KOSAKI Motohiro

On Wed, 25 Nov 2009, Andrea Arcangeli wrote:

> You're focusing on the noise and not looking at the only thing that
> matters.
> 
> The noise level with rss went down to 50000, it doesn't matter the
> order of what's below 50000. Only thing it matters is the _delta_
> between "noise-level innocent apps" and "exploit".
> 
> The delta is clearly increase from 708945-max(noise) to
> 707878-max(noise) which translates to a increase of precision from
> 513250 to 665677, which shows how much more rss is making the
> detection more accurate (i.e. the distance between exploit and first
> innocent app). The lower level the noise level starts, the less likely
> the innocent apps are killed.
> 

That's not surprising since the amount of physical RAM is the constraining 
factor.

> There's simply no way to get to perfection, some innocent apps will
> always have high total_vm or rss levels, but this at least removes
> lots of innocent apps from the equation. The fact X isn't less
> innocent than before is because its rss is quite big, and this is not
> an error, luckily much smaller than the hog itself. Surely there are
> ways to force X to load huge bitmaps into its address space too
> (regardless of total_vm or rss) but again no perfection, just better
> with rss even in this testcase.
> 

We use the oom killer as a mechanism to enforce memory containment policy, 
we are much more interested in the oom killing priority than the oom 
killer's own heuristics to determine the ideal task to kill.  Those 
heuristics can't possibly represent the priorities for all possible 
workloads, so we require input from the user via /proc/pid/oom_adj to 
adjust that heuristic.  That has traditionally always used total_vm as a 
baseline which is a much more static value and can be quantified within a 
reasonable range by experimental data when it would not be defined as 
rogue.  By changing the baseline to rss, we lose much of that control 
since its more dynamic and dependent on the current state of the machine 
at the time of the oom which can be predicted with less accuracy.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-25 12:44         ` Andrea Arcangeli
  2009-11-25 21:39           ` David Rientjes
@ 2009-11-26  0:10           ` Vedran Furač
  2009-11-26  1:32             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 29+ messages in thread
From: Vedran Furač @ 2009-11-26  0:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, KOSAKI Motohiro

Andrea Arcangeli wrote:

> Hello,

Hi all!

> lengthy discussion on something I think is quite obviously better and
> I tried to change a couple of years back already (rss instead of
> total_vm).

Now that 2.6.32 is almost out, is it possible to get the OOM killer fixed in
2.6.33, so that I could turn overcommit on (overcommit_memory=0) again
without fear of losing my work?

Regards,

Vedran


-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-26  0:10           ` Vedran Furač
@ 2009-11-26  1:32             ` KAMEZAWA Hiroyuki
  2009-11-27  1:56               ` Vedran Furač
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-26  1:32 UTC (permalink / raw)
  To: vedran.furac
  Cc: Andrea Arcangeli, David Rientjes, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, KOSAKI Motohiro

On Thu, 26 Nov 2009 01:10:12 +0100
Vedran Furač <vedran.furac@gmail.com> wrote:

> Andrea Arcangeli wrote:
> 
> > Hello,
> 
> Hi all!
> 
> > lengthy discussion on something I think is quite obviously better and
> > I tried to change a couple of years back already (rss instead of
> > total_vm).
> 
> Now that 2.6.32 is almost out, is it possible to get OOMK fixed in
> 2.6.33 so that I could turn overcommit on (overcommit_memory=0) again
> without fear of loosing my work?
> 
I'll try the fork-bomb detector again. That will finally help your X.org,
but it may miss 2.6.33.

Adding a new counter to mm_struct has been rejected because of scalability
concerns, so the total work will need more time (than expected).
I'm sorry I can't find enough time these weeks.

Thanks,
-Kame


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-26  1:32             ` KAMEZAWA Hiroyuki
@ 2009-11-27  1:56               ` Vedran Furač
  0 siblings, 0 replies; 29+ messages in thread
From: Vedran Furač @ 2009-11-27  1:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, David Rientjes, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, KOSAKI Motohiro

KAMEZAWA Hiroyuki wrote:

> On Thu, 26 Nov 2009 01:10:12 +0100
> Vedran Furač <vedran.furac@gmail.com> wrote:
> 
>> Andrea Arcangeli wrote:
>>
>>> lengthy discussion on something I think is quite obviously better and
>>> I tried to change a couple of years back already (rss instead of
>>> total_vm).
>> Now that 2.6.32 is almost out, is it possible to get OOMK fixed in
>> 2.6.33 so that I could turn overcommit on (overcommit_memory=0) again
>> without fear of losing my work?
>>
> I'll try the fork-bomb detector again. That will finally help your X.org,
> but it may miss 2.6.33.
> 
> Adding a new counter to mm_struct has been rejected because of scalability
> concerns, so the total work will need more time (than expected).
> I'm sorry I can't get enough time in these weeks.

Thanks for working on this! Hope it gets into 33. Keep me posted.

Regards,

Vedran


-- 
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-25 21:39           ` David Rientjes
@ 2009-11-27 18:26             ` Andrea Arcangeli
  2009-11-30 23:09               ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2009-11-27 18:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, vedran.furac, KOSAKI Motohiro

On Wed, Nov 25, 2009 at 01:39:59PM -0800, David Rientjes wrote:
> adjust that heuristic.  That has traditionally always used total_vm as a 
> baseline which is a much more static value and can be quantified within a 
> reasonable range by experimental data when it would not be defined as 
> rogue.  By changing the baseline to rss, we lose much of that control 
> since it's more dynamic and dependent on the current state of the machine
> at the time of the oom which can be predicted with less accuracy.

Ok I can see the fact by being dynamic and less predictable worries
you. The "second to last" tasks especially are going to be less
predictable, but the memory hog would normally end up accounting for
most of the memory and this should increase the badness delta between
the offending task (or tasks) and the innocent stuff, making it
more reliable. The innocent stuff should be more and more paged out
from ram. So I tend to think it'll be much less likely to kill an
innocent task this way (as demonstrated in practice by your
measurement too), but it's true there's no guarantee it'll always do
the right thing, because it's a heuristic anyway, but even total_vm
doesn't provide guarantee unless your workload is stationary and your
badness scores are fixed and no virtual memory is ever allocated by
any task in the system and no new tasks are spawned.

It'd help if you posted a regression showing a smaller delta between the
oom-target task and the second task. My email was just to point out that
your measurement was a good thing in oom-killing terms. If I have to
imagine the worst case for this, it's an app allocating memory at a very
low pace, then slowly getting swapped out and accumulating a huge swap
size. Maybe we need to add swap size to rss, dunno, but the paged-out
MAP_SHARED equivalent can't be accounted for the way we can account swap
size, so in practice I feel a raw rss is going to be more practical
than making swap special vs file mappings.
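The VSZ/RSS gap being argued over here can be seen directly from userspace. A small illustrative script (not part of any patch; field order per proc(5)) reads the same two numbers the competing baselines are built on:

```python
# Compare a process's virtual size (the total_vm baseline) with its
# resident set (the proposed rss baseline) by reading /proc/<pid>/statm.
# statm reports sizes in pages: size resident shared text lib data dt.

def vm_and_rss_pages(pid="self"):
    """Return (total_vm, rss) in pages, as the badness heuristic sees them."""
    with open(f"/proc/{pid}/statm") as f:
        fields = f.read().split()
    return int(fields[0]), int(fields[1])

if __name__ == "__main__":
    total_vm, rss = vm_and_rss_pages()
    # For most modern applications total_vm far exceeds rss; that gap is
    # why total_vm is a poor proxy for how much memory a kill would free.
    print(f"total_vm={total_vm} pages, rss={rss} pages")
```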


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-27 18:26             ` Andrea Arcangeli
@ 2009-11-30 23:09               ` David Rientjes
  2009-12-01  4:43                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-11-30 23:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
	Hugh Dickins, vedran.furac, KOSAKI Motohiro

On Fri, 27 Nov 2009, Andrea Arcangeli wrote:

> Ok I can see the fact by being dynamic and less predictable worries
> you. The "second to last" tasks especially are going to be less
> predictable, but the memory hog would normally end up accounting for
> most of the memory and this should increase the badness delta between
> the offending task (or tasks) and the innocent stuff, making it
> more reliable. The innocent stuff should be more and more paged out
> from ram. So I tend to think it'll be much less likely to kill an
> innocent task this way (as demonstrated in practice by your
> measurement too), but it's true there's no guarantee it'll always do
> the right thing, because it's a heuristic anyway, but even total_vm
> doesn't provide guarantee unless your workload is stationary and your
> badness scores are fixed and no virtual memory is ever allocated by
> any task in the system and no new tasks are spawned.
> 

The purpose of /proc/pid/oom_adj is not always to polarize the heuristic 
for the task it represents, it allows userspace to define when a task is 
rogue.  Working with total_vm as a baseline, it is simple to use the 
interface to tune the heuristic to prefer a certain task over another when 
its memory consumption goes beyond what is expected.  With this interface, 
I can easily define when an application should be oom killed because it is 
using far more memory than expected.  I can also disable oom killing 
completely for it, if necessary.  Unless you have a consistent baseline 
for all tasks, the adjustment wouldn't contextually make any sense.  Using 
rss does not allow users to statically define when a task is rogue and is 
dependent on the current state of memory at the time of oom.

I would support removing most of the other heuristics other than the 
baseline and the nodes intersection with mems_allowed to prefer tasks in 
the same cpuset, though, to make it easier to understand and tune.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-30 23:09               ` David Rientjes
@ 2009-12-01  4:43                 ` KOSAKI Motohiro
  2009-12-01 22:20                   ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2009-12-01  4:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel, Andrew Morton, Hugh Dickins, vedran.furac

> On Fri, 27 Nov 2009, Andrea Arcangeli wrote:
> 
> > Ok I can see the fact by being dynamic and less predictable worries
> > you. The "second to last" tasks especially are going to be less
> > predictable, but the memory hog would normally end up accounting for
> > most of the memory and this should increase the badness delta between
> > the offending task (or tasks) and the innocent stuff, making it
> > more reliable. The innocent stuff should be more and more paged out
> > from ram. So I tend to think it'll be much less likely to kill an
> > innocent task this way (as demonstrated in practice by your
> > measurement too), but it's true there's no guarantee it'll always do
> > the right thing, because it's a heuristic anyway, but even total_vm
> > doesn't provide guarantee unless your workload is stationary and your
> > badness scores are fixed and no virtual memory is ever allocated by
> > any task in the system and no new tasks are spawned.
> > 
> 
> The purpose of /proc/pid/oom_adj is not always to polarize the heuristic 
> for the task it represents, it allows userspace to define when a task is 
> rogue.  Working with total_vm as a baseline, it is simple to use the 
> interface to tune the heuristic to prefer a certain task over another when 
> its memory consumption goes beyond what is expected.  With this interface, 
> I can easily define when an application should be oom killed because it is 
> using far more memory than expected.  I can also disable oom killing 
> completely for it, if necessary.  Unless you have a consistent baseline 
> for all tasks, the adjustment wouldn't contextually make any sense.  Using 
> rss does not allow users to statically define when a task is rogue and is 
> dependent on the current state of memory at the time of oom.
> 
> I would support removing most of the other heuristics other than the 
> baseline and the nodes intersection with mems_allowed to prefer tasks in 
> the same cpuset, though, to make it easier to understand and tune.

I feel you're saying oom_adj doesn't fit your use case. You probably need
a new /proc/{pid}/oom_priority knob; the oom adjustment doesn't fit you.
You need a job-severity-based oom killing order, and severity doesn't
depend on any heuristic.
A server administrator should know the job severity on his system.

The OOM heuristic should mainly consider desktop usage, because desktop
users don't change the oom knobs at all, and they don't know which daemons
are important. Any useful heuristic has some dynamic aspect; we can't avoid it.

Thoughts?



* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-01  4:43                 ` KOSAKI Motohiro
@ 2009-12-01 22:20                   ` David Rientjes
  2009-12-02  0:35                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-12-01 22:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, vedran.furac

On Tue, 1 Dec 2009, KOSAKI Motohiro wrote:

> > The purpose of /proc/pid/oom_adj is not always to polarize the heuristic 
> > for the task it represents, it allows userspace to define when a task is 
> > rogue.  Working with total_vm as a baseline, it is simple to use the 
> > interface to tune the heuristic to prefer a certain task over another when 
> > its memory consumption goes beyond what is expected.  With this interface, 
> > I can easily define when an application should be oom killed because it is 
> > using far more memory than expected.  I can also disable oom killing 
> > completely for it, if necessary.  Unless you have a consistent baseline 
> > for all tasks, the adjustment wouldn't contextually make any sense.  Using 
> > rss does not allow users to statically define when a task is rogue and is 
> > dependent on the current state of memory at the time of oom.
> > 
> > I would support removing most of the other heuristics other than the 
> > baseline and the nodes intersection with mems_allowed to prefer tasks in 
> > the same cpuset, though, to make it easier to understand and tune.
> 
> I feel you're saying oom_adj doesn't fit your use case. You probably need
> a new /proc/{pid}/oom_priority knob; the oom adjustment doesn't fit you.
> You need a job-severity-based oom killing order, and severity doesn't
> depend on any heuristic.
> A server administrator should know the job severity on his system.
> 

That's the complete opposite of what I wrote above, we use oom_adj to 
define when a user application is considered "rogue," meaning that it is 
using far more memory than expected and so we want it killed.  As you 
mentioned weeks ago, the kernel cannot identify a memory leaker; this is 
the user interface to allow the oom killer to identify a memory-hogging 
rogue task that will (probably) consume all system memory eventually.  
The way oom_adj is implemented, with a bit shift on a baseline of 
total_vm, it can also polarize the badness heuristic to kill an 
application based on priority by examining /proc/pid/oom_score, but that 
wasn't my concern in this case.  Using rss as a baseline reduces my 
ability to tune oom_adj appropriately to identify those rogue tasks 
because it is highly dynamic depending on the state of the VM at the time 
of oom.
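The "bit shift on a baseline of total_vm" works roughly as follows. This is a simplified userspace model of the 2.6.32-era badness() adjustment; the overflow clamping and all the other heuristic factors are omitted:

```python
OOM_DISABLE = -17  # oom_adj value that exempts a task entirely

def adjusted_badness(baseline_points, oom_adj):
    """Apply the oom_adj bit shift to a baseline badness score."""
    if oom_adj == OOM_DISABLE:
        return 0                            # task is never selected
    if oom_adj > 0:
        return baseline_points << oom_adj   # polarize toward killing
    return baseline_points >> -oom_adj      # shield the task

# oom_adj=+3 makes a task look 8x worse; oom_adj=-3, 8x less likely.
```

Because the shift multiplies whatever the baseline is, a tuning chosen against a static total_vm means something different once the baseline becomes a dynamic rss value, which is the control David says is lost.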


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-01 22:20                   ` David Rientjes
@ 2009-12-02  0:35                     ` KOSAKI Motohiro
  2009-12-03 23:25                       ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2009-12-02  0:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm,
	linux-kernel, Andrew Morton, Hugh Dickins, vedran.furac

> On Tue, 1 Dec 2009, KOSAKI Motohiro wrote:
> 
> > > The purpose of /proc/pid/oom_adj is not always to polarize the heuristic 
> > > for the task it represents, it allows userspace to define when a task is 
> > > rogue.  Working with total_vm as a baseline, it is simple to use the 
> > > interface to tune the heuristic to prefer a certain task over another when 
> > > its memory consumption goes beyond what is expected.  With this interface, 
> > > I can easily define when an application should be oom killed because it is 
> > > using far more memory than expected.  I can also disable oom killing 
> > > completely for it, if necessary.  Unless you have a consistent baseline 
> > > for all tasks, the adjustment wouldn't contextually make any sense.  Using 
> > > rss does not allow users to statically define when a task is rogue and is 
> > > dependent on the current state of memory at the time of oom.
> > > 
> > > I would support removing most of the other heuristics other than the 
> > > baseline and the nodes intersection with mems_allowed to prefer tasks in 
> > > the same cpuset, though, to make it easier to understand and tune.
> > 
> > I feel you're saying oom_adj doesn't fit your use case. You probably need
> > a new /proc/{pid}/oom_priority knob; the oom adjustment doesn't fit you.
> > You need a job-severity-based oom killing order, and severity doesn't
> > depend on any heuristic.
> > A server administrator should know the job severity on his system.
> 
> That's the complete opposite of what I wrote above, we use oom_adj to 
> define when a user application is considered "rogue," meaning that it is 
> using far more memory than expected and so we want it killed.  As you 
> mentioned weeks ago, the kernel cannot identify a memory leaker; this is 
> the user interface to allow the oom killer to identify a memory-hogging 
> rogue task that will (probably) consume all system memory eventually.  
> The way oom_adj is implemented, with a bit shift on a baseline of 
> total_vm, it can also polarize the badness heuristic to kill an 
> application based on priority by examining /proc/pid/oom_score, but that 
> wasn't my concern in this case.  Using rss as a baseline reduces my 
> ability to tune oom_adj appropriately to identify those rogue tasks 
> because it is highly dynamic depending on the state of the VM at the time 
> of oom.

 - I mean you hardly need the kernel heuristic, but desktop users need it.
 - All job schedulers provide a memory limitation feature, but the OOM killer
   isn't there to implement memory limitation; we have the memory cgroup.
 - If you need a memory-usage-based knob, reading /proc/{pid}/statm and
   writing /proc/{pid}/oom_priority would probably work well.
 - Unfortunately, we can't continue to use VSZ-based heuristics, because
   modern applications waste 10x more VSZ than their RSS consumption.
   Nowadays VSZ isn't a good approximation of RSS. There's no good reason
   to continue, from the desktop user's view.

IOW, the kernel heuristic should target the majority of users; we provide
a knob to help the minority.

Or, do you have any idea for detecting the difference between a typical
desktop and your use case?
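A userspace sketch of the statm-polling scheme proposed in the third bullet. Note that /proc/{pid}/oom_priority is a hypothetical knob that was never merged: the ranking half reads the real statm file, while the write half only illustrates the proposal and is a no-op on a real kernel.

```python
import os

def rss_pages(pid):
    """Resident set size in pages for one task; 0 if it vanished."""
    try:
        with open(f"/proc/{pid}/statm") as f:
            return int(f.read().split()[1])
    except (OSError, ValueError, IndexError):
        return 0

def rank_tasks():
    """Return pids sorted by RSS, biggest consumer first."""
    pids = [p for p in os.listdir("/proc") if p.isdigit()]
    return sorted(pids, key=rss_pages, reverse=True)

def set_oom_priority(pid, prio):
    """Write the proposed per-task knob.  /proc/<pid>/oom_priority does
    not exist in any mainline kernel, so this does nothing in practice."""
    path = f"/proc/{pid}/oom_priority"
    if os.path.exists(path):        # never true on a real kernel
        with open(path, "w") as f:
            f.write(str(prio))

if __name__ == "__main__":
    for prio, pid in enumerate(rank_tasks()):
        set_oom_priority(pid, prio)
```

David's objection in the reply that follows is precisely that such a daemon is always racing the allocations it is trying to rank: by the time the loop completes, the ordering may already be stale.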




* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-02  0:35                     ` KOSAKI Motohiro
@ 2009-12-03 23:25                       ` David Rientjes
  2009-12-04  0:44                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-12-03 23:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, vedran.furac

On Wed, 2 Dec 2009, KOSAKI Motohiro wrote:

>  - I mean you hardly need the kernel heuristic, but desktop users need it.

My point is that userspace needs to be able to identify memory leaking 
tasks and polarize oom killing priorities.  /proc/pid/oom_adj does a good 
job of both with total_vm as a baseline.

>  - All job schedulers provide a memory limitation feature, but the OOM killer
>    isn't there to implement memory limitation; we have the memory cgroup.

Wrong, the oom killer implements cpuset memory limitations.

>  - If you need a memory-usage-based knob, reading /proc/{pid}/statm and
>    writing /proc/{pid}/oom_priority would probably work well.

Constantly polling /proc/pid/stat and updating the oom killer priorities 
at a constant interval is a ridiculous proposal for identifying memory 
leakers, sorry.

>  - Unfortunately, we can't continue to use VSZ-based heuristics, because
>    modern applications waste 10x more VSZ than their RSS consumption.
>    Nowadays VSZ isn't a good approximation of RSS. There's no good reason
>    to continue, from the desktop user's view.
> 

Then leave the heuristic alone by default so we don't lose any
functionality we once had, and add additional heuristics depending on the
environment, selected by a new tunable.

> IOW, the kernel heuristic should target the majority of users; we provide
> a knob to help the minority.
> 

Moving the baseline to rss severely impacts the legitimacy of that knob:
we lose a lot of control over identifying memory leakers and polarizing
oom killer priorities, because rss depends on the state of the VM at the
time of oom, which /proc/pid/oom_adj may not have recently been updated
to reflect.

I don't know why you continuously invoke the same arguments to completely 
change the baseline for the oom killer heuristic because you falsely 
believe that killing the task with the largest memory resident in RAM is 
more often than not the ideal task to kill.  It's very frustrating when 
you insist on changing the default heuristic based on your own belief that 
people use Linux in the same way you do.

If Andrew pushes the patch to change the baseline to rss 
(oom_kill-use-rss-instead-of-vm-size-for-badness.patch) to Linus, I'll 
strongly nack it because you totally lack the ability to identify memory 
leakers as defined by userspace which should be the prime target for the 
oom killer.  You have not addressed that problem, you've merely talked 
around it, and yet the patch unbelievably still sits in -mm.


* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-03 23:25                       ` David Rientjes
@ 2009-12-04  0:44                         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  0:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrea Arcangeli, linux-mm, linux-kernel,
	Andrew Morton, Hugh Dickins, vedran.furac

On Thu, 3 Dec 2009 15:25:05 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> If Andrew pushes the patch to change the baseline to rss 
> (oom_kill-use-rss-instead-of-vm-size-for-badness.patch) to Linus, I'll 
> strongly nack it because you totally lack the ability to identify memory 
> leakers as defined by userspace which should be the prime target for the 
> oom killer.  You have not addressed that problem, you've merely talked 
> around it, and yet the patch unbelievably still sits in -mm.
>
The oom-kill patches are still cooking, and I'll ask Andrew not to send
them to Linus when he posts the mm merge plan. At least a per-mm swap
counter and a lowmem-rss counter are necessary. I'll rewrite the
fork-bomb detector, too.

To repeat myself: calculating badness from vm_size _only_ is bad.
I'm not sure how Google's magical applications work, but in general
vm_size doesn't reflect private memory usage, i.e. how much memory the
oom-killer can actually free.
And the current oom-killer kills the wrong process. Please share your
ideas for making the oom-killer better rather than just saying "don't do
that".

Do you have a good algorithm for detecting a memory-leaking process in
userland? I added some detection in my old patch set, but it's not enough.
I'll add more statistics to mm_struct to do a better job.

BTW, I hate oom_adj very much. Its "shift" nature is hard to understand.
I wonder why a static oom priority or oom_threshold was never implemented...

Thanks,
-Kame 


end of thread, other threads:[~2009-12-04  0:47 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-28  8:58 [PATCH] oom_kill: use rss value instead of vm size for badness KAMEZAWA Hiroyuki
2009-10-28  9:15 ` David Rientjes
2009-10-28 11:04   ` KAMEZAWA Hiroyuki
2009-10-29  1:00     ` KAMEZAWA Hiroyuki
2009-10-29  2:31       ` Minchan Kim
2009-10-29  8:31       ` David Rientjes
2009-10-29  8:46         ` KAMEZAWA Hiroyuki
2009-10-29  9:01           ` David Rientjes
2009-10-29  9:16             ` KAMEZAWA Hiroyuki
2009-10-29  9:44               ` David Rientjes
2009-10-29 23:41                 ` KAMEZAWA Hiroyuki
2009-11-01 13:29         ` KOSAKI Motohiro
2009-11-02 10:42           ` David Rientjes
2009-11-02 12:35             ` KOSAKI Motohiro
2009-11-02 19:55               ` Vedran Furač
2009-11-03 23:09                 ` KOSAKI Motohiro
2009-11-07 19:16                   ` Vedran Furač
2009-11-25 12:44         ` Andrea Arcangeli
2009-11-25 21:39           ` David Rientjes
2009-11-27 18:26             ` Andrea Arcangeli
2009-11-30 23:09               ` David Rientjes
2009-12-01  4:43                 ` KOSAKI Motohiro
2009-12-01 22:20                   ` David Rientjes
2009-12-02  0:35                     ` KOSAKI Motohiro
2009-12-03 23:25                       ` David Rientjes
2009-12-04  0:44                         ` KAMEZAWA Hiroyuki
2009-11-26  0:10           ` Vedran Furač
2009-11-26  1:32             ` KAMEZAWA Hiroyuki
2009-11-27  1:56               ` Vedran Furač
