* [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-28 8:58 KAMEZAWA Hiroyuki
2009-10-28 9:15 ` David Rientjes
0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 8:58 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, akpm, hugh.dickins, aarcange, vedran.furac,
kosaki.motohiro
I may add more tweaks based on this, but this patch will be a good, simple
starting point. This patch is based on mmotm + Kosaki's
http://marc.info/?l=linux-kernel&m=125669809305167&w=2
Test results on various environments are appreciated.
Regards,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
It's reported that the OOM killer kills Gnome/KDE first...
And yes, we can reproduce it easily.

Currently, the OOM killer uses mm->total_vm as its base value. But in recent
applications there is a big gap between VM size and RSS size, because:
- Applications link against many dynamic libraries. (Gnome, KDE, etc...)
- Applications may allocate a big VM area but use only a small part of it.
  (Java and multi-threaded applications have this tendency because of the
  default stack size.)

I think using mm->total_vm as the score for oom-kill is not good.
For the same reason, memory overcommit can't work as expected.
(In other words, if we depend on total_vm, a more aggressive overcommit
setting is a good choice.)

This patch uses mm->anon_rss/file_rss as the base value for calculating badness.
The following shows the change to the OOM score (badness) on an environment
with 1.6G of memory plus two memory eaters (500M & 1G).

Top 10 badness scores. (The highest one is the first candidate to be killed)
Before
badness program
91228 gnome-settings-
94210 clock-applet
103202 mixer_applet2
106563 tomboy
112947 gnome-terminal
128944 mmap <----------- 500M malloc
129332 nautilus
215476 bash <----------- parent of 2 mallocs.
256944 mmap <----------- 1G malloc
423586 gnome-session
After
badness
1911 mixer_applet2
1955 clock-applet
1986 xinit
1989 gnome-session
2293 nautilus
2955 gnome-terminal
4113 tomboy
104163 mmap <----------- 500M malloc.
168577 bash <----------- parent of 2 mallocs
232375 mmap <----------- 1G malloc
Seems good to me.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/oom_kill.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = get_mm_counter(mm, anon_rss) + get_mm_counter(mm, file_rss);
/*
* After this unlock we can no longer dereference local variable `mm'
@@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
*/
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
+ if (child->mm != mm && child->mm) {
+ unsigned long cpoints;
+ cpoints = get_mm_counter(child->mm, anon_rss);
+ + get_mm_counter(child->mm, file_rss);
+ points += cpoints/2 + 1;
+ }
task_unlock(child);
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-28  9:15 ` David Rientjes
From: David Rientjes @ 2009-10-28 9:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> It's reported that OOM-Killer kills Gnone/KDE at first...
> And yes, we can reproduce it easily.
> [...]
> This patch uses mm->anon_rss/file_rss as base value for calculating badness.
>

How does this affect the ability of the user to tune the badness score of
individual threads?  It seems like there will now only be two polarizing
options: the equivalent of an oom_adj value of +15 or -17.  It is now
heavily dependent on the rss, which may be unclear at the time of oom and
very dynamic.

I think a longer-term solution may rely more on the difference between
get_mm_hiwater_rss() and get_mm_rss() instead, to know the difference
between what is resident in RAM at the time of oom compared to what has
been swapped.  Using this with get_mm_hiwater_vm() would produce a nice
picture of the pattern of each task's memory consumption.

> @@ -116,8 +116,12 @@ unsigned long badness(struct task_struct
>  	 */
>  	list_for_each_entry(child, &p->children, sibling) {
>  		task_lock(child);
> -		if (child->mm != mm && child->mm)
> -			points += child->mm->total_vm/2 + 1;
> +		if (child->mm != mm && child->mm) {
> +			unsigned long cpoints;
> +			cpoints = get_mm_counter(child->mm, anon_rss);
> +			+ get_mm_counter(child->mm, file_rss);

That shouldn't compile.

> +			points += cpoints/2 + 1;
> +		}
>  		task_unlock(child);
>  	}

This can all be simplified by just using get_mm_rss(mm) and
get_mm_rss(child->mm).
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-28 11:04 ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-10-28 11:04 UTC (permalink / raw)
To: David Rientjes
Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

David Rientjes wrote:
> On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
> [...]
> How does this affect the ability of the user to tune the badness score of
> individual threads?

Threads? Or processes?

> It seems like there will now only be two polarizing
> options: the equivalent of an oom_adj value of +15 or -17.  It is now
> heavily dependent on the rss, which may be unclear at the time of oom and
> very dynamic.
>

Yes, and that "dynamic" is a good thing. I think one of the current
problems with oom is that users say "the oom-killer kills processes at
random." And yes, that's correct: mm->total_vm is not related to actual
memory usage, so the oom-killer appears to kill processes at random. For
example, as Vedran showed, even when a memory eater is running, other
processes are killed _at random_.

After this patch, the biggest memory user will be the first candidate,
which is reasonable. Users will understand "this process was killed
because it used much memory" (it no longer seems random), and can then
decide whether to set oom_adj for the memory eater.

> I think a longer-term solution may rely more on the difference between
> get_mm_hiwater_rss() and get_mm_rss() instead, to know the difference
> between what is resident in RAM at the time of oom compared to what has
> been swapped.  Using this with get_mm_hiwater_vm() would produce a nice
> picture of the pattern of each task's memory consumption.
>

Hmm, I don't want a complicated calculation (it makes oom_adj harder to
use), but yes, bare rss may be too simple. Anyway, as I mentioned, I'll
add swap statistics regardless of this patch. That may add a new hint,
for example:

	if (vm_swap_full())
		points += mm->swap_usage

> That shouldn't compile.

Oh, yes... thanks.

> This can all be simplified by just using get_mm_rss(mm) and
> get_mm_rss(child->mm).
>

Will use that. I'll wait until next week to post a new patch; there is
no need to rush.

Thanks,
-Kame
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-29  1:00 ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-10-29 1:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

> I'll wait until next week to post a new patch; there is no need to rush.
>
I wrote the above... but for my mental health, here is the bug-fixed
version. Sorry for my carelessness. David, thank you for your review.

Regards,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

It's reported that the OOM killer kills Gnome/KDE first...
And yes, we can reproduce it easily.

Currently, the OOM killer uses mm->total_vm as its base value. But in
recent applications there is a big gap between VM size and RSS size,
because:
- Applications link against many dynamic libraries. (Gnome, KDE, etc...)
- Applications may allocate a big VM area but use only a small part of it.
  (Java and multi-threaded applications have this tendency because of the
  default stack size.)

I think using mm->total_vm as the score for oom-kill is not good.
For the same reason, memory overcommit can't work as expected.
(In other words, if we depend on total_vm, a more aggressive overcommit
setting is a good choice.)

This patch uses mm->anon_rss/file_rss as the base value for calculating
badness.

The following shows the change to the OOM score (badness) on an
environment with 1.6G of memory plus two memory eaters (500M & 1G).

Top 10 badness scores. (The highest one is the first candidate to be killed)
Before
badness program
 91228 gnome-settings-
 94210 clock-applet
103202 mixer_applet2
106563 tomboy
112947 gnome-terminal
128944 mmap            <----------- 500M malloc
129332 nautilus
215476 bash            <----------- parent of 2 mallocs
256944 mmap            <----------- 1G malloc
423586 gnome-session

After
badness
  1911 mixer_applet2
  1955 clock-applet
  1986 xinit
  1989 gnome-session
  2293 nautilus
  2955 gnome-terminal
  4113 tomboy
104163 mmap            <----------- 500M malloc
168577 bash            <----------- parent of 2 mallocs
232375 mmap            <----------- 1G malloc

Seems good to me. Maybe we can tweak this patch more, but this one will
be a good starting point.

Changelog: 2009/10/29
- use get_mm_rss() instead of get_mm_counter()

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/oom_kill.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -93,7 +93,7 @@ unsigned long badness(struct task_struct
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = get_mm_rss(mm);
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -117,7 +117,7 @@ unsigned long badness(struct task_struct
 	list_for_each_entry(child, &p->children, sibling) {
 		task_lock(child);
 		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
+			points += get_mm_rss(child->mm)/2 + 1;
 		task_unlock(child);
 	}
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-29  2:31 ` Minchan Kim
From: Minchan Kim @ 2009-10-29 2:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, Oct 29, 2009 at 10:00 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> [...]
> Changelog: 2009/10/29
> - use get_mm_rss() instead of get_mm_counter()
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Let's start from this.

--
Kind regards,
Minchan Kim
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-29  8:31 ` David Rientjes
From: David Rientjes @ 2009-10-29 8:31 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> [...]
> Seems good to me. Maybe we can tweak this patch more, but this one will
> be a good starting point.
>

This appears to actually prefer X more than total_vm in Vedran's test
case.  He cited http://pastebin.com/f3f9674a0 in
http://marc.info/?l=linux-kernel&m=125678557002888.

There are 12 ooms in this log, which has /proc/sys/vm/oom_dump_tasks
enabled.  It shows the difference between the top total_vm candidates vs.
the top rss candidates.

total_vm
708945 test
195695 krunner
168881 plasma-desktop
130567 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator
118641 kded4

rss
707878 test
 42201 Xorg
 13300 icedove-bin
 10209 ktorrent
  9277 akregator
  8878 plasma-desktop
  7546 krunner
  4532 mysqld

This patch would pick the memory hogging task, "test", first every time,
just like the current implementation does.  It would then prefer Xorg,
icedove-bin, and ktorrent next as a starting point.

Admittedly, there are other heuristics that the oom killer uses to create
a badness score.  But since this patch only changes the baseline from
mm->total_vm to get_mm_rss(mm), its behavior in this test case does not
match the patch description.

The vast majority of the other ooms have identical top 8 candidates:

total_vm
673222 test
195695 krunner
168881 plasma-desktop
130567 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator
121869 firefox-bin

rss
672271 test
 42192 Xorg
 30763 firefox-bin
 13292 icedove-bin
 10208 ktorrent
  9260 akregator
  8859 plasma-desktop
  7528 krunner

firefox-bin seems much more preferred in this case than with total_vm,
but Xorg still ranks very high with this patch compared to the current
implementation.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-29  8:46 ` KAMEZAWA Hiroyuki
From: KAMEZAWA Hiroyuki @ 2009-10-29 8:46 UTC (permalink / raw)
To: David Rientjes
Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009 01:31:59 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> [...]
> This patch would pick the memory hogging task, "test", first every time,
> just like the current implementation does.  It would then prefer Xorg,
> icedove-bin, and ktorrent next as a starting point.
>
> Admittedly, there are other heuristics that the oom killer uses to create
> a badness score.  But since this patch only changes the baseline from
> mm->total_vm to get_mm_rss(mm), its behavior in this test case does not
> match the patch description.
>

Yes; that's why I wrote "as a starting point". There are many
environments. But I'm not sure why ntpd could become the first
candidate... The scores you show don't include the children's scores,
right?

I believe I'll have to remove "adding child's score to parents".
I'm now considering how to implement a fork-bomb detector so that it can
be removed.

> firefox-bin seems much more preferred in this case than with total_vm,
> but Xorg still ranks very high with this patch compared to the current
> implementation.
>

Yes, so I'm now considering dropping file_rss from the calculation, for
some reasons:

- file caches remaining in memory at OOM tend to be hard to remove.
- file caches tend to be shared.
- if file caches are from shmem, we can never drop them when there is no
  swap or swap is full.

Maybe we'll get a better result.

Regards,
-Kame
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
@ 2009-10-29  9:01 ` David Rientjes
From: David Rientjes @ 2009-10-29 9:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> Yes; that's why I wrote "as a starting point". There are many
> environments.

And this environment has a particularly bad result.

> But I'm not sure why ntpd could become the first candidate...
> The scores you show don't include the children's scores, right?
>

Right, it's just the get_mm_rss(mm) for each thread shown in the oom
dump, the same value you've used as the new baseline.  The actual badness
scores could easily be calculated by cat'ing /proc/*/oom_score prior to
oom, but this data was meant to illustrate the preference given to rss
compared to total_vm in a heuristic sense.

> I believe I'll have to remove "adding child's score to parents".
> I'm now considering how to implement a fork-bomb detector so that it
> can be removed.
>

Agreed, I'm looking forward to your proposal.

> Yes, so I'm now considering dropping file_rss from the calculation, for
> some reasons:
>
> - file caches remaining in memory at OOM tend to be hard to remove.
> - file caches tend to be shared.
> - if file caches are from shmem, we can never drop them when there is
>   no swap or swap is full.
>
> Maybe we'll get a better result.
>

That sounds more appropriate.

I'm surprised you still don't see a value in using the peak VM and RSS
sizes, though, as part of your formula, as they would indicate the
proportion of memory resident in RAM at the time of oom.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness 2009-10-29 9:01 ` David Rientjes @ 2009-10-29 9:16 ` KAMEZAWA Hiroyuki 2009-10-29 9:44 ` David Rientjes 0 siblings, 1 reply; 29+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-10-29 9:16 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli, vedran.furac, KOSAKI Motohiro On Thu, 29 Oct 2009 02:01:49 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > yes, then I wrote "as start point". There are many environments. > > And this environment has a particularly bad result. > yes, then I wrote "as start point". There are many environments. In my understanding, 2nd, 3rd candidates are not important. If both of total_vm and RSS catches the same process as 1st candidate, it's ok. (i.e. If killed, oom situation will go away.) > > ya, I'm now considering to drop file_rss from calculation. > > > > some reasons. > > > > - file caches remaining in memory at OOM tend to have some trouble to remove it. > > - file caches tend to be shared. > > - if file caches are from shmem, we never be able to drop them if no swap/swapfull. > > > > Maybe we'll have better result. > > > > That sounds more appropriate. > > I'm surprised you still don't see a value in using the peak VM and RSS > sizes, though, as part of your formula as it would indicate the proportion > of memory resident in RAM at the time of oom. > I'll use swap_usage instead of peak VM size as bonus. anon_rss + swap_usage/2 ? or some. My first purpose is not to kill not-guilty process at random. If memory eater is killed, it's reasnoable. In my consideration - "Killing a process because of OOM" is something bad, but not avoidable. - We don't need to do compliated/too-wise calculation for killing a process. "The worst one is memory-eater!" is easy to understand to users and admins. - We have oom_adj, now. User can customize it if he run _important_ memory eater. 
- But a fork-bomb doesn't look like a memory eater if we look at each
  process individually.  We need some care there.

Then,
- I'd like to drop file_rss.
- I'd like to take swap_usage into account.
- I'd like to remove the cpu_time bonus.  The runtime bonus is much more
  important.
- I'd like to remove the penalty from children.  To do that, a fork-bomb
  detector is necessary.
- The nice bonus is bad.  (We have oom_adj instead of this.)  It should be

	if (task_nice(p) < 0)
		points /= 2;

  but we have the "root user" bonus already, so we can remove this line.

After the above, a much simpler, easier-to-understand selection will be
done.

Thanks,
-Kame
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  9:16 ` KAMEZAWA Hiroyuki
@ 2009-10-29  9:44 ` David Rientjes
  2009-10-29 23:41 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread

From: David Rientjes @ 2009-10-29 9:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
    Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009, KAMEZAWA Hiroyuki wrote:

> > > yes, then I wrote "as start point". There are many environments.
> >
> > And this environment has a particularly bad result.
>
> Yes, that is why I wrote "as a start point".  There are many
> environments.
>
> In my understanding, the 2nd and 3rd candidates are not important.  If
> both total_vm and RSS catch the same process as the 1st candidate, it's
> OK.  (i.e. if it is killed, the oom situation will go away.)

The ordering would matter on a machine with smaller capacity, or if Vedran
was using mem=, theoretically at the size of its current capacity minus
the amount of anonymous memory being mlocked by the "test" program.  When
the oom occurs (and notice it's not triggered by "test" each time), it
would have killed Xorg in what would otherwise be the same conditions.

> > I'm surprised you still don't see a value in using the peak VM and RSS
> > sizes, though, as part of your formula as it would indicate the
> > proportion of memory resident in RAM at the time of oom.
>
> I'll use swap_usage instead of peak VM size as a bonus:
>
>	anon_rss + swap_usage/2, or something like that.
>
> My first purpose is not to kill an innocent process at random.  If the
> memory eater is killed, that is reasonable.

We again arrive at the distinction I made earlier, where there are two
approaches: kill a task that is consuming the majority of resident RAM,
or kill a thread that is using much more memory than expected, such as a
memory leaker.
I know that you've argued the kernel can never know the latter, and I
agree, but that approach does have the benefit of allowing the user more
input to determine when a task is using much more RAM than expected; the
anon_rss and swap_usage in your formula are highly dynamic, so you'd have
to expect the user to dynamically alter oom_adj to specify a preference
in the case of a memory leaker.

> In my consideration:
>
> - "Killing a process because of OOM" is something bad, but not
>   avoidable.
> - We don't need a complicated/too-clever calculation for killing a
>   process.  "The worst one is the memory eater!" is easy for users and
>   admins to understand.

Is this a proposal to remove the remainder of the heuristics as well,
such as considering superuser tasks and those with longer uptimes?  I'd
agree with removing most of it other than the oom_adj and
current->mems_allowed intersection penalty.  We're probably going to need
to rewrite the badness heuristic from scratch instead of simply changing
the baseline.

> - We have oom_adj now.  A user can customize it if he runs an
>   _important_ memory eater.

If he runs an important memory eater, he can always polarize it by
disabling oom killing completely for that task.  However, oom_adj is also
used to identify memory leakers when the amount of memory they use is
roughly known.  Most people don't know how much memory their applications
use, but there are systems where users have tuned oom_adj specifically
based on comparative /proc/pid/oom_score results.  Simply using anon_rss
and swap_usage will make that vary much more than previously.

> - But a fork-bomb doesn't look like a memory eater if we look at each
>   process individually.  We need some care there.

The forkbomb can be addressed in multiple ways, the most simple of which
is simply counting the number of children and their runtime.  It'd
probably be even better to isolate the forkbomb case away from the badness
score and simply kill the parent by returning ULONG_MAX when it's
recognized.
> Then,
> - I'd like to drop file_rss.
> - I'd like to take swap_usage into account.
> - I'd like to remove the cpu_time bonus.  The runtime bonus is much more
>   important.
> - I'd like to remove the penalty from children.  To do that, a fork-bomb
>   detector is necessary.
> - The nice bonus is bad.  (We have oom_adj instead of this.)  It should
>   be
>
>	if (task_nice(p) < 0)
>		points /= 2;
>
>   but we have the "root user" bonus already, so we can remove this line.
>
> After the above, a much simpler, easier-to-understand selection will be
> done.

Agreed, I think we'll need to rewrite most of the heuristic from scratch.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  9:44 ` David Rientjes
@ 2009-10-29 23:41 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread

From: KAMEZAWA Hiroyuki @ 2009-10-29 23:41 UTC (permalink / raw)
To: David Rientjes
Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
    Andrea Arcangeli, vedran.furac, KOSAKI Motohiro

On Thu, 29 Oct 2009 02:44:45 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> > Then,
> > - I'd like to drop file_rss.
> > - I'd like to take swap_usage into account.
> > - I'd like to remove the cpu_time bonus.  The runtime bonus is much
> >   more important.
> > - I'd like to remove the penalty from children.  To do that, a
> >   fork-bomb detector is necessary.
> > - The nice bonus is bad.  (We have oom_adj instead of this.)  It
> >   should be
> >
> >	if (task_nice(p) < 0)
> >		points /= 2;
> >
> >   but we have the "root user" bonus already, so we can remove this
> >   line.
> >
> > After the above, a much simpler, easier-to-understand selection will
> > be done.
>
> Agreed, I think we'll need to rewrite most of the heuristic from
> scratch.

I'd like to post a total redesign of the oom-killer next week.
Please wait.

Thanks,
-Kame
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  8:31 ` David Rientjes
  2009-10-29  8:46 ` KAMEZAWA Hiroyuki
@ 2009-11-01 13:29 ` KOSAKI Motohiro
  2009-11-02 10:42 ` David Rientjes
  2009-11-25 12:44 ` Andrea Arcangeli
  2 siblings, 1 reply; 29+ messages in thread

From: KOSAKI Motohiro @ 2009-11-01 13:29 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
    Hugh Dickins, Andrea Arcangeli, vedran.furac

> This patch would pick the memory hogging task, "test", first every time,
> just like the current implementation does.  It would then prefer Xorg,
> icedove-bin, and ktorrent next as a starting point.
>
> Admittedly, there are other heuristics that the oom killer uses to
> create a badness score.  But since this patch is only changing the
> baseline from mm->total_vm to get_mm_rss(mm), its behavior in this test
> case does not match the patch description.
>
> The vast majority of the other ooms have identical top 8 candidates:
>
> total_vm
> 673222 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 121869 firefox-bin
>
> rss
> 672271 test
>  42192 Xorg
>  30763 firefox-bin
>  13292 icedove-bin
>  10208 ktorrent
>   9260 akregator
>   8859 plasma-desktop
>   7528 krunner
>
> firefox-bin seems much more preferred in this case than with total_vm,
> but Xorg still ranks very high with this patch compared to the current
> implementation.

Hi David,

I'm very interested in what you point out; thanks for the good testing.
So, I'd like to clarify your point a bit.

The following is the badness list on my desktop environment (x86_64,
6GB mem).  It shows Xorg with a pretty small badness score.  Do you know
why such a difference happens?
score	pid	comm
==============================
56382	3241	run-mozilla.sh
23345	3289	run-mozilla.sh
21461	3050	gnome-do
20079	2867	gnome-session
14016	3258	firefox
 9212	3306	firefox
 8468	3115	gnome-do
 6902	3325	emacs
 6783	3212	tomboy
 4865	2968	python
 4861	2948	nautilus
 4221	1	init
(snip about 100 lines)
  548	2590	Xorg
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-01 13:29 ` KOSAKI Motohiro
@ 2009-11-02 10:42 ` David Rientjes
  2009-11-02 12:35 ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread

From: David Rientjes @ 2009-11-02 10:42 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
    Hugh Dickins, Andrea Arcangeli, vedran.furac

On Sun, 1 Nov 2009, KOSAKI Motohiro wrote:

> > total_vm
> > 673222 test
> > 195695 krunner
> > 168881 plasma-desktop
> > 130567 ktorrent
> > 127081 knotify4
> > 125881 icedove-bin
> > 123036 akregator
> > 121869 firefox-bin
> >
> > rss
> > 672271 test
> >  42192 Xorg
> >  30763 firefox-bin
> >  13292 icedove-bin
> >  10208 ktorrent
> >   9260 akregator
> >   8859 plasma-desktop
> >   7528 krunner
> >
> > firefox-bin seems much more preferred in this case than with total_vm,
> > but Xorg still ranks very high with this patch compared to the current
> > implementation.
>
> Hi David,
>
> I'm very interested in what you point out; thanks for the good testing.
> So, I'd like to clarify your point a bit.
>
> The following is the badness list on my desktop environment (x86_64,
> 6GB mem).  It shows Xorg with a pretty small badness score.  Do you know
> why such a difference happens?

I don't know specifically what's different on your machine than Vedran's;
my data is simply a collection of the /proc/sys/vm/oom_dump_tasks output
from Vedran's oom log.

I guess we could add a call to badness() for the oom_dump_tasks tasklist
dump to get a clearer picture, so we know the score for each thread group
leader.  Anything else would be speculation at this point, though.
> score	pid	comm
> ==============================
> 56382	3241	run-mozilla.sh
> 23345	3289	run-mozilla.sh
> 21461	3050	gnome-do
> 20079	2867	gnome-session
> 14016	3258	firefox
>  9212	3306	firefox
>  8468	3115	gnome-do
>  6902	3325	emacs
>  6783	3212	tomboy
>  4865	2968	python
>  4861	2948	nautilus
>  4221	1	init
> (snip about 100 lines)
>   548	2590	Xorg

Are these scores with your rss patch or without?  If it's without the
patch, this is understandable, since Xorg didn't appear highly in Vedran's
log either.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-02 10:42 ` David Rientjes
@ 2009-11-02 12:35 ` KOSAKI Motohiro
  2009-11-02 19:55 ` Vedran Furač
  0 siblings, 1 reply; 29+ messages in thread

From: KOSAKI Motohiro @ 2009-11-02 12:35 UTC (permalink / raw)
To: David Rientjes, vedran.furac
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
    Hugh Dickins, Andrea Arcangeli

>> Hi David,
>>
>> I'm very interested in what you point out; thanks for the good testing.
>> So, I'd like to clarify your point a bit.
>>
>> The following is the badness list on my desktop environment (x86_64,
>> 6GB mem).  It shows Xorg with a pretty small badness score.  Do you
>> know why such a difference happens?
>
> I don't know specifically what's different on your machine than
> Vedran's; my data is simply a collection of the
> /proc/sys/vm/oom_dump_tasks output from Vedran's oom log.
>
> I guess we could add a call to badness() for the oom_dump_tasks tasklist
> dump to get a clearer picture, so we know the score for each thread
> group leader.  Anything else would be speculation at this point, though.
>
>> score	pid	comm
>> ==============================
>> 56382	3241	run-mozilla.sh
>> 23345	3289	run-mozilla.sh
>> 21461	3050	gnome-do
>> 20079	2867	gnome-session
>> 14016	3258	firefox
>>  9212	3306	firefox
>>  8468	3115	gnome-do
>>  6902	3325	emacs
>>  6783	3212	tomboy
>>  4865	2968	python
>>  4861	2948	nautilus
>>  4221	1	init
>> (snip about 100 lines)
>>   548	2590	Xorg
>
> Are these scores with your rss patch or without?  If it's without the
> patch, this is understandable, since Xorg didn't appear highly in
> Vedran's log either.

Oh, I'm sorry: I measured with the rss patch.  Then I still don't
understand what gives Xorg a bad score.

Hmm...
Vedran, can you please post the result of the following command?

  # cat /proc/`pidof Xorg`/smaps

I hope to understand the issue clearly before modifying any code.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-02 12:35 ` KOSAKI Motohiro
@ 2009-11-02 19:55 ` Vedran Furač
  2009-11-03 23:09 ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread

From: Vedran Furač @ 2009-11-02 19:55 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, Andrea Arcangeli

KOSAKI Motohiro wrote:

> Oh, I'm sorry: I measured with the rss patch.  Then I still don't
> understand what gives Xorg a bad score.
>
> Hmm...
> Vedran, can you please post the result of the following command?
>
>   # cat /proc/`pidof Xorg`/smaps
>
> I hope to understand the issue clearly before modifying any code.

No problem:

  http://pastebin.com/d66972025 (long)

Xorg is from debian unstable.

Regards,
Vedran
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-02 19:55 ` Vedran Furač
@ 2009-11-03 23:09 ` KOSAKI Motohiro
  2009-11-07 19:16 ` Vedran Furač
  0 siblings, 1 reply; 29+ messages in thread

From: KOSAKI Motohiro @ 2009-11-03 23:09 UTC (permalink / raw)
To: vedran.furac
Cc: kosaki.motohiro, David Rientjes, KAMEZAWA Hiroyuki, linux-mm,
    linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli

> KOSAKI Motohiro wrote:
>
> > Oh, I'm sorry: I measured with the rss patch.  Then I still don't
> > understand what gives Xorg a bad score.
> >
> > Hmm...
> > Vedran, can you please post the result of the following command?
> >
> >   # cat /proc/`pidof Xorg`/smaps
> >
> > I hope to understand the issue clearly before modifying any code.
>
> No problem:
>
>   http://pastebin.com/d66972025 (long)
>
> Xorg is from debian unstable.

Hmm... your Xorg has a pretty large heap.  I'm not sure why that happens
(an ATI video card issue?).  Unfortunately, it shows up to the kernel as
a normal large heap, so I doubt the kernel can distinguish X from other
processes.  Probably oom_adj is the most reasonable option...

-------------------------------------------------
[heap]
Size:             433812 kB
Rss:              433304 kB
Pss:              433304 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:       280 kB
Private_Dirty:    433024 kB
Referenced:       415656 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-03 23:09 ` KOSAKI Motohiro
@ 2009-11-07 19:16 ` Vedran Furač
  0 siblings, 0 replies; 29+ messages in thread

From: Vedran Furač @ 2009-11-07 19:16 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, Andrea Arcangeli

KOSAKI Motohiro wrote:

> Your Xorg has a pretty large heap.  I'm not sure why that happens
> (an ATI video card issue?).

It is ATI (fglrx), but I don't know whether it is a driver issue or not.
I have a lot of apps running, firefox with a high number of tabs, and so
on.  It adds up, probably.

> Unfortunately, it shows up to the kernel as a normal large heap, so I
> doubt the kernel can distinguish X from other processes.  Probably
> oom_adj is the most reasonable option...
>
> -------------------------------------------------
> [heap]
> Size:             433812 kB
> Rss:              433304 kB
> Pss:              433304 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:       280 kB
> Private_Dirty:    433024 kB
> Referenced:       415656 kB
> Swap:                  0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-10-29  8:31 ` David Rientjes
  2009-10-29  8:46 ` KAMEZAWA Hiroyuki
  2009-11-01 13:29 ` KOSAKI Motohiro
@ 2009-11-25 12:44 ` Andrea Arcangeli
  2009-11-25 21:39 ` David Rientjes
  2009-11-26  0:10 ` Vedran Furač
  2 siblings, 2 replies; 29+ messages in thread

From: Andrea Arcangeli @ 2009-11-25 12:44 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
    Hugh Dickins, vedran.furac, KOSAKI Motohiro

Hello,

this is a lengthy discussion on something I think is quite obviously
better, and that I already tried to change a couple of years back: rss
instead of total_vm.

On Thu, Oct 29, 2009 at 01:31:59AM -0700, David Rientjes wrote:

> total_vm
> 708945 test
> 195695 krunner
> 168881 plasma-desktop
> 130567 ktorrent
> 127081 knotify4
> 125881 icedove-bin
> 123036 akregator
> 118641 kded4
>
> rss
> 707878 test
>  42201 Xorg
>  13300 icedove-bin
>  10209 ktorrent
>   9277 akregator
>   8878 plasma-desktop
>   7546 krunner
>   4532 mysqld
>
> This patch would pick the memory hogging task, "test", first every time,

That is by far the only thing that matters.  There's plenty of logic in
the oom killer to remove races with tasks that have TIF_MEMDIE set, to
ensure we do not fall through to the second task until the first task has
had time to release all of its memory back to the system.

> just like the current implementation does.  It would then prefer Xorg,

You're focusing on the noise and not looking at the only thing that
matters.

The noise level with rss went down to 50000; the ordering of what is
below 50000 doesn't matter.  The only thing that matters is the _delta_
between the noise-level innocent apps and the exploit.

The delta clearly increases, from 708945-max(noise) to 707878-max(noise),
which translates to an increase in precision from 513250 to 665677.  That
shows how much more accurate rss makes the detection (i.e. the distance
between the exploit and the first innocent app).
The lower the noise level starts, the less likely it is that innocent
apps get killed.

There's simply no way to reach perfection; some innocent apps will always
have high total_vm or rss levels, but this at least removes lots of
innocent apps from the equation.  The fact that X isn't less innocent
than before is because its rss is quite big, and this is not an error;
luckily it is much smaller than the hog itself.  Surely there are ways to
force X to load huge bitmaps into its address space too (regardless of
total_vm or rss), but again, no perfection: just better with rss, even in
this testcase.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-25 12:44 ` Andrea Arcangeli
@ 2009-11-25 21:39 ` David Rientjes
  2009-11-27 18:26 ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread

From: David Rientjes @ 2009-11-25 21:39 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
    Hugh Dickins, vedran.furac, KOSAKI Motohiro

On Wed, 25 Nov 2009, Andrea Arcangeli wrote:

> You're focusing on the noise and not looking at the only thing that
> matters.
>
> The noise level with rss went down to 50000; the ordering of what is
> below 50000 doesn't matter.  The only thing that matters is the _delta_
> between the noise-level innocent apps and the exploit.
>
> The delta clearly increases, from 708945-max(noise) to
> 707878-max(noise), which translates to an increase in precision from
> 513250 to 665677.  That shows how much more accurate rss makes the
> detection (i.e. the distance between the exploit and the first innocent
> app).  The lower the noise level starts, the less likely it is that
> innocent apps get killed.

That's not surprising, since the amount of physical RAM is the
constraining factor.

> There's simply no way to reach perfection; some innocent apps will
> always have high total_vm or rss levels, but this at least removes lots
> of innocent apps from the equation.  The fact that X isn't less innocent
> than before is because its rss is quite big, and this is not an error;
> luckily it is much smaller than the hog itself.  Surely there are ways
> to force X to load huge bitmaps into its address space too (regardless
> of total_vm or rss), but again, no perfection: just better with rss,
> even in this testcase.

We use the oom killer as a mechanism to enforce memory containment
policy; we are much more interested in the oom killing priority than in
the oom killer's own heuristics for determining the ideal task to kill.
Those heuristics can't possibly represent the priorities of all possible
workloads, so we require input from the user via /proc/pid/oom_adj to
adjust the heuristic.  That has traditionally always used total_vm as a
baseline, which is a much more static value and can be quantified within
a reasonable range, via experimental data, for when it would not be
defined as rogue.  By changing the baseline to rss, we lose much of that
control, since rss is more dynamic and dependent on the current state of
the machine at the time of the oom, and can therefore be predicted with
less accuracy.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-25 21:39 ` David Rientjes
@ 2009-11-27 18:26 ` Andrea Arcangeli
  2009-11-30 23:09 ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread

From: Andrea Arcangeli @ 2009-11-27 18:26 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
    Hugh Dickins, vedran.furac, KOSAKI Motohiro

On Wed, Nov 25, 2009 at 01:39:59PM -0800, David Rientjes wrote:
> adjust that heuristic.  That has traditionally always used total_vm as a
> baseline, which is a much more static value and can be quantified within
> a reasonable range, via experimental data, for when it would not be
> defined as rogue.  By changing the baseline to rss, we lose much of that
> control, since rss is more dynamic and dependent on the current state of
> the machine at the time of the oom, and can therefore be predicted with
> less accuracy.

OK, I can see that its being dynamic and less predictable worries you.
The "second to last" tasks especially are going to be less predictable,
but the memory hog would normally end up accounting for most of the
memory, and this should increase the badness delta between the offending
task (or tasks) and the innocent stuff, making the selection more
reliable.  The innocent stuff should be more and more paged out of RAM.
So I tend to think it'll be much less likely that an innocent task is
killed this way (as demonstrated in practice by your measurement too),
but it's true there's no guarantee it'll always do the right thing,
because it's a heuristic anyway.  Even total_vm doesn't provide a
guarantee, though, unless your workload is stationary, your badness
scores are fixed, no virtual memory is ever allocated by any task in the
system, and no new tasks are spawned.

It'd help if you posted a regression showing a smaller delta between the
oom-target task and the second task.  My email was just to point out that
your measurement was a good thing in oom-killing terms.
If I have to imagine the worst case for this, it is an app allocating
memory at a very slow pace and then slowly getting swapped out, ending up
with a huge swap size.  Maybe we need to add swap size to rss, dunno, but
the paged-out MAP_SHARED equivalent can't be accounted the way we can
account swap size, so in practice I feel a raw rss is going to be more
practical than making swap special versus file mappings.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-27 18:26 ` Andrea Arcangeli
@ 2009-11-30 23:09 ` David Rientjes
  2009-12-01  4:43 ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread

From: David Rientjes @ 2009-11-30 23:09 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, Andrew Morton,
    Hugh Dickins, vedran.furac, KOSAKI Motohiro

On Fri, 27 Nov 2009, Andrea Arcangeli wrote:

> OK, I can see that its being dynamic and less predictable worries you.
> The "second to last" tasks especially are going to be less predictable,
> but the memory hog would normally end up accounting for most of the
> memory, and this should increase the badness delta between the offending
> task (or tasks) and the innocent stuff, making the selection more
> reliable.  The innocent stuff should be more and more paged out of RAM.
> So I tend to think it'll be much less likely that an innocent task is
> killed this way (as demonstrated in practice by your measurement too),
> but it's true there's no guarantee it'll always do the right thing,
> because it's a heuristic anyway.  Even total_vm doesn't provide a
> guarantee, though, unless your workload is stationary, your badness
> scores are fixed, no virtual memory is ever allocated by any task in the
> system, and no new tasks are spawned.

The purpose of /proc/pid/oom_adj is not always to polarize the heuristic
for the task it represents; it allows userspace to define when a task is
rogue.  Working with total_vm as a baseline, it is simple to use the
interface to tune the heuristic to prefer a certain task over another
when its memory consumption goes beyond what is expected.  With this
interface, I can easily define when an application should be oom killed
because it is using far more memory than expected.  I can also disable
oom killing completely for it, if necessary.
Using rss does not allow users to statically define when a task is rogue and is dependent on the current state of memory at the time of oom. I would support removing most of the other heuristics other than the baseline and the nodes intersection with mems_allowed to prefer tasks in the same cpuset, though, to make it easier to understand and tune. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-30 23:09 ` David Rientjes
@ 2009-12-01  4:43 ` KOSAKI Motohiro
  2009-12-01 22:20 ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread

From: KOSAKI Motohiro @ 2009-12-01 4:43 UTC (permalink / raw)
To: David Rientjes
Cc: kosaki.motohiro, Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm,
    linux-kernel, Andrew Morton, Hugh Dickins, vedran.furac

> On Fri, 27 Nov 2009, Andrea Arcangeli wrote:
>
> > OK, I can see that its being dynamic and less predictable worries you.
> > The "second to last" tasks especially are going to be less
> > predictable, but the memory hog would normally end up accounting for
> > most of the memory, and this should increase the badness delta between
> > the offending task (or tasks) and the innocent stuff, making the
> > selection more reliable.  The innocent stuff should be more and more
> > paged out of RAM.  So I tend to think it'll be much less likely that
> > an innocent task is killed this way (as demonstrated in practice by
> > your measurement too), but it's true there's no guarantee it'll always
> > do the right thing, because it's a heuristic anyway.  Even total_vm
> > doesn't provide a guarantee, though, unless your workload is
> > stationary, your badness scores are fixed, no virtual memory is ever
> > allocated by any task in the system, and no new tasks are spawned.
>
> The purpose of /proc/pid/oom_adj is not always to polarize the heuristic
> for the task it represents; it allows userspace to define when a task is
> rogue.  Working with total_vm as a baseline, it is simple to use the
> interface to tune the heuristic to prefer a certain task over another
> when its memory consumption goes beyond what is expected.  With this
> interface, I can easily define when an application should be oom killed
> because it is using far more memory than expected.  I can also disable
> oom killing completely for it, if necessary.
> Unless you have a consistent baseline for all tasks, the adjustment
> wouldn't contextually make any sense.  Using rss does not allow users to
> statically define when a task is rogue, since it depends on the current
> state of memory at the time of oom.
>
> I would support removing most of the other heuristics other than the
> baseline and the nodes' intersection with mems_allowed (to prefer tasks
> in the same cpuset), though, to make it easier to understand and tune.

I feel you are saying that oom_adj doesn't fit your use case; probably
you need a new /proc/{pid}/oom_priority knob.  The oom adjustment doesn't
fit you: you need a job-severity-based oom killing order, and severity
doesn't depend on any heuristic.  A server administrator should know the
severity of the jobs on his system.

The OOM heuristic should mainly consider desktop usage, because desktop
users don't change the oom knobs at all, and they don't know which daemon
is important.  Any useful heuristic has some dynamic aspect; we can't
avoid it.

Thoughts?
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-01  4:43 ` KOSAKI Motohiro
@ 2009-12-01 22:20 ` David Rientjes
  2009-12-02  0:35 ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-12-01 22:20 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, vedran.furac

On Tue, 1 Dec 2009, KOSAKI Motohiro wrote:

> > The purpose of /proc/pid/oom_adj is not always to polarize the
> > heuristic for the task it represents; it allows userspace to define
> > when a task is rogue. [...]
>
> I feel you are saying that oom_adj doesn't fit your use case. Probably
> you need a new /proc/{pid}/oom_priority knob; oom adjustment doesn't fit
> you. You need a job-severity-based oom killing order, and severity
> doesn't depend on any heuristic. A server administrator should know the
> severity of the jobs on his system.

That's the complete opposite of what I wrote above: we use oom_adj to
define when a user application is considered "rogue," meaning that it is
using far more memory than expected, and so we want it killed. As you
mentioned weeks ago, the kernel cannot identify a memory leaker; this is
the user interface that allows the oom killer to identify a
memory-hogging rogue task that will (probably) consume all system memory
eventually.

The way oom_adj is implemented, with a bit shift on a baseline of
total_vm, it can also polarize the badness heuristic to kill an
application based on priority by examining /proc/pid/oom_score, but that
wasn't my concern in this case. Using rss as a baseline reduces my
ability to tune oom_adj appropriately to identify those rogue tasks
because rss is highly dynamic, depending on the state of the VM at the
time of oom.
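The bit-shift mechanism David refers to can be made concrete. Below is a minimal C sketch of the 2.6.32-era polarization step, heavily simplified from the kernel's `badness()` in mm/oom_kill.c — the real function also weighs children's memory, CPU and run time, nice level, and capabilities, and special-cases OOM_DISABLE (-17), so treat this only as an illustration of the baseline-plus-shift idea under discussion:

```c
#include <assert.h>

/*
 * Simplified sketch of how oom_adj polarizes the badness score:
 * start from total_vm (in pages) and bit-shift by oom_adj.
 * Positive values make the task more likely to be killed,
 * negative values protect it.
 */
static unsigned long badness_sketch(unsigned long total_vm_pages, int oom_adj)
{
	unsigned long points = total_vm_pages;

	if (oom_adj > 0) {
		if (points)
			points <<= oom_adj;	/* prefer this task as victim */
		else
			points = 1;		/* never shift a zero score */
	} else if (oom_adj < 0) {
		points >>= -oom_adj;		/* shield this task */
	}
	return points;
}
```

Because total_vm changes slowly for a given workload, an administrator can choose an oom_adj value once and know roughly when the task becomes the preferred victim; with an rss baseline the same shift applies to a quantity that varies with reclaim pressure at the moment of oom, which is the crux of the objection here.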
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-01 22:20 ` David Rientjes
@ 2009-12-02  0:35 ` KOSAKI Motohiro
  2009-12-03 23:25 ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2009-12-02  0:35 UTC (permalink / raw)
To: David Rientjes
Cc: kosaki.motohiro, Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm,
    linux-kernel, Andrew Morton, Hugh Dickins, vedran.furac

> On Tue, 1 Dec 2009, KOSAKI Motohiro wrote:
>
> > [...]
>
> That's the complete opposite of what I wrote above: we use oom_adj to
> define when a user application is considered "rogue," meaning that it is
> using far more memory than expected, and so we want it killed. [...]
> Using rss as a baseline reduces my ability to tune oom_adj appropriately
> to identify those rogue tasks because rss is highly dynamic, depending
> on the state of the VM at the time of oom.

- I mean you hardly need the kernel heuristic at all, but desktop users
  need it.
- All job schedulers provide a memory limitation feature, but the OOM
  killer is not there to implement memory limitation; we have the memory
  cgroup for that.
- If you need a memory-usage-based knob, reading /proc/{pid}/statm and
  writing /proc/{pid}/oom_priority would probably work well.
- Unfortunately, we can't continue to use VSZ-based heuristics, because
  modern applications waste 10x more VSZ than their RSS consumption.
  Nowadays VSZ isn't a good approximation of RSS, and there isn't any
  good reason to continue with it from the desktop user's view.

IOW, the kernel heuristic should target the majority of users, and we
provide a knob to help the minority. Or, do you have any idea for
detecting the difference between a typical desktop and your use case?
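The userspace loop KOSAKI suggests — poll each task's statm and feed the result to a priority knob — can be sketched. Note the assumptions: /proc/{pid}/oom_priority is only a proposed knob in this thread and was never merged, while /proc/<pid>/statm is real and holds seven space-separated page counts with the resident set size in the second field. A minimal sketch of the parsing half:

```c
#include <stdio.h>
#include <assert.h>

/*
 * Parse a /proc/<pid>/statm line: "size resident shared text lib data dt"
 * (all values in pages). Returns the resident set size in pages, or -1
 * on a malformed line. A monitoring daemon would read this periodically
 * and write a derived priority into the (hypothetical, never-merged)
 * /proc/<pid>/oom_priority knob.
 */
static long rss_pages_from_statm(const char *line)
{
	long size, resident;

	if (!line || sscanf(line, "%ld %ld", &size, &resident) != 2)
		return -1;
	return resident;
}
```

This is exactly the scheme David objects to below: the value is only as fresh as the last polling interval, so it cannot react to a leak that blows up between samples.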
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-02  0:35 ` KOSAKI Motohiro
@ 2009-12-03 23:25 ` David Rientjes
  2009-12-04  0:44 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2009-12-03 23:25 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrea Arcangeli, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, vedran.furac

On Wed, 2 Dec 2009, KOSAKI Motohiro wrote:

> - I mean you hardly need the kernel heuristic at all, but desktop users
>   need it.

My point is that userspace needs to be able to identify memory leaking
tasks and polarize oom killing priorities. /proc/pid/oom_adj does a good
job of both with total_vm as a baseline.

> - All job schedulers provide a memory limitation feature, but the OOM
>   killer is not there to implement memory limitation; we have the memory
>   cgroup for that.

Wrong, the oom killer implements cpuset memory limitations.

> - If you need a memory-usage-based knob, reading /proc/{pid}/statm and
>   writing /proc/{pid}/oom_priority would probably work well.

Constantly polling /proc/pid/statm and updating the oom killer
priorities at a constant interval is a ridiculous proposal for
identifying memory leakers, sorry.

> - Unfortunately, we can't continue to use VSZ-based heuristics, because
>   modern applications waste 10x more VSZ than their RSS consumption.
>   Nowadays VSZ isn't a good approximation of RSS, and there isn't any
>   good reason to continue with it from the desktop user's view.

Then leave the heuristic alone by default, so we don't lose any
functionality that we once had, and add additional heuristics depending
on the environment as determined by the manipulation of a new tunable.

> IOW, the kernel heuristic should target the majority of users, and we
> provide a knob to help the minority.

Moving the baseline to rss severely impacts the legitimacy of that knob;
we lose a lot of control over identifying memory leakers and polarizing
oom killer priorities, because the score then depends on the state of
the VM at the time of oom, which /proc/pid/oom_adj may not have recently
been updated to reflect.

I don't know why you continuously invoke the same arguments to
completely change the baseline for the oom killer heuristic: you falsely
believe that killing the task with the most memory resident in RAM is
more often than not the ideal task to kill. It's very frustrating when
you insist on changing the default heuristic based on your own belief
that people use Linux in the same way you do.

If Andrew pushes the patch to change the baseline to rss
(oom_kill-use-rss-instead-of-vm-size-for-badness.patch) to Linus, I'll
strongly nack it, because it removes the ability to identify memory
leakers as defined by userspace, which should be the prime target for
the oom killer. You have not addressed that problem, you've merely
talked around it, and yet the patch unbelievably still sits in -mm.
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-12-03 23:25 ` David Rientjes
@ 2009-12-04  0:44 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-04  0:44 UTC (permalink / raw)
To: David Rientjes
Cc: KOSAKI Motohiro, Andrea Arcangeli, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, vedran.furac

On Thu, 3 Dec 2009 15:25:05 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> If Andrew pushes the patch to change the baseline to rss
> (oom_kill-use-rss-instead-of-vm-size-for-badness.patch) to Linus, I'll
> strongly nack it [...]

The oom-kill patches are still cooking, and I'll ask Andrew not to send
this one when he asks about the mm merge plan. At least a per-mm swap
counter and a lowmem-rss counter are necessary first. I'll rewrite the
fork-bomb detector, too.

To repeat myself: calculating badness from vm_size _only_ is bad. I'm
not sure how Google's magical applications work, but in general vm_size
doesn't mean private memory usage, i.e. how many pages the oom-killer
can actually free. And the current oom-killer kills the wrong process.
Please share your ideas for making the oom-killer better rather than
just saying "don't do that." Do you have a good algorithm for detecting
a memory-leaking process in userland? I think I added some help for
that in my old patch set, but it's not enough. I'll add more statistics
to mm_struct to do a better job.

BTW, I hate oom_adj very much. Its nature as a bit "shift" is hard to
understand. I wonder why a static oom priority or oom_threshold was
never implemented...

Thanks,
-Kame
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-25 12:44 ` Andrea Arcangeli
  2009-11-25 21:39 ` David Rientjes
@ 2009-11-26  0:10 ` Vedran Furač
  2009-11-26  1:32 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 29+ messages in thread
From: Vedran Furač @ 2009-11-26  0:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: David Rientjes, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, KOSAKI Motohiro

Andrea Arcangeli wrote:

> Hello,

Hi all!

> lengthy discussion on something I think is quite obviously better and
> I tried to change a couple of years back already (rss instead of
> total_vm).

Now that 2.6.32 is almost out, is it possible to get the OOM killer
fixed in 2.6.33, so that I could turn overcommit on (overcommit_memory=0)
again without fear of losing my work?

Regards,
Vedran

--
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-26  0:10 ` Vedran Furač
@ 2009-11-26  1:32 ` KAMEZAWA Hiroyuki
  2009-11-27  1:56 ` Vedran Furač
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-26  1:32 UTC (permalink / raw)
To: vedran.furac
Cc: Andrea Arcangeli, David Rientjes, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, KOSAKI Motohiro

On Thu, 26 Nov 2009 01:10:12 +0100
Vedran Furač <vedran.furac@gmail.com> wrote:

> Now that 2.6.32 is almost out, is it possible to get the OOM killer
> fixed in 2.6.33, so that I could turn overcommit on (overcommit_memory=0)
> again without fear of losing my work?

I'll try the fork-bomb detector again. That will finally help your
X.org, but it may miss 2.6.33.

Adding a new counter to mm_struct was rejected because of scalability
concerns, so the total work will need more time than expected. I'm
sorry I can't get enough time in these weeks.

Thanks,
-Kame
* Re: [PATCH] oom_kill: use rss value instead of vm size for badness
  2009-11-26  1:32 ` KAMEZAWA Hiroyuki
@ 2009-11-27  1:56 ` Vedran Furač
  0 siblings, 0 replies; 29+ messages in thread
From: Vedran Furač @ 2009-11-27  1:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, David Rientjes, linux-mm, linux-kernel,
    Andrew Morton, Hugh Dickins, KOSAKI Motohiro

KAMEZAWA Hiroyuki wrote:

> I'll try the fork-bomb detector again. That will finally help your
> X.org, but it may miss 2.6.33.
>
> Adding a new counter to mm_struct was rejected because of scalability
> concerns, so the total work will need more time than expected. I'm
> sorry I can't get enough time in these weeks.

Thanks for working on this! Hope it gets into 2.6.33. Keep me posted.

Regards,
Vedran

--
http://vedranf.net | a8e7a7783ca0d460fee090cc584adc12
end of thread, other threads:[~2009-12-04  0:47 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-28  8:58 [PATCH] oom_kill: use rss value instead of vm size for badness KAMEZAWA Hiroyuki
2009-10-28  9:15 ` David Rientjes
2009-10-28 11:04 ` KAMEZAWA Hiroyuki
2009-10-29  1:00 ` KAMEZAWA Hiroyuki
2009-10-29  2:31 ` Minchan Kim
2009-10-29  8:31 ` David Rientjes
2009-10-29  8:46 ` KAMEZAWA Hiroyuki
2009-10-29  9:01 ` David Rientjes
2009-10-29  9:16 ` KAMEZAWA Hiroyuki
2009-10-29  9:44 ` David Rientjes
2009-10-29 23:41 ` KAMEZAWA Hiroyuki
2009-11-01 13:29 ` KOSAKI Motohiro
2009-11-02 10:42 ` David Rientjes
2009-11-02 12:35 ` KOSAKI Motohiro
2009-11-02 19:55 ` Vedran Furač
2009-11-03 23:09 ` KOSAKI Motohiro
2009-11-07 19:16 ` Vedran Furač
2009-11-25 12:44 ` Andrea Arcangeli
2009-11-25 21:39 ` David Rientjes
2009-11-27 18:26 ` Andrea Arcangeli
2009-11-30 23:09 ` David Rientjes
2009-12-01  4:43 ` KOSAKI Motohiro
2009-12-01 22:20 ` David Rientjes
2009-12-02  0:35 ` KOSAKI Motohiro
2009-12-03 23:25 ` David Rientjes
2009-12-04  0:44 ` KAMEZAWA Hiroyuki
2009-11-26  0:10 ` Vedran Furač
2009-11-26  1:32 ` KAMEZAWA Hiroyuki
2009-11-27  1:56 ` Vedran Furač