From: David Rientjes <rientjes@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>, Nick Piggin <npiggin@suse.de>,
Oleg Nesterov <oleg@redhat.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
linux-mm@kvack.org
Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
Date: Tue, 8 Jun 2010 16:50:02 -0700 (PDT) [thread overview]
Message-ID: <alpine.DEB.2.00.1006081642370.19582@chino.kir.corp.google.com> (raw)
In-Reply-To: <20100608124246.9258ccab.akpm@linux-foundation.org>
On Tue, 8 Jun 2010, Andrew Morton wrote:
> > From: Oleg Nesterov <oleg@redhat.com>
> >
> > Almost all ->mm == NUL checks in oom_kill.c are wrong.
> >
> > The current code assumes that the task without ->mm has already
> > released its memory and ignores the process. However this is not
> > necessarily true when this process is multithreaded, other live
> > sub-threads can use this ->mm.
> >
> > - Remove the "if (!p->mm)" check in select_bad_process(), it is
> > just wrong.
> >
> > - Add the new helper, find_lock_task_mm(), which finds the live
> > thread which uses the memory and takes task_lock() to pin ->mm
> >
> > - change oom_badness() to use this helper instead of just checking
> > ->mm != NULL.
> >
> > - As David pointed out, select_bad_process() must never choose the
> > task without ->mm, but no matter what oom_badness() returns the
> > task can be chosen if nothing else has been found yet.
> >
> > Change oom_badness() to return int, change it to return -1 if
> > find_lock_task_mm() fails, and change select_bad_process() to
> > check points >= 0.
> >
> > Note! This patch is not enough, we need more changes.
> >
> > - oom_badness() was fixed, but oom_kill_task() still ignores
> > the task without ->mm
> >
> > - oom_forkbomb_penalty() should use find_lock_task_mm() too,
> > and it also needs other changes to actually find the first
> > first-descendant children
> >
> > This will be addressed later.
> >
> > [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
>
> I assume from the above that we should have a Signed-off-by:kosaki
> here. I didn't make that change yet - please advise.
>
Oops, that was accidently dropped, sorry about that. I folded two of his
patches into this one since it introduces find_lock_task_mm() and it needs
to be used in the places KOSAKI fixed as well. His original patches are
at
http://marc.info/?l=linux-mm&m=127537136419677
http://marc.info/?l=linux-mm&m=127537153619893
along with his sign-off.
>
> > mm/oom_kill.c | 74 +++++++++++++++++++++++++++++++++------------------------
> > 1 files changed, 43 insertions(+), 31 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> > return 0;
> > }
> >
> > +static struct task_struct *find_lock_task_mm(struct task_struct *p)
> > +{
> > + struct task_struct *t = p;
> > +
> > + do {
> > + task_lock(t);
> > + if (likely(t->mm))
> > + return t;
> > + task_unlock(t);
> > + } while_each_thread(p, t);
> > +
> > + return NULL;
> > +}
>
> What pins `p'? Ah, caller must hold tasklist_lock.
>
I'll add a comment about this in a followup patch, it should remove the
the confusion others have had about the naming of the function as well,
which I think is good but could use some explanation.
> > /**
> > * badness - calculate a numeric value for how bad this task has been
> > * @p: task struct of which task we should calculate
> > @@ -74,8 +88,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> > unsigned long badness(struct task_struct *p, unsigned long uptime)
> > {
> > unsigned long points, cpu_time, run_time;
> > - struct mm_struct *mm;
> > struct task_struct *child;
> > + struct task_struct *c, *t;
> > int oom_adj = p->signal->oom_adj;
> > struct task_cputime task_time;
> > unsigned long utime;
> > @@ -84,17 +98,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> > if (oom_adj == OOM_DISABLE)
> > return 0;
> >
> > - task_lock(p);
> > - mm = p->mm;
> > - if (!mm) {
> > - task_unlock(p);
> > + p = find_lock_task_mm(p);
> > + if (!p)
> > return 0;
> > - }
> >
> > /*
> > * The memory size of the process is the basis for the badness.
> > */
> > - points = mm->total_vm;
> > + points = p->mm->total_vm;
> >
> > /*
> > * After this unlock we can no longer dereference local variable `mm'
>
> This comment is stale. Replace with p->mm.
>
Indeed, find_lock_task_mm() returns with task_lock() held for p->mm here
so the deference is always safe. I'll send a followup.
> > @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> > * child is eating the vast majority of memory, adding only half
> > * to the parents will make the child our kill candidate of choice.
> > */
> > - list_for_each_entry(child, &p->children, sibling) {
> > - task_lock(child);
> > - if (child->mm != mm && child->mm)
> > - points += child->mm->total_vm/2 + 1;
> > - task_unlock(child);
> > - }
> > + t = p;
> > + do {
> > + list_for_each_entry(c, &t->children, sibling) {
> > + child = find_lock_task_mm(c);
> > + if (child) {
> > + if (child->mm != p->mm)
> > + points += child->mm->total_vm/2 + 1;
>
> What if 1000 children share the same mm? Doesn't this give a grossly
> wrong result?
>
It does, and that's why there has been large criticism about this
particular part of the heuristic over the past few months. It gets
removed in my badness() rewrite, but the change here is concerned solely
about the use_mm() race so closes a gap that currently exists.
> > + task_unlock(child);
> > + }
> > + }
> > + } while_each_thread(p, t);
> >
> > /*
> > * CPU time is in tens of seconds and run time is in thousands
> > @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > for_each_process(p) {
> > unsigned long points;
> >
> > - /* skip tasks that have already released their mm */
> > - if (!p->mm)
> > - continue;
> > /* skip the init task and kthreads */
> > if (is_global_init(p) || (p->flags & PF_KTHREAD))
> > continue;
> > @@ -385,14 +398,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> > return;
> > }
> >
> > - task_lock(p);
> > - if (!p->mm) {
> > - WARN_ON(1);
> > - printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
> > - task_pid_nr(p), p->comm);
> > - task_unlock(p);
> > + p = find_lock_task_mm(p);
> > + if (!p)
> > return;
> > - }
> >
> > if (verbose)
> > printk(KERN_ERR "Killed process %d (%s) "
> > @@ -437,6 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > const char *message)
> > {
> > struct task_struct *c;
> > + struct task_struct *t = p;
> >
> > if (printk_ratelimit())
> > dump_header(p, gfp_mask, order, mem);
> > @@ -454,14 +463,17 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > message, task_pid_nr(p), p->comm, points);
> >
> > /* Try to kill a child first */
>
> It'd be nice to improve the comments a bit. This one tells us the
> "what" (which is usually obvious) but didn't tell us "why", which is
> often the unobvious.
>
This gets modified in
oom-sacrifice-child-with-highest-badness-score-for-parent.patch, so I'll
expand upon it there and post a followup patch since it's already merged.
Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2010-06-08 23:50 UTC|newest]
Thread overview: 104+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
2010-06-06 22:34 ` [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
2010-06-07 12:12 ` Balbir Singh
2010-06-07 19:50 ` David Rientjes
2010-06-08 19:33 ` Andrew Morton
2010-06-08 23:40 ` David Rientjes
2010-06-08 23:52 ` Andrew Morton
2010-06-06 22:34 ` [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
2010-06-07 12:58 ` Balbir Singh
2010-06-07 13:49 ` Minchan Kim
2010-06-07 19:49 ` David Rientjes
2010-06-08 19:42 ` Andrew Morton
2010-06-08 20:14 ` Oleg Nesterov
2010-06-08 20:17 ` Oleg Nesterov
2010-06-08 21:34 ` Andrew Morton
2010-06-08 23:50 ` David Rientjes [this message]
2010-06-06 22:34 ` [patch 03/18] oom: dump_tasks use find_lock_task_mm too David Rientjes
2010-06-08 19:55 ` Andrew Morton
2010-06-09 0:06 ` David Rientjes
2010-06-06 22:34 ` [patch 04/18] oom: PF_EXITING check should take mm into account David Rientjes
2010-06-08 20:00 ` Andrew Morton
2010-06-06 22:34 ` [patch 05/18] oom: give current access to memory reserves if it has been killed David Rientjes
2010-06-08 11:41 ` KOSAKI Motohiro
2010-06-08 18:47 ` David Rientjes
2010-06-14 11:08 ` KOSAKI Motohiro
2010-06-08 20:12 ` Andrew Morton
2010-06-13 11:24 ` KOSAKI Motohiro
2010-06-08 20:08 ` Andrew Morton
2010-06-09 0:14 ` David Rientjes
2010-06-06 22:34 ` [patch 06/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
2010-06-08 11:41 ` KOSAKI Motohiro
2010-06-08 18:48 ` David Rientjes
2010-06-08 20:17 ` Andrew Morton
2010-06-08 20:26 ` Oleg Nesterov
2010-06-09 6:32 ` David Rientjes
2010-06-09 16:25 ` Oleg Nesterov
2010-06-09 19:44 ` David Rientjes
2010-06-09 20:14 ` Oleg Nesterov
2010-06-10 0:15 ` KAMEZAWA Hiroyuki
2010-06-10 1:21 ` Oleg Nesterov
2010-06-10 1:43 ` KAMEZAWA Hiroyuki
2010-06-10 1:51 ` Oleg Nesterov
2010-06-06 22:34 ` [patch 07/18] oom: filter tasks not sharing the same cpuset David Rientjes
2010-06-08 11:41 ` KOSAKI Motohiro
2010-06-08 18:51 ` David Rientjes
2010-06-08 19:27 ` Andrew Morton
2010-06-13 11:24 ` KOSAKI Motohiro
2010-07-02 22:35 ` Andrew Morton
2010-07-04 22:08 ` David Rientjes
2010-07-09 3:00 ` KOSAKI Motohiro
2010-06-08 20:23 ` Andrew Morton
2010-06-09 0:25 ` David Rientjes
2010-06-06 22:34 ` [patch 08/18] oom: sacrifice child with highest badness score for parent David Rientjes
2010-06-08 11:41 ` KOSAKI Motohiro
2010-06-08 18:53 ` David Rientjes
2010-06-08 20:33 ` Andrew Morton
2010-06-09 0:30 ` David Rientjes
2010-06-06 22:34 ` [patch 09/18] oom: select task from tasklist for mempolicy ooms David Rientjes
2010-06-08 11:41 ` KOSAKI Motohiro
2010-06-08 21:08 ` Andrew Morton
2010-06-08 21:17 ` Oleg Nesterov
2010-06-09 0:46 ` David Rientjes
2010-06-08 23:43 ` Andrew Morton
2010-06-09 0:40 ` David Rientjes
2010-06-06 22:34 ` [patch 10/18] oom: enable oom tasklist dump by default David Rientjes
2010-06-08 11:42 ` KOSAKI Motohiro
2010-06-08 18:56 ` David Rientjes
2010-06-08 21:13 ` Andrew Morton
2010-06-09 0:52 ` David Rientjes
2010-06-06 22:34 ` [patch 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
2010-06-08 11:42 ` KOSAKI Motohiro
2010-06-08 21:19 ` Andrew Morton
2010-06-06 22:34 ` [patch 12/18] oom: extract panic helper function David Rientjes
2010-06-08 11:42 ` KOSAKI Motohiro
2010-06-06 22:34 ` [patch 13/18] oom: remove special handling for pagefault ooms David Rientjes
2010-06-08 11:42 ` KOSAKI Motohiro
2010-06-08 18:57 ` David Rientjes
2010-06-08 21:27 ` Andrew Morton
2010-06-06 22:34 ` [patch 14/18] oom: move sysctl declarations to oom.h David Rientjes
2010-06-08 11:42 ` KOSAKI Motohiro
2010-06-06 22:34 ` [patch 15/18] oom: remove unnecessary code and cleanup David Rientjes
2010-06-06 22:34 ` [patch 16/18] oom: badness heuristic rewrite David Rientjes
2010-06-08 11:41 ` KOSAKI Motohiro
2010-06-08 23:02 ` Andrew Morton
2010-06-13 11:24 ` KOSAKI Motohiro
2010-06-17 5:14 ` David Rientjes
2010-06-21 11:45 ` KOSAKI Motohiro
2010-06-21 20:47 ` David Rientjes
2010-06-30 9:26 ` KOSAKI Motohiro
2010-06-17 5:12 ` David Rientjes
2010-06-21 11:45 ` KOSAKI Motohiro
2010-06-08 22:58 ` Andrew Morton
2010-06-17 5:32 ` David Rientjes
2010-06-06 22:34 ` [patch 17/18] oom: add forkbomb penalty to badness heuristic David Rientjes
2010-06-08 11:41 ` KOSAKI Motohiro
2010-06-08 23:15 ` Andrew Morton
2010-06-06 22:35 ` [patch 18/18] oom: deprecate oom_adj tunable David Rientjes
2010-06-08 11:42 ` KOSAKI Motohiro
2010-06-08 19:00 ` David Rientjes
2010-06-08 23:18 ` Andrew Morton
2010-06-13 11:24 ` KOSAKI Motohiro
2010-06-17 3:36 ` David Rientjes
2010-06-21 11:45 ` KOSAKI Motohiro
2010-06-21 20:54 ` David Rientjes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.DEB.2.00.1006081642370.19582@chino.kir.corp.google.com \
--to=rientjes@google.com \
--cc=akpm@linux-foundation.org \
--cc=balbir@linux.vnet.ibm.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-mm@kvack.org \
--cc=npiggin@suse.de \
--cc=oleg@redhat.com \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox