[RFC][-mm][PATCH 6/6] oom-killer: rewrite badness

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"kosaki.motohiro@jp.fujitsu.com" <kosaki.motohiro@jp.fujitsu.com>,
	aarcange@redhat.com, akpm@linux-foundation.org,
	minchan.kim@gmail.com, rientjes@google.com,
	vedran.furac@gmail.com,
	"hugh.dickins@tiscali.co.uk" <hugh.dickins@tiscali.co.uk>
Subject: [RFC][-mm][PATCH 6/6] oom-killer: rewrite badness
Date: Mon, 2 Nov 2009 16:30:23 +0900
Message-ID: <20091102163023.4f5c7282.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20091102162244.9425e49b.kamezawa.hiroyu@jp.fujitsu.com>

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Rewrite the __badness() heuristics.
We now have much more useful information for computing badness, so use it.
This patch also tones down the overly strong cputime and runtime bonuses.

 - Use "constraint" to select the base value.
   CPUSET: RSS tends to be unbalanced between nodes, and we don't have a
           per-node RSS value, so use total_vm instead.
   LOWMEM: we need to kill a process which has low_rss (lowmem RSS).
   MEMCG, NONE: use RSS+SWAP as the base value.

 - Runtime bonus.
   The runtime bonus is 0.1% per sec of each base value, up to 50%.
   For NONE/MEMCG, total_vm-shared_vm is used here to take the requested
   amount of memory into account. This may be bigger than the base value.

 - cputime bonus
   Removed.

 - Last expansion bonus
   If the last mmap() call which expanded hiwat_total_vm was far in the
   past, the process gets a bonus: 0.1% per sec, up to 25%.

 - The nice bonus was removed. (We have oom_adj, and ROOT is checked.)

 - Import code from KOSAKI's patch which coalesces the capability checks.
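
For illustration only, the scoring described above can be modelled in
userspace roughly as follows. This is a sketch, not the kernel code: the
struct task_model type and the model_badness() and scaled_adjust() helpers
are hypothetical names, all sizes are assumed to be in pages, and signed
arithmetic is used throughout. It follows the constants in the diff (a
divisor of 10000, caps of 5000 for runtime and 2500 for the last expansion),
so each second shifts the adjustment by 1/10000 of the base.

```c
#include <assert.h>

/* Hypothetical userspace model of the heuristic; not kernel code. */
enum oom_constraint {
	CONSTRAINT_NONE,
	CONSTRAINT_CPUSET,
	CONSTRAINT_LOWMEM,
	CONSTRAINT_MEMCG,
};

struct task_model {
	long total_vm;		/* pages */
	long shared_vm;		/* pages */
	long anon_rss;		/* pages */
	long swap_usage;	/* pages */
	long low_rss;		/* pages */
	long runtime_sec;	/* seconds since the task started */
	long quiet_sec;		/* seconds since the last vm expansion */
};

/*
 * Map elapsed seconds to an adjustment in [-limit, +limit], in units of
 * 1/10000 of the base. Positive is a penalty, negative is a bonus.
 */
static long scaled_adjust(long elapsed_sec, long limit)
{
	long v = limit - elapsed_sec;

	return v < -limit ? -limit : v;
}

static long model_badness(const struct task_model *t, enum oom_constraint c)
{
	long points, base, penalty;

	switch (c) {
	case CONSTRAINT_CPUSET:
		points = t->total_vm;
		base = t->total_vm;
		break;
	case CONSTRAINT_LOWMEM:
		points = t->low_rss;
		base = points;
		break;
	case CONSTRAINT_MEMCG:
	case CONSTRAINT_NONE:
		points = t->anon_rss + t->swap_usage;
		base = t->total_vm - t->shared_vm;
		break;
	default:
		assert(0);	/* stands in for BUG() */
		return 0;
	}

	/* runtime: up to 50% penalty (young) or 50% bonus (old) */
	penalty = base * scaled_adjust(t->runtime_sec, 5000) / 10000;
	/* last expansion: up to 25% penalty (recent) or 25% bonus (quiet) */
	penalty += base * scaled_adjust(t->quiet_sec, 2500) / 10000;

	if (penalty < 0 && -penalty > points)
		return 0;
	return points + penalty;
}
```

In this model a freshly started task is charged the full 75% of its base
on top of its points, while an old and quiet task can have its score reduced
by up to 75% of its base, down to zero.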

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/oom_kill.c |  124 +++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 84 insertions(+), 40 deletions(-)

Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -77,12 +77,10 @@ static unsigned long __badness(struct ta
 		      unsigned long uptime, enum oom_constraint constraint,
 		      struct mem_cgroup *mem)
 {
-	unsigned long points, cpu_time, run_time;
+	unsigned long points;
+	long runtime, quiet_time, penalty;
 	struct mm_struct *mm;
 	int oom_adj = p->signal->oom_adj;
-	struct task_cputime task_time;
-	unsigned long utime;
-	unsigned long stime;
 
 	if (oom_adj == OOM_DISABLE)
 		return 0;
@@ -93,11 +91,28 @@ static unsigned long __badness(struct ta
 		task_unlock(p);
 		return 0;
 	}
-
-	/*
-	 * The memory size of the process is the basis for the badness.
-	 */
-	points = get_mm_rss(mm);
+	switch (constraint) {
+	case CONSTRAINT_CPUSET:
+		/*
+		 * The size of RSS/SWAP is highly affected by the cpuset's
+		 * configuration rather than by the result of memory reclaim,
+		 * so we use the VM size here instead of RSS.
+		 * (We don't have per-node RSS counting for now.)
+		 */
+		points = mm->total_vm;
+		break;
+	case CONSTRAINT_LOWMEM:
+		points = get_mm_counter(mm, low_rss);
+		break;
+	case CONSTRAINT_MEMCG:
+	case CONSTRAINT_NONE:
+		points = get_mm_counter(mm, anon_rss);
+		points += get_mm_counter(mm, swap_usage);
+		break;
+	default: /* mempolicy will not come here */
+		BUG();
+		break;
+	}
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -109,53 +124,82 @@ static unsigned long __badness(struct ta
 	 */
 	if (p->flags & PF_OOM_ORIGIN)
 		return ULONG_MAX;
-
 	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
-	 */
-	thread_group_cputime(p, &task_time);
-	utime = cputime_to_jiffies(task_time.utime);
-	stime = cputime_to_jiffies(task_time.stime);
-	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
+	 * Check the process's behavior and VM activity, and give a bonus
+	 * or penalty.
+	 */
+	runtime = uptime - p->start_time.tv_sec;
+	penalty = 0;
+	/*
+	 * At OOM, younger processes tend to be the bad ones, and there is
+	 * no good reason to kill a process which worked well before OOM.
+	 * This adds a short-runtime penalty of at most 50% of its vm size,
+	 * and a long-running process gets a bonus of up to 50% of it.
+	 * If a process runs 1sec, it gets 0.1% bonus.
+	 *
+	 * We just check run_time here.
+	 */
+	runtime = 5000 - runtime;
+	if (runtime < -5000)
+		runtime = -5000;
+	switch (constraint) {
+	case CONSTRAINT_LOWMEM:
+		/* For a LOWMEM OOM, looking at total_vm is wrong */
+		penalty = (long)points * runtime / 10000;
+		break;
+	case CONSTRAINT_CPUSET:
+		penalty = (long)mm->total_vm * runtime / 10000;
+		break;
+	default:
+		/* use total_vm - shared size as base of bonus */
+		penalty = (long)(mm->total_vm - mm->shared_vm) * runtime / 10000;
+		break;
+	}
 
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
+	if (likely(jiffies > mm->last_vm_expansion))
+		quiet_time = jiffies - mm->last_vm_expansion;
 	else
-		run_time = 0;
-
-	if (cpu_time)
-		points /= int_sqrt(cpu_time);
-	if (run_time)
-		points /= int_sqrt(int_sqrt(run_time));
+		quiet_time = ULONG_MAX - mm->last_vm_expansion + jiffies;
 
+	quiet_time = jiffies_to_msecs(quiet_time)/1000;
 	/*
-	 * Niced processes are most likely less important, so double
-	 * their badness points.
+	 * If a process recently expanded its (highest) vm size, it gets a
+	 * penalty. This is for catching slow memory leakers. The 25% cap is
+	 * half of the runtime penalty.
 	 */
-	if (task_nice(p) > 0)
-		points *= 2;
+	quiet_time = 2500 - quiet_time;
+	if (quiet_time < -2500)
+		quiet_time = -2500;
+
+	switch (constraint) {
+	case CONSTRAINT_LOWMEM:
+		/* For a LOWMEM OOM, looking at total_vm is wrong */
+		penalty += (long)points * quiet_time / 10000;
+		break;
+	case CONSTRAINT_CPUSET:
+		penalty += (long)mm->total_vm * quiet_time / 10000;
+		break;
+	default:
+		penalty += (long)(mm->total_vm - mm->shared_vm) * quiet_time / 10000;
+		break;
+	}
+	/*
+	 * If an old process was quiet, it gets a bonus of 75% at maximum.
+	 */
+	if ((penalty < 0) && (-penalty > points))
+		return 0;
+	points += penalty;
 
 	/*
 	 * Superuser processes are usually more important, so we make it
 	 * less likely that we kill those.
 	 */
 	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+	    has_capability_noaudit(p, CAP_SYS_RAWIO) ||
 	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
 		points /= 4;
 
 	/*
-	 * We don't want to kill a process with direct hardware access.
-	 * Not only could that mess up the hardware, but usually users
-	 * tend to only have this flag set on applications they think
-	 * of as important.
-	 */
-	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
-		points /= 4;
-
-	/*
 	 * If p's nodes don't overlap ours, it may still help to kill p
 	 * because p may have allocated or otherwise mapped memory on
 	 * this node before. However it will be less likely.
