[PATCH] OOM handling - Martin Dalecki


From: Martin Dalecki <dalecki@evision-ventures.com>
To: Alan Cox <alan@lxorguk.ukuu.org.uk>,
	"James A. Sutherland" <jas88@cam.ac.uk>,
	Guest section DW <dwguest@win.tue.nl>,
	Rik van Riel <riel@conectiva.com.br>,
	Patrick O'Rourke <orourke@missioncriticallinux.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] OOM handling
Date: Sun, 25 Mar 2001 15:54:46 +0200
Message-ID: <3ABDF8A6.7580BD7D@evision-ventures.com>
In-Reply-To: <3ABB9CF2.E7715667@evision-ventures.com>

[-- Attachment #1: Type: text/plain, Size: 2087 bytes --]

Martin Dalecki wrote:
> 
> I have a constructive proposal:
> 
> It would make a lot of sense for the OOM killer to leave alone not just
> root processes but also processes belonging to a UID lower than a
> certain value (500). This would be:
> 
> 1. Easily manageable by the admin. Just give oracle/www and analogous
>    users a UID lower than, let's say, 500.
> 
> 2. In full analogy with the privileged-port trick in TCP/IP (ports < 1024
>    versus the others).
> 
> 3. Free of any new interface (no more crap added to /proc).
> 
> 4. Really simple to implement, document, and understand.
> 
> 5. Similar to the way Solaris does such things.
> 
> ...
> 
> Damn: I will leave my chess club alone today and just code it up
> NOW.
> 
> Spec:
> 
> 1. Processes with a UID < 100 are immune to OOM killers.
> 2. Processes with a UID >= 100 && < 500 are hard for the OOM killer to
> take on.
> 3. Processes with a UID >= 500 are easy targets.
> 
> Let me introduce a new terminology in full analogy to "firewalls",
> routers and thereabouts:
> 
> Processes of category 1 are called captains (oficerzy)
> Processes of category 2 are called corporals (porucznicy)
> Processes of category 3 are called privates (żołnierze)

OK, I just did it. As I already said, I have "stress tested" it by
installing the Oracle Internet Application Server suite
on a hopelessly underequipped box ("only" 128 MB of RAM).
The corresponding patch is attached.
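
To make the quoted spec concrete, here is a minimal user-space sketch of the three UID tiers. It is not the attached kernel code: the names uid_tier and classify_uid are made up for illustration, and taking the larger of the real and effective UID is an assumption borrowed from how the patch treats suexec-style processes.

```c
/* Hypothetical user-space mirror of the three tiers in the quoted spec. */
enum uid_tier { TIER_NORMAL, TIER_SERVICE, TIER_SYSTEM };

/* Thresholds from the spec: < 100 system, < 500 service, otherwise normal. */
#define SYSTEM_UID_LIMIT  100u
#define SERVICE_UID_LIMIT 500u

/*
 * Classify a process by the larger of its real and effective UID, so that
 * a process which dropped from root to an ordinary user counts as that
 * ordinary user (assumption: same rule as in the attached patch).
 */
static enum uid_tier classify_uid(unsigned int uid, unsigned int euid)
{
	unsigned int effective = (euid > uid) ? euid : uid;

	if (effective < SYSTEM_UID_LIMIT)
		return TIER_SYSTEM;
	if (effective < SERVICE_UID_LIMIT)
		return TIER_SERVICE;
	return TIER_NORMAL;
}
```

With these thresholds an oracle user with UID 200 lands in the service tier, while an ordinary login with UID 1000 lands in the normal tier and is targeted first.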

Although I'm one day late on my promise to provide this
patch, it's actually fascinating that 4 people already (and, in
particular, not newbies requesting a new /proc entry for everything)
have asked for reassurance that I will indeed implement it... Well,
this kind of eager feedback seems to me to indicate
that there is a very serious desire for it. And then of course I
just had to ask our people working with DBs here at work as well :-).

Ah... and of course I think this patch can already go directly
into the official kernel. The quality of the code should permit
it. I would especially ask Rik van Riel to have a closer look
at it...
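
For anyone who wants the gist before opening the attachment, the selection order can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: struct fake_task and pick_victim are hypothetical stand-ins for the kernel's task list and selection loop, and the penalty is taken as a precomputed number.

```c
#include <stddef.h>

/* Tiers in the order they are searched: normal users first, system last. */
enum tier { NORMAL, SERVICE, SYSTEM, TIER_COUNT };

/* Hypothetical stand-in for a task: only the fields the sketch needs. */
struct fake_task {
	int pid;
	enum tier tier;
	int penalty;
};

/*
 * Walk the tiers from most to least expendable and return the task with
 * the highest penalty in the first tier that has a candidate. As in the
 * patch, pid 0 is skipped and a penalty of 0 never wins.
 */
static const struct fake_task *pick_victim(const struct fake_task *tasks,
					    size_t n)
{
	for (int t = NORMAL; t < TIER_COUNT; ++t) {
		const struct fake_task *best = NULL;
		int max_penalty = 0;

		for (size_t i = 0; i < n; ++i) {
			if (tasks[i].pid == 0 || tasks[i].tier != (enum tier)t)
				continue;
			if (tasks[i].penalty > max_penalty) {
				best = &tasks[i];
				max_penalty = tasks[i].penalty;
			}
		}
		if (best != NULL)
			return best;
	}
	return NULL;
}
```

The point of the tiering is that a service process with a very high penalty is still safe as long as any normal-user process with a nonzero penalty exists.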

[-- Attachment #2: oom.diff --]
[-- Type: text/plain, Size: 11110 bytes --]

diff -urN linux/mm/oom_kill.c linux-new/mm/oom_kill.c
--- linux/mm/oom_kill.c	Tue Nov 14 19:56:46 2000
+++ linux-new/mm/oom_kill.c	Sun Mar 25 17:17:34 2001
@@ -1,18 +1,64 @@
 /*
  *  linux/mm/oom_kill.c
- * 
+ *
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
  *
- *  The routines in this file are used to kill a process when
- *  we're seriously out of memory. This gets called from kswapd()
- *  in linux/mm/vmscan.c when we really run out of memory.
- *
- *  Since we won't call these routines often (on a well-configured
- *  machine) this file will double as a 'coding guide' and a signpost
- *  for newbie kernel hackers. It features several pointers to major
- *  kernel subsystems and hints as to where to find out what things do.
+ *  Sat Mar 24 22:07:15 CET 2001 Marcin Dalecki <dalecki@evision-ventures.com>:
+ *
+ *	Replaced the original algorithm with something reasonable, predictable
+ *	and manageable. I will call this "Stalin's Eviction".
+ */
+
+/*
+ *  The routines in this file are used to kill a process when the system is
+ *  entirely out of memory (both RAM and swap).  This gets called from
+ *  kswapd() in linux/mm/vmscan.c when we are in total starvation, i.e. when
+ *  the only thing the system is busy with is trying to allocate a physical
+ *  memory page while there are no pages left. In such a situation it
+ *  does make perfect sense to kill some offending process, just to let the
+ *  system go on and survive.
+ *
+ *  IT IS A LAST RESORT!
+ *
+ *  ALERT: Contrary to popular belief, the mechanism presented here IS
+ *  IMPORTANT for system security reasons. It closes off one corner case
+ *  of an easy DoS attack in case the sysadmin didn't take other measures,
+ *  which, overworked or incompetent as he usually is, he typically
+ *  doesn't.
+ *
+ *  Basically the eviction goes on as follows:
+ *
+ *  1. Normal interactive user processes are the first candidates for a shoot.
+ *  We consider all users with a UID >= 500 as normal interactive users.
+ *
+ *  2. If there are no processes started by a normal interactive user, we aim
+ *  at processes that are not essential for the "life" of the system as a
+ *  whole.  We consider users with a UID >= 100 and < 500 as essential
+ *  service users.
+ *
+ *  3. If there are none of those either, we start to shut down the system
+ *  components piece by piece... (UID < 100).
+ *
+ *  In fact the heuristics used to determine which of the process classes
+ *  to aim at first are a bit more sophisticated. If you want those details,
+ *  please read the code below. It does (hopefully) speak for itself.
+ *
+ *  As an example: If you are running a big Linux box, which is mainly deployed
+ *  as an oracle server, but where normal interactive human users can log on as
+ *  well, then you should run the oracle server with a UID < 500 and >= 100.
+ *  Then dumb-ass losers starting 100 netscape and 500 emacs sessions won't be
+ *  able to kill the essential oracle service any longer.
+ *
+ *  The introduction of these additional UID semantics shouldn't affect any
+ *  present systems. (Read: It won't make anything worse in comparison to
+ *  previous versions of the Linux kernel.) However every single distributor of
+ *  "enterprise grade" applications for Linux SHOULD take note of this.
+ *
+ *  regards:
+ *
+ *		Marcin Dalecki
  */
 
 #include <linux/mm.h>
@@ -23,125 +69,141 @@
 
 /* #define DEBUG */
 
-/**
- * int_sqrt - oom_kill.c internal function, rough approximation to sqrt
- * @x: integer of which to calculate the sqrt
- * 
- * A very rough approximation to the sqrt() function.
- */
-static unsigned int int_sqrt(unsigned int x)
-{
-	unsigned int out = x;
-	while (x & ~(unsigned int)1) x >>=2, out >>=1;
-	if (x) out -= out >> 2;
-	return (out ? out : 1);
-}	
-
-/**
- * oom_badness - calculate a numeric value for how bad this task has been
- * @p: task struct of which task we should calculate
- *
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
- *
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- *    algorithm has been meticulously tuned to meet the priniciple
- *    of least surprise ... (be careful when you change it)
- */
+#define CPU_FACTOR 32
+#define AGE_FACTOR 256
 
-static int badness(struct task_struct *p)
+enum uid_class {
+	normal,
+	service,
+	system,
+	immune
+};
+
+static int determine_uid_class(struct task_struct *p)
 {
-	int points, cpu_time, run_time;
+	int uid;
+	int uid_class = system;
 
-	if (!p->mm)
-		return 0;
-	/*
-	 * The memory size of the process is the basis for the badness.
+	/* This makes processes started by, for example, suexec better killing
+	 * candidates than root's own processes.
 	 */
-	points = p->mm->total_vm;
+	uid = p->uid;
+	if (p->euid > p->uid)
+		uid = p->euid;
 
-	/*
-	 * CPU time is in seconds and run time is in minutes. There is no
-	 * particular reason for this other than that it turned out to work
-	 * very well in practice. This is not safe against jiffie wraps
-	 * but we don't care _that_ much...
+	/* This implements the intended semantics of the different user id
+	 * value ranges.
 	 */
-	cpu_time = (p->times.tms_utime + p->times.tms_stime) >> (SHIFT_HZ + 3);
-	run_time = (jiffies - p->start_time) >> (SHIFT_HZ + 10);
+	if (uid < 100)
+		uid_class = system;
+	else if (uid < 500)
+		uid_class = service;
+	else
+		uid_class = normal;
 
-	points /= int_sqrt(cpu_time);
-	points /= int_sqrt(int_sqrt(run_time));
-
-	/*
-	 * Niced processes are most likely less important, so double
-	 * their badness points.
-	 */
-	if (p->nice > 0)
-		points *= 2;
 
-	/*
-	 * Superuser processes are usually more important, so we make it
+	/* Superuser processes are usually more important, so we make it
 	 * less likely that we kill those.
 	 */
-	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
-				p->uid == 0 || p->euid == 0)
-		points /= 4;
+	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN))
+		uid_class = system;
 
-	/*
-	 * We don't want to kill a process with direct hardware access.
+	/* We don't want to kill a process with direct hardware access.
 	 * Not only could that mess up the hardware, but usually users
 	 * tend to only have this flag set on applications they think
 	 * of as important.
 	 */
 	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
-		points /= 4;
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %d points\n",
-	p->pid, p->comm, points);
-#endif
-	return points;
+		uid_class = system;
+
+	return uid_class;
+}
+
+static int calculate_penalty(struct task_struct *p)
+{
+	int cpu_penalty = 0;
+	int age_penalty = 0;
+
+
+	/* Now we calculate the penalty due to the cpu usage.  NOTE: This is
+	 * not safe against jiffie wraps.
+	 */
+	{
+		int run_time = (jiffies - p->start_time) >> (SHIFT_HZ + 10);
+
+		if (run_time > 0) {
+			cpu_penalty = (CPU_FACTOR * run_time) /
+				(((p->times.tms_utime + p->times.tms_stime) >> (SHIFT_HZ + 3)) + run_time);
+		} else
+			cpu_penalty = CPU_FACTOR;
+	}
+
+	/* Let's make older processes more important than newer ones.
+	 * This is not safe against jiffie wraps, deliberately so.
+	 */
+	if (p->start_time > 0)
+		age_penalty = AGE_FACTOR * p->start_time / jiffies;
+	else
+		age_penalty = 0;
+
+	/* OK, this should be sufficient; we don't want to make things more
+	 * complicated than needed. In particular, since there is no easy and
+	 * portable way to determine the total number of memory pages present,
+	 * we don't take this into account here.
+	 *
+	 * Let us worry about more detailed heuristics here only if many
+	 * people still report serious problems on linux-kernel.
+	 */
+
+	return cpu_penalty + age_penalty;
 }
 
 /*
- * Simple selection loop. We chose the process with the highest
- * number of 'points'. We need the locks to make sure that the
- * list of task structs doesn't change while we look the other way.
- *
- * (not docbooked, we don't want this one cluttering up the manual)
+ * Simple selection loop. We choose the process with the highest penalty.
  */
-static struct task_struct * select_bad_process(void)
+static struct task_struct * select_process(void)
 {
-	int maxpoints = 0;
-	struct task_struct *p = NULL;
-	struct task_struct *chosen = NULL;
-
-	read_lock(&tasklist_lock);
-	for_each_task(p) {
-		if (p->pid) {
-			int points = badness(p);
-			if (points > maxpoints) {
-				chosen = p;
-				maxpoints = points;
+	enum uid_class i;
+	struct task_struct *choice = NULL;
+
+	for (i = normal; i != immune; ++i) {
+		int maxpenalty = 0;
+		struct task_struct *p = NULL;
+
+		/* The locks make sure that the list of task structs doesn't
+		 * change while we look at it.
+		 */
+
+		read_lock(&tasklist_lock);
+		for_each_task(p) {
+			if (!p->mm)
+				continue;
+
+			if (i != determine_uid_class(p))
+				continue;
+
+			if (p->pid) {
+				int penalty = calculate_penalty(p);
+
+				if (penalty > maxpenalty) {
+					choice = p;
+					maxpenalty = penalty;
+				}
 			}
 		}
+		read_unlock(&tasklist_lock);
+
+		if (choice != NULL)
+			break;
 	}
-	read_unlock(&tasklist_lock);
-	return chosen;
+
+	return choice;
 }
 
-/**
- * oom_kill - kill the "best" process when we run out of memory
- *
+/*
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
- * OR try to be smart about which process to kill. Note that we
- * don't have to be perfect here, we just have to be good.
+ * OR try to be smart about which process to kill.
  *
  * We must be careful though to never send SIGKILL a process with
  * CAP_SYS_RAW_IO set, send SIGTERM instead (but it's unlikely that
@@ -149,14 +211,12 @@
  */
 void oom_kill(void)
 {
+	struct task_struct *p = select_process();
 
-	struct task_struct *p = select_bad_process();
-
-	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (p == NULL)
 		panic("Out of memory and no killable processes...\n");
 
-	printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm);
+	printk(KERN_ERR "Out of memory: killed process %d (%s).\n", p->pid, p->comm);
 
 	/*
 	 * We give our sacrificial lamb high priority and access to
@@ -180,14 +240,14 @@
 	 */
 	current->policy |= SCHED_YIELD;
 	schedule();
+
 	return;
 }
 
-/**
- * out_of_memory - is the system out of memory?
+/** out_of_memory - is the system out of memory?
  *
- * Returns 0 if there is still enough memory left,
- * 1 when we are out of memory (otherwise).
+ * Returns 0 if there is still enough memory left, 1 when we are out of memory
+ * (otherwise).
  */
 int out_of_memory(void)
 {
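
A note on the arithmetic in calculate_penalty(): in C, + binds tighter than >>, so an expression written as "x >> s + r" shifts by (s + r) rather than adding r to the shifted value. The user-space sketch below shows both groupings and a penalty calculation with explicit parentheses. It is an illustration only: the function names are hypothetical, and the times are passed in as already-scaled plain numbers instead of being read from a task struct.

```c
#define CPU_FACTOR 32
#define AGE_FACTOR 256

/* (x >> s) + r: shift first, then add. */
static unsigned long shift_then_add(unsigned long x, unsigned int s,
				    unsigned long r)
{
	return (x >> s) + r;
}

/* x >> (s + r): how C groups "x >> s + r" when no parentheses are given. */
static unsigned long shift_by_sum(unsigned long x, unsigned int s,
				  unsigned long r)
{
	return x >> (s + r);
}

/*
 * Penalty sketch with explicit grouping. cpu_time and run_time are assumed
 * to be pre-scaled values; start_time and now are in the same time unit.
 * Processes that used little CPU relative to their lifetime, and processes
 * that started recently, get the higher penalty.
 */
static int penalty_sketch(unsigned long cpu_time, unsigned long run_time,
			  unsigned long start_time, unsigned long now)
{
	int cpu_penalty = CPU_FACTOR;
	int age_penalty = 0;

	if (run_time > 0)
		cpu_penalty = (int)((CPU_FACTOR * run_time) /
				    (cpu_time + run_time));
	if (start_time > 0 && now > 0)
		age_penalty = (int)((AGE_FACTOR * start_time) / now);

	return cpu_penalty + age_penalty;
}
```

With the sum in the denominator, the division cannot hit zero whenever run_time is positive, which is not guaranteed if the whole sum ends up in the shift count.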
