From: Nick Piggin <npiggin@suse.de>
To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Andrew Morton <akpm@osdl.org>, "David S. Peterson" <dsp@llnl.gov>,
	Paul Jackson <pj@sgi.com>
Subject: [rfc][patch] fixes for several oom killer problems
Date: Mon, 26 Jun 2006 18:20:38 +0200
Message-ID: <20060626162038.GB7573@wotan.suse.de>

Hi,

We have reports of the OOM killer panicking the system even when there
are tasks currently exiting and/or plenty of memory able to be freed.

The main problem is cpuset_excl_nodes_overlap causing an immediate
panic if current is exiting; I haven't yet had confirmation of whether
or not a minimal patch for that is effective.

The minimal patch basically involved ripping out the test completely.
I'd rather have something more comprehensive in mainline, and I spotted
several other issues as well.

---

Fix several OOM killer problems.

Big ones:
- cpuset_excl_nodes_overlap always returns 0 if current is exiting. This
  caused a customer's systems to panic in the OOM killer when processes
  were having trouble getting memory for the final put_user in mm_release,
  even though there were lots of processes to kill. Fix this by making a
  failed cpuset_excl_nodes_overlap check reduce the task's badness rather
  than exclude the task (it may still be pinning memory on this node, or
  memory that this task may use).

- If current *is* exiting, it should actually be allowed to access reserved
  memory rather than OOM kill something else. Can't do this via a straight
  check in page_alloc.c because that would allow multiple tasks to use up
  reserves. Instead cause current to wind up marking itself as TIF_MEMDIE.

- In cpuset_excl_nodes_overlap, return 1 for PF_EXITING tasks. This retains
  parity with !CONFIG_CPUSETS case.
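
To illustrate the selection behaviour described above, here is a minimal
userspace sketch. It is not the kernel code: struct toy_task, the T_*
flags, TOY_WAIT and toy_select are all hypothetical names, badness is
assumed to be precomputed in the score field, and the pid == 1 check is
left out. It only models how an exiting current gets chosen with the
maximum score, while any other releasing task makes the caller wait.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel task flags; not the real values. */
enum {
	T_EXITING = 1 << 0,	/* models PF_EXITING */
	T_DEAD    = 1 << 1,	/* models PF_DEAD */
	T_MEMDIE  = 1 << 2,	/* models TIF_MEMDIE */
	T_KTHREAD = 1 << 3	/* models !p->mm */
};

struct toy_task {
	unsigned int flags;
	unsigned long score;	/* assumed precomputed badness */
};

/* Sentinel: some other task is already releasing memory, so wait. */
static const struct toy_task toy_wait_sentinel = { 0, 0 };
#define TOY_WAIT (&toy_wait_sentinel)

const struct toy_task *toy_select(const struct toy_task *tasks, size_t n,
				  const struct toy_task *cur,
				  unsigned long *ppoints)
{
	const struct toy_task *chosen = NULL;
	size_t i;

	assert(tasks != NULL || n == 0);
	*ppoints = 0;

	for (i = 0; i < n; i++) {
		const struct toy_task *p = &tasks[i];

		/* skip kernel threads */
		if (p->flags & T_KTHREAD)
			continue;

		if (p->flags & (T_EXITING | T_MEMDIE)) {
			/* dead tasks have already released their memory */
			if (p->flags & T_DEAD)
				continue;
			/* an exiting current is picked with maximum score */
			if (p == cur) {
				chosen = p;
				*ppoints = ULONG_MAX;
				continue;
			}
			return TOY_WAIT;
		}

		if (p->score > *ppoints || !chosen) {
			chosen = p;
			*ppoints = p->score;
		}
	}
	return chosen;
}
```

Once current has been chosen with ULONG_MAX, no later task can outscore
it, but a later releasing task still makes the function return TOY_WAIT,
which matches the ordering in the hunk below.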

Little ones:
- PF_SWAPOFF processes cause select_bad_process to return straight away.
  Instead, give them high priority and ensure no parallel OOM kills are
  happening at the same time.

- Killing a task that fails cpuset_excl_nodes_overlap may still free up
  some memory we're allowed to use: kernel allocated memory, or memory
  touched first by other processes or when we were in a different group.
  So have a failed check just reduce the badness of a process.

- Skip kernel threads, rather than having them return 0 from badness.
  Theoretically, badness might truncate all results to 0, thus a kernel
  thread might be picked first, causing an infinite loop.

- Skip PF_DEAD tasks, for similar reasons.

- Print the name of the task that invoked the OOM killer. Could make
  debugging easier.
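
The scoring changes can be sketched in isolation as below. This is a
simplified userspace model, not the real badness(): struct
toy_score_input and toy_badness are hypothetical names, and the actual
function applies further adjustments (including oomkilladj) that are
left out here. It shows the swapoff shortcut, the divide-by-4 for
non-overlapping nodes, and how integer division can truncate a small
score to 0, which is why kernel threads are skipped outright rather
than relying on a 0 score.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified inputs to the score. */
struct toy_score_input {
	unsigned long total_vm;	/* basis for the score */
	bool swapoff;		/* models PF_SWAPOFF */
	bool rawio;		/* models CAP_SYS_RAWIO */
	bool nodes_overlap;	/* models cpuset_excl_nodes_overlap(p) */
};

unsigned long toy_badness(const struct toy_score_input *in)
{
	unsigned long points;

	assert(in != NULL);

	/* swapoff can use up all memory, so it gets the maximum score */
	if (in->swapoff)
		return ULONG_MAX;

	points = in->total_vm;

	/* tasks with raw I/O capability are less attractive victims */
	if (in->rawio)
		points /= 4;

	/* non-overlapping nodes: killing may still help, but less likely */
	if (!in->nodes_overlap)
		points /= 4;

	return points;
}
```

With total_vm = 1600 the score is 1600 when nodes overlap and 400 when
they don't; with total_vm = 3 and no overlap it truncates to 0.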

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/oom_kill.c
===================================================================
--- linux-2.6.orig/mm/oom_kill.c
+++ linux-2.6/mm/oom_kill.c
@@ -57,6 +57,12 @@ unsigned long badness(struct task_struct
 	}
 
 	/*
+	 * swapoff can easily use up all memory, so kill those first.
+	 */
+	if (p->flags & PF_SWAPOFF)
+		return ULONG_MAX;
+
+	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
 	points = mm->total_vm;
@@ -125,6 +131,15 @@ unsigned long badness(struct task_struct
 	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
 		points /= 4;
 
+
+	/*
+	 * If p's nodes don't overlap ours, it may still help to kill p
+	 * because p may have allocated or otherwise mapped memory on
+	 * this node before. However it will be less likely.
+	 */
+	if (!cpuset_excl_nodes_overlap(p))
+		points /= 4;
+
 	/*
 	 * Adjust the score by oomkilladj.
 	 */
@@ -190,25 +205,35 @@ static struct task_struct *select_bad_pr
 		unsigned long points;
 		int releasing;
 
+		/* skip kernel threads */
+		if (!p->mm)
+			continue;
 		/* skip the init task with pid == 1 */
 		if (p->pid == 1)
 			continue;
-		if (p->oomkilladj == OOM_DISABLE)
-			continue;
-		/* If p's nodes don't overlap ours, it won't help to kill p. */
-		if (!cpuset_excl_nodes_overlap(p))
-			continue;
 
 		/*
 		 * This is in the process of releasing memory so for wait it
 		 * to finish before killing some other task by mistake.
+		 *
+		 * However, if p is the current task, we allow the 'kill' to
+		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
+		 * which will allow it to gain access to memory reserves in
+		 * the process of exiting and releasing its resources.
 		 */
 		releasing = test_tsk_thread_flag(p, TIF_MEMDIE) ||
 						p->flags & PF_EXITING;
-		if (releasing && !(p->flags & PF_DEAD))
+		if (releasing) {
+			/* PF_DEAD tasks have already released their mm */
+			if (p->flags & PF_DEAD)
+				continue;
+			if (p == current) {
+				chosen = p;
+				*ppoints = ULONG_MAX;
+				continue;
+			}
 			return ERR_PTR(-1UL);
-		if (p->flags & PF_SWAPOFF)
-			return p;
+		}
 
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -216,6 +241,7 @@ static struct task_struct *select_bad_pr
 			*ppoints = points;
 		}
 	} while_each_thread(g, p);
+
 	return chosen;
 }
 
@@ -319,8 +345,8 @@ void out_of_memory(struct zonelist *zone
 	unsigned long points = 0;
 
 	if (printk_ratelimit()) {
-		printk("oom-killer: gfp_mask=0x%x, order=%d\n",
-			gfp_mask, order);
+		printk(KERN_WARNING "%s invoked oom-killer: "
+			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n", current->comm, gfp_mask, order, current->oomkilladj);
 		dump_stack();
 		show_mem();
 	}
Index: linux-2.6/kernel/cpuset.c
===================================================================
--- linux-2.6.orig/kernel/cpuset.c
+++ linux-2.6/kernel/cpuset.c
@@ -2362,7 +2362,7 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node
 int cpuset_excl_nodes_overlap(const struct task_struct *p)
 {
 	const struct cpuset *cs1, *cs2;	/* my and p's cpuset ancestors */
-	int overlap = 0;		/* do cpusets overlap? */
+	int overlap = 1;		/* do cpusets overlap? */
 
 	task_lock(current);
 	if (current->flags & PF_EXITING) {
