linux-mm.kvack.org archive mirror
* [patch 1/4] oom: extract deadlock helper function
@ 2007-06-27 14:44 David Rientjes
  2007-06-27 14:44 ` [patch 2/4] oom: select process to kill for cpusets David Rientjes
  2007-07-26  6:15 ` [patch 1/4] oom: extract deadlock helper function David Rientjes
  0 siblings, 2 replies; 25+ messages in thread
From: David Rientjes @ 2007-06-27 14:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Extracts the jiffies comparison, the update of the last_tif_memdie
timestamp, and the diagnostic message into a helper function.

Cc: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   27 +++++++++++++++++++++------
 1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -28,6 +28,8 @@
 int sysctl_panic_on_oom;
 /* #define DEBUG */
 
+#define OOM_DEADLOCK_TIMEOUT	(10*HZ)
+
 unsigned long VM_is_OOM;
 static unsigned long last_tif_memdie_jiffies;
 
@@ -365,6 +367,22 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/*
+ * Returns 1 if the OOM killer is deadlocked, meaning more than
+ * OOM_DEADLOCK_TIMEOUT has elapsed since the last task was set
+ * TIF_MEMDIE.  If deadlocked, *last_tif_memdie is updated to the
+ * current jiffies to arm the next timeout.  Otherwise, returns 0.
+ */
+static int oom_is_deadlocked(unsigned long *last_tif_memdie)
+{
+	if (unlikely(time_before(jiffies, *last_tif_memdie +
+					  OOM_DEADLOCK_TIMEOUT)))
+		return 0;
+	*last_tif_memdie = jiffies;
+	printk("detected probable OOM deadlock, so killing another task\n");
+	return 1;
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  *
@@ -421,12 +439,9 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 		 * so it's equivalent to write_lock_irq(tasklist_lock) as
 		 * far as VM_is_OOM is concerned.
 		 */
-		if (unlikely(test_bit(0, &VM_is_OOM))) {
-			if (time_before(jiffies, last_tif_memdie_jiffies + 10*HZ))
-				goto out;
-			printk("detected probable OOM deadlock, so killing another task\n");
-			last_tif_memdie_jiffies = jiffies;
-		}
+		if (unlikely(test_bit(0, &VM_is_OOM)) &&
+		    !oom_is_deadlocked(&last_tif_memdie_jiffies))
+			goto out;
 
 		if (sysctl_panic_on_oom) {
 			read_unlock(&tasklist_lock);
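For illustration, the wraparound-safe timeout test that oom_is_deadlocked()
performs can be modeled in userspace.  HZ, the simulated jiffies values, and
the local time_before() are stand-ins reproduced here, not taken from kernel
headers:

```c
#include <assert.h>

#define HZ 250UL                         /* illustrative tick rate */
#define OOM_DEADLOCK_TIMEOUT (10 * HZ)

/* Kernel-style wraparound-safe "a is before b" for unsigned counters. */
static int time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Mirrors oom_is_deadlocked(): report a deadlock and refresh the stamp
 * only once OOM_DEADLOCK_TIMEOUT has elapsed since *last. */
static int deadlock_timed_out(unsigned long now, unsigned long *last)
{
	if (time_before(now, *last + OOM_DEADLOCK_TIMEOUT))
		return 0;
	*last = now;
	return 1;
}
```

The cast-to-signed subtraction is what keeps the comparison correct even
when the jiffies counter wraps past ULONG_MAX.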

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 2/4] oom: select process to kill for cpusets
  2007-06-27 14:44 [patch 1/4] oom: extract deadlock helper function David Rientjes
@ 2007-06-27 14:44 ` David Rientjes
  2007-06-27 14:44   ` [patch 3/4] oom: extract select helper function David Rientjes
  2007-06-27 21:52   ` [patch 2/4] oom: select process to kill for cpusets Christoph Lameter
  2007-07-26  6:15 ` [patch 1/4] oom: extract deadlock helper function David Rientjes
  1 sibling, 2 replies; 25+ messages in thread
From: David Rientjes @ 2007-06-27 14:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

Passes the memory allocation constraint into select_bad_process() so
that, in the CONSTRAINT_CPUSET case, we can exclude tasks that do not
overlap nodes with the triggering task's cpuset.

The OOM killer now invokes select_bad_process() even in the cpuset case
to select a rogue task to kill instead of simply using current.  Although
killing current is guaranteed to help alleviate the OOM condition, it is
by no means guaranteed to be the "best" process to kill.  The
select_bad_process() heuristics will do a much better job of determining
that.

As an added bonus, this also addresses an issue where current could have
OOM_DISABLE set yet not be respected in the CONSTRAINT_CPUSET case.
Currently we loop back out to __alloc_pages(), waiting for another cpuset
task to trigger the OOM killer and hoping that it won't be OOM_DISABLE'd.
With this patch, we're guaranteed to find a non-OOM_DISABLE'd task to kill
the first time, provided one matches our eligibility requirements.

If we cannot find any tasks to kill in the cpuset case, we simply make
the entire OOM killer a no-op since it's better for one cpuset to fail
memory allocations repeatedly instead of panicking the entire system.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   25 +++++++++++++++++--------
 1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -187,9 +187,13 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
  *
+ * If constraint is CONSTRAINT_CPUSET, then only choose a task that overlaps
+ * the nodes of the task that triggered the OOM killer.
+ *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned long *ppoints)
+static struct task_struct *select_bad_process(unsigned long *ppoints,
+					      int constraint)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
@@ -221,6 +225,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints)
 
 		if (p->oomkilladj == OOM_DISABLE)
 			continue;
+		if (constraint == CONSTRAINT_CPUSET &&
+		    !cpuset_excl_nodes_overlap(p))
+			continue;
 
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -423,12 +430,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 		break;
 
 	case CONSTRAINT_CPUSET:
-		read_lock(&tasklist_lock);
-		oom_kill_process(current, points,
-				 "No available memory in cpuset", gfp_mask, order);
-		read_unlock(&tasklist_lock);
-		break;
-
 	case CONSTRAINT_NONE:
 		if (down_trylock(&OOM_lock))
 			break;
@@ -453,9 +454,17 @@ retry:
 		 * Rambo mode: Shoot down a process and hope it solves whatever
 		 * issues we may have.
 		 */
-		p = select_bad_process(&points);
+		p = select_bad_process(&points, constraint);
 		/* Found nothing?!?! Either we hang forever, or we panic. */
 		if (unlikely(!p)) {
+			/*
+			 * We shouldn't panic the entire system if we can't
+			 * find any eligible tasks to kill in a
+			 * cpuset-constrained OOM condition.  Instead, we do
+			 * nothing and allow other cpusets to continue.
+			 */
+			if (constraint == CONSTRAINT_CPUSET)
+				goto out;
 			read_unlock(&tasklist_lock);
 			cpuset_unlock();
 			panic("Out of memory and no killable processes...\n");
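A minimal userspace model of the selection loop with the new constraint
filter follows; the task fields and eligibility flags are hypothetical
stand-ins for the kernel's badness() and cpuset_excl_nodes_overlap()
machinery:

```c
#include <assert.h>
#include <stddef.h>

struct task {
	unsigned long points;   /* badness() score analogue */
	int oom_disabled;       /* OOM_DISABLE analogue */
	int overlaps_cpuset;    /* cpuset_excl_nodes_overlap() analogue */
};

enum constraint { CONSTRAINT_NONE, CONSTRAINT_CPUSET };

/* Skip ineligible tasks, pick the highest score; NULL means no eligible
 * task, which the caller must tolerate in the cpuset case. */
static struct task *select_bad(struct task *tasks, size_t n,
			       enum constraint c)
{
	struct task *chosen = NULL;

	for (size_t i = 0; i < n; i++) {
		struct task *p = &tasks[i];

		if (p->oom_disabled)
			continue;
		if (c == CONSTRAINT_CPUSET && !p->overlaps_cpuset)
			continue;
		if (!chosen || p->points > chosen->points)
			chosen = p;
	}
	return chosen;
}
```

With CONSTRAINT_CPUSET, a task with a higher badness score but no node
overlap is passed over in favor of a lower-scoring task inside the cpuset.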


* [patch 3/4] oom: extract select helper function
  2007-06-27 14:44 ` [patch 2/4] oom: select process to kill for cpusets David Rientjes
@ 2007-06-27 14:44   ` David Rientjes
  2007-06-27 14:44     ` [patch 4/4] oom: serialize for cpusets David Rientjes
  2007-06-27 21:52   ` [patch 2/4] oom: select process to kill for cpusets Christoph Lameter
  1 sibling, 1 reply; 25+ messages in thread
From: David Rientjes @ 2007-06-27 14:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

Extracts the call to select_bad_process(), the check for a NULL return
value, and the call to oom_kill_process() into a helper function.  This
will be used later for the cpuset case, which requires different locking
than the generic case.

Cc: Andrea Arcangeli <andrea@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   53 ++++++++++++++++++++++++++++-------------------------
 1 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -390,6 +390,32 @@ static int oom_is_deadlocked(unsigned long *last_tif_memdie)
 	return 1;
 }
 
+static void select_and_kill_process(gfp_t gfp_mask, int order, int constraint)
+{
+	struct task_struct *p;
+	unsigned long points = 0;
+
+retry:
+	p = select_bad_process(&points, constraint);
+	/* Found nothing?!?! Either we hang forever, or we panic. */
+	if (unlikely(!p)) {
+		/*
+		 * We shouldn't panic the entire system if we can't find any
+		 * eligible tasks to kill in a cpuset-constrained OOM
+		 * condition.  Instead, we do nothing and allow other cpusets
+		 * to continue.
+		 */
+		if (constraint == CONSTRAINT_CPUSET)
+			return;
+		read_unlock(&tasklist_lock);
+		cpuset_unlock();
+		panic("Out of memory and no killable processes...\n");
+	}
+
+	if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
+		goto retry;
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  *
@@ -400,8 +426,6 @@ static int oom_is_deadlocked(unsigned long *last_tif_memdie)
  */
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 {
-	struct task_struct *p;
-	unsigned long points = 0;
 	unsigned long freed = 0;
 	int constraint;
 	static DECLARE_MUTEX(OOM_lock);
@@ -424,7 +448,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
 		read_lock(&tasklist_lock);
-		oom_kill_process(current, points,
+		oom_kill_process(current, 0,
 				 "No available memory (MPOL_BIND)", gfp_mask, order);
 		read_unlock(&tasklist_lock);
 		break;
@@ -449,29 +473,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 			cpuset_unlock();
 			panic("out of memory. panic_on_oom is selected\n");
 		}
-retry:
-		/*
-		 * Rambo mode: Shoot down a process and hope it solves whatever
-		 * issues we may have.
-		 */
-		p = select_bad_process(&points, constraint);
-		/* Found nothing?!?! Either we hang forever, or we panic. */
-		if (unlikely(!p)) {
-			/*
-			 * We shouldn't panic the entire system if we can't
-			 * find any eligible tasks to kill in a
-			 * cpuset-constrained OOM condition.  Instead, we do
-			 * nothing and allow other cpusets to continue.
-			 */
-			if (constraint == CONSTRAINT_CPUSET)
-				goto out;
-			read_unlock(&tasklist_lock);
-			cpuset_unlock();
-			panic("Out of memory and no killable processes...\n");
-		}
 
-		if (oom_kill_process(p, points, "Out of memory", gfp_mask, order))
-			goto retry;
+		select_and_kill_process(gfp_mask, order, constraint);
 
 	out:
 		read_unlock(&tasklist_lock);
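The select-then-retry flow this helper factors out can be sketched as
below.  The candidate array and the failure flag are hypothetical; the
real code re-runs select_bad_process(), which will skip a victim that is
already exiting:

```c
#include <assert.h>
#include <stddef.h>

struct cand {
	int eligible;
	int kill_fails;	/* simulate oom_kill_process() returning nonzero */
};

/* If killing the chosen candidate fails, exclude it and choose again;
 * return -1 when nothing is eligible (the cpuset case then becomes a
 * no-op instead of a panic). */
static int select_and_kill(struct cand *c, size_t n)
{
	for (;;) {
		int best = -1;

		for (size_t i = 0; i < n; i++)
			if (c[i].eligible) {
				best = (int)i;
				break;
			}
		if (best < 0)
			return -1;
		if (!c[best].kill_fails)
			return best;		/* kill succeeded */
		c[best].eligible = 0;		/* retry without it */
	}
}
```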


* [patch 4/4] oom: serialize for cpusets
  2007-06-27 14:44   ` [patch 3/4] oom: extract select helper function David Rientjes
@ 2007-06-27 14:44     ` David Rientjes
  2007-06-27 21:53       ` Christoph Lameter
  2007-06-28 20:41       ` [patch 5/4] oom: add oom_kill_asking_task flag David Rientjes
  0 siblings, 2 replies; 25+ messages in thread
From: David Rientjes @ 2007-06-27 14:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Christoph Lameter, linux-mm

Adds a last_tif_memdie_jiffies field to struct cpuset to store the
jiffies value at the last OOM kill.  This is used to detect deadlocks in
the CONSTRAINT_CPUSET case and kill another task if one is detected.

Adds a CS_OOM bit to struct cpuset's flags field.  This will be tested,
set, and cleared atomically to denote a cpuset that currently has an
attached task exiting as a result of the OOM killer.  We are required to
take p->alloc_lock to dereference p->cpuset so this cannot be implemented
as a simple trylock.

As a result, we cannot allow the detachment of a task from a cpuset that
is currently OOM killing one of its tasks.  If we did, we would end up
clearing the CS_OOM bit in the wrong cpuset upon that task's exit.

sysctl's panic_on_oom is now honored only in the non-cpuset-constrained
case.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/cpuset.h |   19 ++++++++++++++
 kernel/cpuset.c        |   65 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/exit.c          |    1 +
 mm/oom_kill.c          |   21 ++++++++++++++-
 4 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -47,6 +47,12 @@ static int inline cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
 
 extern int cpuset_excl_nodes_overlap(const struct task_struct *p);
 
+extern unsigned long cpuset_get_last_tif_memdie(struct task_struct *task);
+extern void cpuset_set_last_tif_memdie(struct task_struct *task,
+				       unsigned long last_tif_memdie);
+extern int cpuset_set_oom(struct task_struct *task);
+extern void cpuset_clear_oom(struct task_struct *task);
+
 #define cpuset_memory_pressure_bump() 				\
 	do {							\
 		if (cpuset_memory_pressure_enabled)		\
@@ -118,6 +124,19 @@ static inline int cpuset_excl_nodes_overlap(const struct task_struct *p)
 	return 1;
 }
 
+static inline unsigned long cpuset_get_last_tif_memdie(struct task_struct *task)
+{
+	return jiffies;
+}
+static inline void cpuset_set_last_tif_memdie(struct task_struct *task,
+					      unsigned long last_tif_memdie) {}
+
+static inline int cpuset_set_oom(struct task_struct *task)
+{
+	return 0;
+}
+static inline void cpuset_clear_oom(struct task_struct *task) {}
+
 static inline void cpuset_memory_pressure_bump(void) {}
 
 static inline char *cpuset_task_status_allowed(struct task_struct *task,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -98,6 +98,12 @@ struct cpuset {
 	int mems_generation;
 
 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	/*
+	 * The jiffies at the last time TIF_MEMDIE was set for a task
+	 * associated with this cpuset.
+	 */
+	unsigned long last_tif_memdie_jiffies;
 };
 
 /* bits in struct cpuset flags field */
@@ -109,6 +115,7 @@ typedef enum {
 	CS_NOTIFY_ON_RELEASE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_OOM,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -147,6 +154,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_oom(const struct cpuset *cs)
+{
+	return test_bit(CS_OOM, &cs->flags);
+}
+
 /*
  * Increment this integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -1251,10 +1263,16 @@ static int attach_task(struct cpuset *cs, char *pidbuf, char **ppathbuf)
 	 * then fail this attach_task(), to avoid breaking top_cpuset.count.
 	 */
 	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		mutex_unlock(&callback_mutex);
-		put_task_struct(tsk);
-		return -ESRCH;
+		retval = -ESRCH;
+		goto error;
+	}
+	/*
+	 * If the task's cpuset is currently in the OOM killer, we cannot
+	 * move it or we'll clear the CS_OOM flag in the new cpuset.
+	 */
+	if (unlikely(is_oom(oldcs))) {
+		retval = -EBUSY;
+		goto error;
 	}
 	atomic_inc(&cs->count);
 	rcu_assign_pointer(tsk->cpuset, cs);
@@ -1281,6 +1299,12 @@ static int attach_task(struct cpuset *cs, char *pidbuf, char **ppathbuf)
 	if (atomic_dec_and_test(&oldcs->count))
 		check_for_release(oldcs, ppathbuf);
 	return 0;
+
+error:
+	task_unlock(tsk);
+	mutex_unlock(&callback_mutex);
+	put_task_struct(tsk);
+	return retval;
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2600,6 +2624,39 @@ done:
 	return overlap;
 }
 
+unsigned long cpuset_get_last_tif_memdie(struct task_struct *task)
+{
+	unsigned long ret;
+	task_lock(task);
+	ret = task->cpuset->last_tif_memdie_jiffies;
+	task_unlock(task);
+	return ret;
+}
+
+void cpuset_set_last_tif_memdie(struct task_struct *task,
+				unsigned long last_tif_memdie)
+{
+	task_lock(task);
+	task->cpuset->last_tif_memdie_jiffies = last_tif_memdie;
+	task_unlock(task);
+}
+
+int cpuset_set_oom(struct task_struct *task)
+{
+	int ret;
+	task_lock(task);
+	ret = test_and_set_bit(CS_OOM, &task->cpuset->flags);
+	task_unlock(task);
+	return ret;
+}
+
+void cpuset_clear_oom(struct task_struct *task)
+{
+	task_lock(task);
+	clear_bit(CS_OOM, &task->cpuset->flags);
+	task_unlock(task);
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -853,6 +853,7 @@ static void exit_notify(struct task_struct *tsk)
 	if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) {
 		extern unsigned long VM_is_OOM;
 		clear_bit(0, &VM_is_OOM);
+		cpuset_clear_oom(tsk);
 	}
 
 	write_unlock_irq(&tasklist_lock);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -427,6 +427,7 @@ retry:
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 {
 	unsigned long freed = 0;
+	unsigned long last_tif_memdie;
 	int constraint;
 	static DECLARE_MUTEX(OOM_lock);
 
@@ -454,6 +455,22 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 		break;
 
 	case CONSTRAINT_CPUSET:
+		read_lock(&tasklist_lock);
+		last_tif_memdie = cpuset_get_last_tif_memdie(current);
+		/*
+		 * If current's cpuset is already OOM killing (its victim
+		 * has not yet exited) and a deadlock has not been
+		 * detected, do nothing.
+		 */
+		if (unlikely(cpuset_set_oom(current)) &&
+		    !oom_is_deadlocked(&last_tif_memdie))
+			goto out_cpuset;
+		cpuset_set_last_tif_memdie(current, last_tif_memdie);
+		select_and_kill_process(gfp_mask, order, constraint);
+
+	out_cpuset:
+		read_unlock(&tasklist_lock);
+		break;
 	case CONSTRAINT_NONE:
 		if (down_trylock(&OOM_lock))
 			break;
@@ -466,7 +483,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 		 */
 		if (unlikely(test_bit(0, &VM_is_OOM)) &&
 		    !oom_is_deadlocked(&last_tif_memdie_jiffies))
-			goto out;
+			goto out_none;
 
 		if (sysctl_panic_on_oom) {
 			read_unlock(&tasklist_lock);
@@ -476,7 +493,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 
 		select_and_kill_process(gfp_mask, order, constraint);
 
-	out:
+	out_none:
 		read_unlock(&tasklist_lock);
 		up(&OOM_lock);
 		break;
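The CS_OOM serialization can be modeled with C11 atomics: test_and_set_bit()
returns the bit's previous value, so exactly one caller "wins" the OOM path
until the victim's exit clears the bit.  The flag word and bit position here
are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong cs_flags;		/* models cpuset->flags */
#define CS_OOM_BIT (1UL << 7)		/* illustrative bit position */

/* Like cpuset_set_oom(): atomically set the bit and return its old
 * value.  A nonzero return means an OOM kill is already in flight
 * for this cpuset, so the caller should back off. */
static int set_oom(void)
{
	return (atomic_fetch_or(&cs_flags, CS_OOM_BIT) & CS_OOM_BIT) != 0;
}

/* Like cpuset_clear_oom(), run when the killed task finally exits. */
static void clear_oom(void)
{
	atomic_fetch_and(&cs_flags, ~CS_OOM_BIT);
}
```

This mirrors the shape of the patch's "if (unlikely(cpuset_set_oom(current))
&& !oom_is_deadlocked(...)) goto out_cpuset;" test: the losing caller does
nothing unless the timeout says the winner's victim is stuck.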


* Re: [patch 2/4] oom: select process to kill for cpusets
  2007-06-27 14:44 ` [patch 2/4] oom: select process to kill for cpusets David Rientjes
  2007-06-27 14:44   ` [patch 3/4] oom: extract select helper function David Rientjes
@ 2007-06-27 21:52   ` Christoph Lameter
  2007-06-28  6:13     ` David Rientjes
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2007-06-27 21:52 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, pj

Paul needs to review this too I think. Some comments below.

On Wed, 27 Jun 2007, David Rientjes wrote:

> @@ -423,12 +430,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
>  		break;
>  
>  	case CONSTRAINT_CPUSET:
> -		read_lock(&tasklist_lock);
> -		oom_kill_process(current, points,
> -				 "No available memory in cpuset", gfp_mask, order);
> -		read_unlock(&tasklist_lock);
> -		break;
> -
>  	case CONSTRAINT_NONE:
>  		if (down_trylock(&OOM_lock))
>  			break;

Would be better if this would now become an "if" instead of "switch". You 
only got two branches.

> @@ -453,9 +454,17 @@ retry:
>  		 * Rambo mode: Shoot down a process and hope it solves whatever
>  		 * issues we may have.
>  		 */
> -		p = select_bad_process(&points);
> +		p = select_bad_process(&points, constraint);
>  		/* Found nothing?!?! Either we hang forever, or we panic. */
>  		if (unlikely(!p)) {
> +			/*
> +			 * We shouldn't panic the entire system if we can't
> +			 * find any eligible tasks to kill in a
> +			 * cpuset-constrained OOM condition.  Instead, we do
> +			 * nothing and allow other cpusets to continue.
> +			 */
> +			if (constraint == CONSTRAINT_CPUSET)
> +				goto out;

Put something into the syslog to note the strange condition?


* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-27 14:44     ` [patch 4/4] oom: serialize for cpusets David Rientjes
@ 2007-06-27 21:53       ` Christoph Lameter
  2007-06-27 22:13         ` Paul Jackson
  2007-06-28  0:26         ` Andrea Arcangeli
  2007-06-28 20:41       ` [patch 5/4] oom: add oom_kill_asking_task flag David Rientjes
  1 sibling, 2 replies; 25+ messages in thread
From: Christoph Lameter @ 2007-06-27 21:53 UTC (permalink / raw)
  To: Andrea Arcangeli, David Rientjes; +Cc: Andrew Morton, linux-mm, pj

Paul have you seen this?



* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-27 21:53       ` Christoph Lameter
@ 2007-06-27 22:13         ` Paul Jackson
  2007-06-28  6:24           ` David Rientjes
  2007-06-28  0:26         ` Andrea Arcangeli
  1 sibling, 1 reply; 25+ messages in thread
From: Paul Jackson @ 2007-06-27 22:13 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: andrea, rientjes, akpm, linux-mm

> Paul have you seen this?

I saw it, but I'm up to my eyeballs in unrelated stuff,
and had to avoid thinking about it enough to be of any
use.

I did have this vague recollection that I had seen something
like this before, and it got shot down, because even tasks
in entirely nonoverlapping cpusets might be holding memory
resources on the nodes where we're running out of memory.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-27 21:53       ` Christoph Lameter
  2007-06-27 22:13         ` Paul Jackson
@ 2007-06-28  0:26         ` Andrea Arcangeli
  1 sibling, 0 replies; 25+ messages in thread
From: Andrea Arcangeli @ 2007-06-28  0:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, Andrew Morton, linux-mm, pj

On Wed, Jun 27, 2007 at 02:53:27PM -0700, Christoph Lameter wrote:
> Andrea: How does this jive with your patches?

Good! it's incremental to it.


* Re: [patch 2/4] oom: select process to kill for cpusets
  2007-06-27 21:52   ` [patch 2/4] oom: select process to kill for cpusets Christoph Lameter
@ 2007-06-28  6:13     ` David Rientjes
  0 siblings, 0 replies; 25+ messages in thread
From: David Rientjes @ 2007-06-28  6:13 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, pj

On Wed, 27 Jun 2007, Christoph Lameter wrote:

> > @@ -423,12 +430,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> >  		break;
> >  
> >  	case CONSTRAINT_CPUSET:
> > -		read_lock(&tasklist_lock);
> > -		oom_kill_process(current, points,
> > -				 "No available memory in cpuset", gfp_mask, order);
> > -		read_unlock(&tasklist_lock);
> > -		break;
> > -
> >  	case CONSTRAINT_NONE:
> >  		if (down_trylock(&OOM_lock))
> >  			break;
> 
> Would be better if this would now become an "if" instead of "switch". You 
> only got two branches.
> 

The fourth patch in the series actually uses CONSTRAINT_CPUSET differently 
than CONSTRAINT_NONE when the select_bad_process()->oom_kill_process() 
calls get moved to a helper function.  The difference is due to the 
locking that we require: in the CONSTRAINT_CPUSET case, we need to "lock" 
the CS_OOM flag in p->cpuset->flags and in the CONSTRAINT_NONE case, we 
need to lock Andrea's OOM_lock.  So maintaining the switch clause is 
better but, if patches 3-4 aren't applied, we can certainly change it to 
an if.

> > @@ -453,9 +454,17 @@ retry:
> >  		 * Rambo mode: Shoot down a process and hope it solves whatever
> >  		 * issues we may have.
> >  		 */
> > -		p = select_bad_process(&points);
> > +		p = select_bad_process(&points, constraint);
> >  		/* Found nothing?!?! Either we hang forever, or we panic. */
> >  		if (unlikely(!p)) {
> > +			/*
> > +			 * We shouldn't panic the entire system if we can't
> > +			 * find any eligible tasks to kill in a
> > +			 * cpuset-constrained OOM condition.  Instead, we do
> > +			 * nothing and allow other cpusets to continue.
> > +			 */
> > +			if (constraint == CONSTRAINT_CPUSET)
> > +				goto out;
> 
> Put something into the syslog to note the strange condition?
> 

Sure.  Unfortunately there's probably a high likelihood that we'll never 
get out of the OOM condition for that cpuset, so we'd need to check the 
time elapsed since p->cpuset->last_tif_memdie_jiffies and print the 
diagnostic at a set interval.  A static local variable doesn't work to 
limit the printk because it's also plausible that we can OOM, get stuck 
without any killable tasks, eventually free some memory, and then OOM 
again later in that cpuset.

		David


* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-27 22:13         ` Paul Jackson
@ 2007-06-28  6:24           ` David Rientjes
  2007-06-28  7:33             ` Paul Jackson
  0 siblings, 1 reply; 25+ messages in thread
From: David Rientjes @ 2007-06-28  6:24 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Christoph Lameter, andrea, akpm, linux-mm

On Wed, 27 Jun 2007, Paul Jackson wrote:

> I did have this vague recollection that I had seen something
> like this before, and it got shot down, because even tasks
> in entirely nonoverlapping cpusets might be holding memory
> resources on the nodes where we're running out of memory.
> 

There's only three cases I'm aware of (and correct me if I'm wrong) where 
that can happen: the GFP_ATOMIC exception, tasks that have switched their 
cpuset attachment, or a change in p->mems_allowed and left pages behind in 
other nodes with memory_migrate set to 0.

My patches do nothing but improve the behavior because what mainline 
does right now is simply kill current.  If that doesn't work, for whatever 
reason, in oom_kill_process() because its mm is detached, its 
OOM_DISABLE'd, etc, then the OOM killer becomes a no-op and we rely on 
another task to also fail a memory allocation later to enter the OOM 
killer and hope that it is killable.  So, unless we have two OOM'ing 
cpusets on the system and current turns out to hold memory allocations on 
another because of one of the three reasons above (which are unlikely), 
that is the only time when it would benefit the other OOM'ing cpuset to 
kill current.

		David


* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28  6:24           ` David Rientjes
@ 2007-06-28  7:33             ` Paul Jackson
  2007-06-28  8:05               ` David Rientjes
  0 siblings, 1 reply; 25+ messages in thread
From: Paul Jackson @ 2007-06-28  7:33 UTC (permalink / raw)
  To: David Rientjes; +Cc: clameter, andrea, akpm, linux-mm

> > I did have this vague recollection that I had seen something
> > like this before, and it got shot down, because even tasks
> > in entirely nonoverlapping cpusets might be holding memory
> > resources on the nodes where we're running out of memory.
> > 
> 
> There's only three cases I'm aware of (and correct me if I'm wrong) where 
> that can happen: the GFP_ATOMIC exception, tasks that have switched their 
> cpuset attachment, or a change in p->mems_allowed and left pages behind in 
> other nodes with memory_migrate set to 0.

Perhaps also shared memory - shared with another task in another cpuset
that originally placed the page, then exited, leaving the current task
as the only one holding it.

Yes - that too would not be a common occurrence.

Christoph Lameter had a reply to Andrea Arcangeli on linux-mm (11 Jun 2007)
in a similar discussion, that might be relevant here:

=============================== begin snip ===============================
On Sat, 9 Jun 2007, Andrea Arcangeli wrote:

> On a side note about the current way you select the task to kill if a
> constrained alloc failure triggers, I think it would have been better
> if you simply extended the oom-selector by filtering tasks in function
> of the current->mems_allowed. Now I agree the current badness is quite

Filtering tasks is a very expensive operation on huge systems. We have had
cases where it took an hour or so for the OOM to complete. OOM usually
occurs under heavy processing loads which makes the taking of global locks
quite expensive.

> bad, now with rss instead of the virtual space, it works a bit better
> at least, but the whole point is that if you integrate the cpuset task
> filtering in the oom-selector algorithm, then once we fix the badness
> algorithm to actually do something more meaningful than to check
> static values, you'll get the better algorithm working for your
> local-oom killing too. This if you really care about the huge-numa
> niche to get node-partitioning working really like if this was a
> virtualized environment. If you just have kill something to release
> memory, killing the current task is always the safest choice
> obviously, so as your customers are ok with it I'm certainly fine with
> the current approach too.

The "kill-the-current-process" approach is most effective in hitting the
process that is allocating the most. And as far as I can tell its easiest
to understand for our customer.
================================ end snip ================================

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28  7:33             ` Paul Jackson
@ 2007-06-28  8:05               ` David Rientjes
  2007-06-28  9:03                 ` Paul Jackson
  0 siblings, 1 reply; 25+ messages in thread
From: David Rientjes @ 2007-06-28  8:05 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, andrea, akpm, linux-mm

On Thu, 28 Jun 2007, Paul Jackson wrote:

> > There's only three cases I'm aware of (and correct me if I'm wrong) where 
> > that can happen: the GFP_ATOMIC exception, tasks that have switched their 
> > cpuset attachment, or a change in p->mems_allowed and left pages behind in 
> > other nodes with memory_migrate set to 0.
> 
> Perhaps also shared memory - shared with another task in another cpuset
> that originally placed the page, then exited, leaving the current task
> as the only one holding it.
> 

That's possible, but then the user gets what he deserves because he's 
chosen to share memory across cpusets.  Without an expensive search 
through all system tasks and an examination of their associated mm's to 
determine whether each has allocated memory on an OOM'ing node, we can't 
fix that problem.  And, even then, this would only help if that task were 
the only user of such memory.  Otherwise we kill it unnecessarily.

From Christoph Lameter:
> Filtering tasks is a very expensive operation on huge systems. We have had
> cases where it took an hour or so for the OOM to complete. OOM usually
> occurs under heavy processing loads which makes the taking of global locks
> quite expensive.
> 

The cost of this operation as I've enabled it in the OOM killer would be 
similar to cat'ing /dev/cpuset/my_cpuset/tasks, except that it will take 
slightly longer if we have an elaborate hierarchy of non-mem_exclusive 
cpusets.  We need to hold read_lock(&tasklist_lock) and callback_mutex 
for this, but I would argue that if that's perfectly legitimate for a 
system-wide (CONSTRAINT_NONE) OOM condition, then it should be 
legitimate for a cpuset-wide (CONSTRAINT_CPUSET) OOM.  Surely not all 
"huge systems" use cpusets currently, and they must not be affected by 
this contention under current mainline behavior or we'd hear complaints.  
The OOM killer does not act egregiously.

Also from Christoph Lameter:
> The "kill-the-current-process" approach is most effective in hitting the
> process that is allocating the most. And as far as I can tell its easiest
> to understand for our customer.

Hmm, it probably goes without saying that I disagree with the first 
sentence; otherwise I wouldn't have written the patchset.  There is 
actually no guarantee at all that current is allocating the most; it 
could have just attempted to allocate at a very unfortunate time and 
ended up being the sacrificial lamb in the OOM killer.

A much better way to determine the best task to kill is through the 
select_bad_process() heuristics, which take things such as OOM_DISABLE 
and total VM size into account when scoring tasks.  It's certainly the 
fairest way of choosing a task to kill that will, hopefully, alleviate 
the OOM condition as soon as possible for that cpuset.  I would argue 
that going through select_bad_process() is not as great a performance 
hit as you might suspect compared with git HEAD's behavior of trying to 
kill current, which may be ineligible for several different reasons, 
making out_of_memory() a no-op, looping back to __alloc_pages(), 
rescheduling, and spinning until an eligible task does hit 
out_of_memory(), which requires an explicit memory allocation attempt 
from that task to even occur.
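
To make the selection policy being argued for here concrete, a 
user-space sketch follows (illustrative only; the real 
select_bad_process() also weighs runtime, nice level, capabilities and 
more, and walks the real task list under tasklist_lock):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task record; the kernel reads mm->total_vm
 * and other state from each task instead. */
struct task {
	const char *comm;
	unsigned long total_vm;   /* VM size: the dominant badness input */
	int oom_disable;          /* analogue of OOM_DISABLE */
};

/* Pick the eligible task with the largest VM size; NULL if every task
 * is exempt from OOM killing. */
static struct task *select_victim(struct task *tasks, size_t n)
{
	struct task *best = NULL;
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].oom_disable)
			continue;          /* never kill an exempt task */
		if (!best || tasks[i].total_vm > best->total_vm)
			best = &tasks[i];
	}
	return best;
}
```

Here a memory hog is chosen over small daemons like sshd regardless of 
which task happened to be asking for memory when the OOM hit.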

		David

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28  8:05               ` David Rientjes
@ 2007-06-28  9:03                 ` Paul Jackson
  2007-06-28 18:13                   ` David Rientjes
  0 siblings, 1 reply; 25+ messages in thread
From: Paul Jackson @ 2007-06-28  9:03 UTC (permalink / raw)
  To: David Rientjes; +Cc: clameter, andrea, akpm, linux-mm

David wrote:
> That's possible, but then the user gets what he deserves because he's 
> chosen to share memory across cpusets.

Well, actually, I've detested System V Shared Memory for over 20 years
now -- can't say as I'm going to lose any sleep over it either way.
I just couldn't resist your challenge to state another example of a
situation in which a task ends up being the sole owner of memory outside
its current cpuset.

I don't really have a strong sense on your proposed change to the OOM
behaviour.  As you probably know better than I, the oom handler is a
house of cards, and I'm more concerned that continued tweaking it is
just a shell game, moving the cases that work better or worse to and
fro.

I'll have to leave it to Christoph to justify further the current
behaviour in these cases, since he's the one who did it.

Do you have real world cases where your change is necessary?  Perhaps
you could describe those scenarios a bit, so that we can separate out
what's going wrong, from the possible remedies, and so we can get a
sense of the importance of this proposed tweak.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28  9:03                 ` Paul Jackson
@ 2007-06-28 18:13                   ` David Rientjes
  2007-06-28 18:55                     ` Paul Jackson
  2007-06-29  1:33                     ` Christoph Lameter
  0 siblings, 2 replies; 25+ messages in thread
From: David Rientjes @ 2007-06-28 18:13 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, andrea, akpm, linux-mm

On Thu, 28 Jun 2007, Paul Jackson wrote:

> Do you have real world cases where your change is necessary?  Perhaps
> you could describe those scenarios a bit, so that we can separate out
> what's going wrong, from the possible remedies, and so we can get a
> sense of the importance of this proposed tweak.
> 

It's pretty simple to show how killing current is not the best choice for 
cpuset-constrained memory allocations that encounter an OOM condition.

If you attach all your system tasks to a single small node and then 
attempt to allocate large amounts of memory in that node, tasks get killed 
unnecessarily.  This is a good way to approximate a cpuset's memory 
pressure in real-world examples.  The actual rogue task can avoid getting 
killed by simply not allocating the last N kB in that node while other 
tasks, such as sshd or sendmail, require memory on a spurious basis.  So 
we've often seen tasks such as those get OOM killed even though they don't 
alleviate the condition much at all: sshd and sendmail are not normally 
memory hogs.

The much better policy in terms of sharing memory among a cpuset's tasks 
is to kill the actual rogue task, which we can estimate pretty well with 
select_bad_process() since it takes into consideration, most importantly, 
the total VM size.

So my belief is that it is better to kill one large memory-hogging task in 
a cpuset instead of killing multiple smaller ones based on their 
scheduling and unfortunate luck of being the one to enter the OOM killer.  
Even worse is when the OOM killer, which is not at all serialized for 
cpuset-constrained allocations at present, kills multiple smaller tasks 
before killing the rogue task.  Then those previous kills were unnecessary 
and certainly would qualify as a strong example for why current git's 
behavior is broken.

		David

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28 18:13                   ` David Rientjes
@ 2007-06-28 18:55                     ` Paul Jackson
  2007-06-28 19:27                       ` Paul Menage
  2007-06-28 20:43                       ` David Rientjes
  2007-06-29  1:33                     ` Christoph Lameter
  1 sibling, 2 replies; 25+ messages in thread
From: Paul Jackson @ 2007-06-28 18:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: clameter, andrea, akpm, linux-mm

> So my belief is that it is better to kill one large memory-hogging task in 
> a cpuset instead of killing multiple smaller ones based on their 
> scheduling and unfortunate luck of being the one to enter the OOM killer.  

So ... sounds like you're arguing from principle, not from having some
specific customer biting at your heels.

Ok.

My guess is that what happened here is that, back in the earlier days of
cpusets, when most people looked on it with a bit of suspicion, and few
depended on it for their livelihood, my good colleague Christoph Lameter
got in a change to avoid the OOM scoring for cpuset constrained tasks,
and instead just shoot the messenger - kill the task asking for the
memory.

At the time, I was a little surprised he got that change in, but it
suited the needs of our particular customers, so I happily kept quiet.

Now, cpusets are finding wider usage, in ways I never would have
imagined.  I guess that, by and large, I did a good enough job of
designing them from principle rather than from the specific needs of my
(rather odd) customers, and now cpusets have found other users and
uses.  Good.

But this particular change, avoiding OOM scoring on cpuset constrained
tasks, is probably more of a heuristic that will fit some uses, not others.

Sounds like it's time for a tunable, to determine whether or not to OOM
score cpuset constrained tasks.

As for the default of the tunable, I could argue that it should default to
the current behaviour (avoid OOM scoring cpuset constrained tasks), for
compatibility.  And you could argue that it should default to OOM scoring
even cpuset constrained tasks, because that is what a larger set of users
will expect and need.  Grumble, grumble ... I wish I could think of a
reason why you'd be wrong to say that, and thereby save myself some work
adding hooks to my user space code to flip this tunable to the way I need
it (don't OOM score cpuset constrained tasks.) ... so far nothing ... drat.

Would you like to propose a patch, adding a per-cpuset Boolean flag
that has inheritance properties similar to the memory_spread_* flags?
Set at the top and inherited on cpuset creation; overridable per-cpuset.

How about calling it "oom_kill_asking_task", defaulting to 0 (the
default you will like, not the one I will use for my customers.)

You can leave it up to me to provide such a patch, but I can pretty much
promise I won't get to it for the next two or three months, at least.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28 18:55                     ` Paul Jackson
@ 2007-06-28 19:27                       ` Paul Menage
  2007-06-28 20:15                         ` Paul Jackson
  2007-06-28 20:43                       ` David Rientjes
  1 sibling, 1 reply; 25+ messages in thread
From: Paul Menage @ 2007-06-28 19:27 UTC (permalink / raw)
  To: Paul Jackson; +Cc: David Rientjes, clameter, andrea, akpm, linux-mm

On 6/28/07, Paul Jackson <pj@sgi.com> wrote:
> Would you like to propose a patch, adding a per-cpuset Boolean flag
> that has inheritance properties similar to the memory_spread_* flags?
> Set at the top and inherited on cpuset creation; overridable per-cpuset.
>
> How about calling it "oom_kill_asking_task", defaulting to 0 (the
> default you will like, not the one I will use for my customers.)

Seems that this could be a system global, with just the control file
in the top-level cpuset directory. I can't see people wanting
different behaviour in different cpusets at the same time.

Paul

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28 19:27                       ` Paul Menage
@ 2007-06-28 20:15                         ` Paul Jackson
  0 siblings, 0 replies; 25+ messages in thread
From: Paul Jackson @ 2007-06-28 20:15 UTC (permalink / raw)
  To: Paul Menage; +Cc: rientjes, clameter, andrea, akpm, linux-mm

> Seems that this could be a system global, with just the control file
> in the top-level cpuset directory. I can't see people wanting
> different behaviour in different cpusets at the same time.

Perhaps -- if it is just as easy either way, then I'd go with
the inherited property, just because we do that with every
other cpuset property, except for one, memory_pressure_enabled,
which had to be system global, because the place it was used
could not get to per-cpuset state.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

* [patch 5/4] oom: add oom_kill_asking_task flag
  2007-06-27 14:44     ` [patch 4/4] oom: serialize for cpusets David Rientjes
  2007-06-27 21:53       ` Christoph Lameter
@ 2007-06-28 20:41       ` David Rientjes
  2007-06-28 22:07         ` Paul Jackson
  1 sibling, 1 reply; 25+ messages in thread
From: David Rientjes @ 2007-06-28 20:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Paul Jackson, Christoph Lameter, linux-mm

Adds an oom_kill_asking_task flag to cpusets.  If unset (the default), we
iterate through the task list via select_bad_process() during a
cpuset-constrained OOM to find the best candidate task to kill.  If set,
we simply kill current instead, avoiding that overhead; this is needed by
some customers with a large number of threads or heavy workloads.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cpusets.txt |    3 +++
 include/linux/cpuset.h    |    5 +++++
 kernel/cpuset.c           |   39 ++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c             |    7 +++++++
 4 files changed, 53 insertions(+), 1 deletions(-)

diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -181,6 +181,9 @@ containing the following files describing that cpuset:
  - tasks: list of tasks (by pid) attached to that cpuset
  - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
  - memory_pressure: measure of how much paging pressure in cpuset
+ - oom_kill_asking_task flag: when this cpuset OOM's, should we kill
+	the task that asked for the memory or should we iterate through
+	the task list to find the best task to kill (can be expensive)?
 
 In addition, the root cpuset only has the following file:
  - memory_pressure_enabled flag: compute memory_pressure?
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -52,6 +52,7 @@ extern void cpuset_set_last_tif_memdie(struct task_struct *task,
 				       unsigned long last_tif_memdie);
 extern int cpuset_set_oom(struct task_struct *task);
 extern void cpuset_clear_oom(struct task_struct *task);
+extern int cpuset_oom_kill_asking_task(struct task_struct *task);
 
 #define cpuset_memory_pressure_bump() 				\
 	do {							\
@@ -136,6 +137,10 @@ static inline int cpuset_set_oom(struct task_struct *task)
 	return 0;
 }
 static inline void cpuset_clear_oom(struct task_struct *task) {}
+static inline int cpuset_oom_kill_asking_task(struct task_struct *task)
+{
+	return 0;
+}
 
 static inline void cpuset_memory_pressure_bump(void) {}
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -116,6 +116,7 @@ typedef enum {
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
 	CS_OOM,
+	CS_OOM_KILL_ASKING_TASK,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -159,6 +160,11 @@ static inline int is_oom(const struct cpuset *cs)
 	return test_bit(CS_OOM, &cs->flags);
 }
 
+static inline int is_oom_kill_asking_task(const struct cpuset *cs)
+{
+	return test_bit(CS_OOM_KILL_ASKING_TASK, &cs->flags);
+}
+
 /*
  * Increment this integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -1068,7 +1074,8 @@ static int update_memory_pressure_enabled(struct cpuset *cs, char *buf)
  * update_flag - read a 0 or a 1 in a file and update associated flag
  * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
  *				CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
- *				CS_SPREAD_PAGE, CS_SPREAD_SLAB)
+ *				CS_SPREAD_PAGE, CS_SPREAD_SLAB,
+ *				CS_OOM_KILL_ASKING_TASK)
  * cs:	the cpuset to update
  * buf:	the buffer where we read the 0 or 1
  *
@@ -1320,6 +1327,7 @@ typedef enum {
 	FILE_NOTIFY_ON_RELEASE,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
+	FILE_OOM_KILL_ASKING_TASK,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
 	FILE_TASKLIST,
@@ -1382,6 +1390,9 @@ static ssize_t cpuset_common_file_write(struct file *file,
 	case FILE_MEMORY_PRESSURE:
 		retval = -EACCES;
 		break;
+	case FILE_OOM_KILL_ASKING_TASK:
+		retval = update_flag(CS_OOM_KILL_ASKING_TASK, cs, buffer);
+		break;
 	case FILE_SPREAD_PAGE:
 		retval = update_flag(CS_SPREAD_PAGE, cs, buffer);
 		cs->mems_generation = cpuset_mems_generation++;
@@ -1499,6 +1510,9 @@ static ssize_t cpuset_common_file_read(struct file *file, char __user *buf,
 	case FILE_MEMORY_PRESSURE:
 		s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
 		break;
+	case FILE_OOM_KILL_ASKING_TASK:
+		*s++ = is_oom_kill_asking_task(cs) ? '1' : '0';
+		break;
 	case FILE_SPREAD_PAGE:
 		*s++ = is_spread_page(cs) ? '1' : '0';
 		break;
@@ -1863,6 +1877,11 @@ static struct cftype cft_memory_pressure = {
 	.private = FILE_MEMORY_PRESSURE,
 };
 
+static struct cftype cft_oom_kill_asking_task = {
+	.name = "oom_kill_asking_task",
+	.private = FILE_OOM_KILL_ASKING_TASK,
+};
+
 static struct cftype cft_spread_page = {
 	.name = "memory_spread_page",
 	.private = FILE_SPREAD_PAGE,
@@ -1891,6 +1910,8 @@ static int cpuset_populate_dir(struct dentry *cs_dentry)
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_memory_pressure)) < 0)
 		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_oom_kill_asking_task)) < 0)
+		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_spread_page)) < 0)
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0)
@@ -1923,6 +1944,8 @@ static long cpuset_create(struct cpuset *parent, const char *name, int mode)
 	cs->flags = 0;
 	if (notify_on_release(parent))
 		set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
+	if (is_oom_kill_asking_task(parent))
+		set_bit(CS_OOM_KILL_ASKING_TASK, &cs->flags);
 	if (is_spread_page(parent))
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
@@ -2658,6 +2681,20 @@ void cpuset_clear_oom(struct task_struct *task)
 }
 
 /*
+ * Returns 1 if current should simply be killed when a cpuset-constrained OOM
+ * occurs.  Otherwise, we iterate through the task list and select the best
+ * candidate we can find.
+ */
+int cpuset_oom_kill_asking_task(struct task_struct *task)
+{
+	int ret;
+	task_lock(task);
+	ret = is_oom_kill_asking_task(task->cpuset);
+	task_unlock(task);
+	return ret;
+}
+
+/*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
  * cpuset file 'memory_pressure_enabled' in the root cpuset.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -456,6 +456,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 
 	case CONSTRAINT_CPUSET:
 		read_lock(&tasklist_lock);
+		if (cpuset_oom_kill_asking_task(current)) {
+			oom_kill_process(current, 0,
+					 "No available memory in cpuset", gfp_mask,
+					 order);
+			goto out_cpuset;
+		}
+
 		last_tif_memdie = cpuset_get_last_tif_memdie(current);
 		/*
 		 * If current's cpuset is already in the OOM killer or its killed

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28 18:55                     ` Paul Jackson
  2007-06-28 19:27                       ` Paul Menage
@ 2007-06-28 20:43                       ` David Rientjes
  1 sibling, 0 replies; 25+ messages in thread
From: David Rientjes @ 2007-06-28 20:43 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, andrea, akpm, linux-mm

On Thu, 28 Jun 2007, Paul Jackson wrote:

> Would you like to propose a patch, adding a per-cpuset Boolean flag
> that has inheritance properties similar to the memory_spread_* flags?
> Set at the top and inherited on cpuset creation; overridable per-cpuset.
> 
> How about calling it "oom_kill_asking_task", defaulting to 0 (the
> default you will like, not the one I will use for my customers.)
> 

That sounds like a good solution.  I certainly don't want to cause a 
regression for your customers where this change would cause the OOM killer 
to become excessively expensive.

I'd like an ack from Christoph on my posted patch that does this before 
it's merged, however, to make sure he thinks it's worth the addition of 
yet another cpuset flag.

Thanks for the reviews.

		David

* Re: [patch 5/4] oom: add oom_kill_asking_task flag
  2007-06-28 20:41       ` [patch 5/4] oom: add oom_kill_asking_task flag David Rientjes
@ 2007-06-28 22:07         ` Paul Jackson
  0 siblings, 0 replies; 25+ messages in thread
From: Paul Jackson @ 2007-06-28 22:07 UTC (permalink / raw)
  To: David Rientjes; +Cc: akpm, andrea, clameter, linux-mm

Good - thanks.

Acked-by: Paul Jackson <pj@sgi.com>

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-28 18:13                   ` David Rientjes
  2007-06-28 18:55                     ` Paul Jackson
@ 2007-06-29  1:33                     ` Christoph Lameter
  2007-06-29  4:07                       ` David Rientjes
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Lameter @ 2007-06-29  1:33 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, andrea, akpm, linux-mm

On Thu, 28 Jun 2007, David Rientjes wrote:

> If you attach all your system tasks to a single small node and then 
> attempt to allocate large amounts of memory in that node, tasks get killed 
> unnecessarily.  This is a good way to approximate a cpuset's memory 
> pressure in real-world examples.  The actual rogue task can avoid getting 
> killed by simply not allocating the last N kB in that node while other 
> tasks, such as sshd or sendmail, require memory on a spurious basis.  So 
> we've often seen tasks such as those get OOM killed even though they don't 
> alleviate the condition much at all: sshd and sendmail are not normally 
> memory hogs.

Yeah, but getting there seems to require intention on the part of the 
rogue tasks.

> The much better policy in terms of sharing memory among a cpuset's task is 
> to kill the actual rogue task which we can estimate pretty well with 
> select_bad_process() since it takes into consideration, most importantly, 
> the total VM size.

Sorry, that is too expensive. I did not see that initially. Thanks Paul 
for reminding me. I am at OLS and my mindshare for this is pretty 
limited right now.

> So my belief is that it is better to kill one large memory-hogging task in 
> a cpuset instead of killing multiple smaller ones based on their 
> scheduling and unfortunate luck of being the one to enter the OOM killer.  
> Even worse is when the OOM killer, which is not at all serialized for 
> cpuset-constrained allocations at present, kills multiple smaller tasks 
> before killing the rogue task.  Then those previous kills were unnecessary 
> and certainly would qualify as a strong example for why current git's 
> behavior is broken.

The current behavior will usually kill the memory-hogging task, and it 
can do so with minimal effort. If there is a whole array of 
memory-hogging tasks then the existing approach will be much easier on 
the system.

* Re: [patch 4/4] oom: serialize for cpusets
  2007-06-29  1:33                     ` Christoph Lameter
@ 2007-06-29  4:07                       ` David Rientjes
  0 siblings, 0 replies; 25+ messages in thread
From: David Rientjes @ 2007-06-29  4:07 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, andrea, akpm, linux-mm

On Thu, 28 Jun 2007, Christoph Lameter wrote:

> > If you attach all your system tasks to a single small node and then 
> > attempt to allocate large amounts of memory in that node, tasks get killed 
> > unnecessarily.  This is a good way to approximate a cpuset's memory 
> > pressure in real-world examples.  The actual rogue task can avoid getting 
> > killed by simply not allocating the last N kB in that node while other 
> > tasks, such as sshd or sendmail, require memory on a spurious basis.  So 
> > we've often seen tasks such as those get OOM killed even though they don't 
> > alleviate the condition much at all: sshd and sendmail are not normally 
> > memory hogs.
> 
> Yeah but to get there seems to require intention on the part of the 
> rogue tasks.
> 

I'm confused.  The "rogue" task has avoided getting killed because it 
just barely got the memory it required, and now other tasks, such as the 
examples of sendmail or sshd, which have only sporadic memory usage, 
will get killed with current git and probably not do a darn thing to 
alleviate the OOM condition.  That's broken behavior.

> > The much better policy in terms of sharing memory among a cpuset's task is 
> > to kill the actual rogue task which we can estimate pretty well with 
> > select_bad_process() since it takes into consideration, most importantly, 
> > the total VM size.
> 
> Sorry that is too expensive. I did not see that initially. Thanks Paul for 
> reminding me. I am at the OLS and my mindshare for this is pretty limited 
> right now.
> 

Please see patch 5, posted today, which uses select_bad_process() to 
find the task to kill by default but allows you to 
echo 1 > /dev/cpuset/my_cpuset/oom_kill_asking_task to cheaply kill 
current instead.
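
In outline, the decision the new flag introduces reduces to the 
following user-space restatement (not the kernel source; the names here 
are stand-ins for the patch's cpuset_oom_kill_asking_task() check, 
which reads the bit from task->cpuset under task_lock()):

```c
#include <assert.h>

/* Stand-ins for the two outcomes of the cpuset-constrained OOM path. */
enum oom_action { OOM_KILL_CURRENT, OOM_SCAN_TASKLIST };

/* When the per-cpuset oom_kill_asking_task bit is set, the allocating
 * task is killed immediately and the (possibly expensive)
 * select_bad_process() scan is skipped entirely. */
static enum oom_action cpuset_constrained_oom(int oom_kill_asking_task)
{
	if (oom_kill_asking_task)
		return OOM_KILL_CURRENT;   /* cheap: shoot the messenger */
	return OOM_SCAN_TASKLIST;          /* default: pick the best victim */
}
```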

And can you please explain why this is only objected to in the 
CONSTRAINT_CPUSET case and not CONSTRAINT_NONE, where it is an even more 
expensive operation?

> The current behavior will usually kill the memory hogging task and it can 
> do so with minimal effort. If there is a whole array of memory hogging 
> tasks then the existing approach will be much easier on the system.
> 

No, it kills the current task.  You cannot make the inference that the 
task that exceeded the available cpuset memory is the one that should be 
killed.  And if you do, you'll kill tasks unnecessarily regardless of 
whether the cpuset OOM killer is serialized or not.

If my application requests more memory than it should, or is leaking, 
but hasn't yet reached an OOM state, and then sshd wants a small amount 
of memory and OOMs, we kill sshd even though it's futile because my 
application will continue to leak.  Eventually we'll get around to 
killing my application because of scheduling decisions and I will OOM, 
but sshd and any other task that was unlucky enough to be scheduled 
before me will already be dead.  That's broken behavior, but we have 
enabled it through the oom_kill_asking_task flag on a per-cpuset basis 
for special situations where killing current would be acceptable.
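
That sequence can be shown with a toy user-space simulation 
(illustrative only; the capacity, task mix and allocation order are all 
invented): a leaking hog nearly fills the node and small daemons 
trigger the OOM.  Killing the asking task shoots an innocent daemon 
before the hog asks and dies itself; selecting the biggest task kills 
only the hog.

```c
#include <assert.h>
#include <string.h>

#define NTASKS 4

/* Toy model of tasks in one memory-constrained node (not kernel code). */
struct sim_task {
	const char *comm;
	long rss;    /* memory currently held */
	long want;   /* size of each allocation request */
	int dead;
};

/* The select_bad_process() analogue: index of the biggest live task. */
static int biggest(const struct sim_task *t)
{
	int best = -1;
	for (int i = 0; i < NTASKS; i++)
		if (!t[i].dead && (best < 0 || t[i].rss > t[best].rss))
			best = i;
	return best;
}

/* Run allocations until the hog dies; count how many others died too. */
static int innocent_kills(int kill_asking_task)
{
	struct sim_task t[NTASKS] = {
		{ "hog",      900, 50, 0 },  /* the leaker: keeps asking */
		{ "sshd",       5, 10, 0 },
		{ "sendmail",   5, 10, 0 },
		{ "cron",       5, 10, 0 },
	};
	long capacity = 1000, used = 915;
	int killed = 0;

	/* Daemons take turns asking each round; the hog asks last. */
	for (int round = 0; round < 16 && !t[0].dead; round++) {
		for (int i = NTASKS - 1; i >= 0; i--) {
			if (t[i].dead)
				continue;
			if (used + t[i].want <= capacity) {
				used += t[i].want;
				t[i].rss += t[i].want;
				continue;
			}
			/* OOM: choose a victim according to policy. */
			int v = kill_asking_task ? i : biggest(t);
			t[v].dead = 1;
			used -= t[v].rss;
			if (strcmp(t[v].comm, "hog") != 0)
				killed++;
		}
	}
	return killed;
}
```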

		David

* Re: [patch 1/4] oom: extract deadlock helper function
  2007-06-27 14:44 [patch 1/4] oom: extract deadlock helper function David Rientjes
  2007-06-27 14:44 ` [patch 2/4] oom: select process to kill for cpusets David Rientjes
@ 2007-07-26  6:15 ` David Rientjes
  2007-07-26  6:25   ` Andrew Morton
  1 sibling, 1 reply; 25+ messages in thread
From: David Rientjes @ 2007-07-26  6:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

On Wed, 27 Jun 2007, David Rientjes wrote:

> Extracts the jiffies comparison operation, the assignment of the
> last_tif_memdie actual, and diagnostic message to its own function.
> 

Andrew, can you give me an update on where the patchsets from Andrea and 
me for the OOM killer stand for inclusion in -mm?  Andrea's were posted 
June 8-13 and mine were posted June 27-28 to linux-mm.

* Re: [patch 1/4] oom: extract deadlock helper function
  2007-07-26  6:15 ` [patch 1/4] oom: extract deadlock helper function David Rientjes
@ 2007-07-26  6:25   ` Andrew Morton
  2007-07-26  7:29     ` David Rientjes
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2007-07-26  6:25 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrea Arcangeli, linux-mm

On Wed, 25 Jul 2007 23:15:10 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> On Wed, 27 Jun 2007, David Rientjes wrote:
> 
> > Extracts the jiffies comparison operation, the assignment of the
> > last_tif_memdie actual, and diagnostic message to its own function.
> > 
> 
> Andrew, can you give me an update on where Andrea's and my patchsets for 
> the OOM killer stand for inclusion in -mm?  Andrea's were posted June 
> 8-13 and mine were posted June 27-28 to linux-mm.

Not forgotten about, but backlogged, and things are arriving at the same
rate as they are being dispatched, so no progress is being made.

(I could merge stuff faster, but my typing speed isn't the rate-limiting
factor here.  We're limited by our ability to review, integrate, test and
debug stuff).

If you have time, what would help heaps would be if you could adopt
Andrea's patches and then maintain those and yours as a single coherent
patchset.  Refresh, retest and squirt them all at me?

It'll take a lot of work to test those things, and I haven't started to
think about how we set about that.



* Re: [patch 1/4] oom: extract deadlock helper function
  2007-07-26  6:25   ` Andrew Morton
@ 2007-07-26  7:29     ` David Rientjes
  0 siblings, 0 replies; 25+ messages in thread
From: David Rientjes @ 2007-07-26  7:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm

On Wed, 25 Jul 2007, Andrew Morton wrote:

> If you have time, what would help heaps would be if you could adopt
> Andrea's patches and then maintain those and yours as a single coherent
> patchset.  Refresh, retest and squirt them all at me?
> 
> It'll take a lot of work to test those things, and I haven't started to
> think about how we set about that.
> 

Sure, I'll actively test all the patches collectively as much as possible 
and then run them by the community again.  Thanks for the update.



end of thread, other threads:[~2007-07-26  7:29 UTC | newest]

Thread overview: 25+ messages
2007-06-27 14:44 [patch 1/4] oom: extract deadlock helper function David Rientjes
2007-06-27 14:44 ` [patch 2/4] oom: select process to kill for cpusets David Rientjes
2007-06-27 14:44   ` [patch 3/4] oom: extract select helper function David Rientjes
2007-06-27 14:44     ` [patch 4/4] oom: serialize for cpusets David Rientjes
2007-06-27 21:53       ` Christoph Lameter
2007-06-27 22:13         ` Paul Jackson
2007-06-28  6:24           ` David Rientjes
2007-06-28  7:33             ` Paul Jackson
2007-06-28  8:05               ` David Rientjes
2007-06-28  9:03                 ` Paul Jackson
2007-06-28 18:13                   ` David Rientjes
2007-06-28 18:55                     ` Paul Jackson
2007-06-28 19:27                       ` Paul Menage
2007-06-28 20:15                         ` Paul Jackson
2007-06-28 20:43                       ` David Rientjes
2007-06-29  1:33                     ` Christoph Lameter
2007-06-29  4:07                       ` David Rientjes
2007-06-28  0:26         ` Andrea Arcangeli
2007-06-28 20:41       ` [patch 5/4] oom: add oom_kill_asking_task flag David Rientjes
2007-06-28 22:07         ` Paul Jackson
2007-06-27 21:52   ` [patch 2/4] oom: select process to kill for cpusets Christoph Lameter
2007-06-28  6:13     ` David Rientjes
2007-07-26  6:15 ` [patch 1/4] oom: extract deadlock helper function David Rientjes
2007-07-26  6:25   ` Andrew Morton
2007-07-26  7:29     ` David Rientjes
