[patch -mm] memcg: make oom killer a no-op when no killable task can be found

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch -mm] memcg: make oom killer a no-op when no killable task can be found
@ 2010-05-04 23:51 David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2010-05-04 23:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

It's pointless to try to kill current if select_bad_process() did not
find an eligible task to kill in mem_cgroup_out_of_memory() since it's
guaranteed that current is a member of the memcg that is oom and it is,
by definition, unkillable.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -512,12 +512,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
-	if (PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (!p)
-		p = current;
-
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
@ 2010-03-28 14:55 anfei
  2010-03-28 16:28 ` Oleg Nesterov
  0 siblings, 1 reply; 29+ messages in thread
From: anfei @ 2010-03-28 14:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, rientjes, kosaki.motohiro, nishimura,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Fri, Mar 26, 2010 at 11:33:56PM +0100, Oleg Nesterov wrote:
> On 03/26, Andrew Morton wrote:
> >
> > On Thu, 25 Mar 2010 00:25:05 +0800
> > Anfei Zhou <anfei.zhou@gmail.com> wrote:
> >
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> > >   */
> > >  static void __oom_kill_task(struct task_struct *p, int verbose)
> > >  {
> > > +	struct task_struct *t;
> > > +
> > >  	if (is_global_init(p)) {
> > >  		WARN_ON(1);
> > >  		printk(KERN_WARNING "tried to kill init!\n");
> > > @@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> > >  	 */
> > >  	p->rt.time_slice = HZ;
> > >  	set_tsk_thread_flag(p, TIF_MEMDIE);
> > > +	for (t = next_thread(p); t != p; t = next_thread(t))
> > > +		set_tsk_thread_flag(t, TIF_MEMDIE);
> > >
> > >  	force_sig(SIGKILL, p);
> >
> > Don't we need some sort of locking while walking that ring?
> 
> This should be always called under tasklist_lock, I think.
> At least this seems to be true in Linus's tree.
> 
Yes, this function is always called with read_lock(&tasklist_lock), so
it should be okay.

> I'd suggest to do
> 
> 	- set_tsk_thread_flag(p, TIF_MEMDIE);
> 	+ t = p;
> 	+ do {
> 	+	 set_tsk_thread_flag(t, TIF_MEMDIE);
> 	+ } while_each_thread(p, t);
> 
> but this is matter of taste.
> 
Yes, this is better.

> Off-topic, but we shouldn't use force_sig(), SIGKILL doesn't
> need "force" semantics.
> 
This may need a dedicated patch, there are some other places to
force_sig(SIGKILL, ...) too.

> I'd wish I could understand the changelog ;)
> 
Assume thread A and B are in the same group.  If A runs into the oom,
and selects B as the victim, B won't exit because at least in exit_mm(),
it can not get the mm->mmap_sem semaphore which A has already got.  So
no memory is freed, and no other task will be selected to kill.

I formatted the patch for -mm tree as David suggested.


---
 mm/oom_kill.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -418,8 +418,15 @@ static void dump_header(struct task_stru
  */
 static void __oom_kill_task(struct task_struct *p)
 {
+	struct task_struct *t;
+
 	p->rt.time_slice = HZ;
-	set_tsk_thread_flag(p, TIF_MEMDIE);
+
+	t = p;
+	do {
+		set_tsk_thread_flag(t, TIF_MEMDIE);
+	} while_each_thread(p, t);
+
 	force_sig(SIGKILL, p);
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-28 14:55 [PATCH] oom killer: break from infinite loop anfei
@ 2010-03-28 16:28 ` Oleg Nesterov
  2010-03-28 21:21   ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Oleg Nesterov @ 2010-03-28 16:28 UTC (permalink / raw)
  To: anfei
  Cc: Andrew Morton, rientjes, kosaki.motohiro, nishimura,
	kamezawa.hiroyu, linux-mm, linux-kernel

On 03/28, anfei wrote:
>
> On Fri, Mar 26, 2010 at 11:33:56PM +0100, Oleg Nesterov wrote:
>
> > Off-topic, but we shouldn't use force_sig(), SIGKILL doesn't
> > need "force" semantics.
> >
> This may need a dedicated patch, there are some other places to
> force_sig(SIGKILL, ...) too.

Yes, yes, sure.

> > I'd wish I could understand the changelog ;)
> >
> Assume thread A and B are in the same group.  If A runs into the oom,
> and selects B as the victim, B won't exit because at least in exit_mm(),
> it can not get the mm->mmap_sem semaphore which A has already got.

I see. But still I can't understand. To me, the problem is not that
B can't exit, the problem is that A doesn't know it should exit. All
threads should exit and free ->mm. Even if B could exit, this is not
enough. And, to some extent, it doesn't matter if it holds mmap_sem
or not.

Don't get me wrong. Even if I don't understand oom_kill.c the patch
looks obviously good to me, even from "common sense" pov. I am just
curious.

So, my understanding is: we are going to kill the whole thread group
but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
so that any thread can notice this flag and (say, __alloc_pages_slowpath)
fail asap.

Is my understanding correct?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-28 16:28 ` Oleg Nesterov
@ 2010-03-28 21:21   ` David Rientjes
  2010-03-29 14:06     ` anfei
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-03-28 21:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: anfei, Andrew Morton, KOSAKI Motohiro, nishimura,
	KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel

On Sun, 28 Mar 2010, Oleg Nesterov wrote:

> I see. But still I can't understand. To me, the problem is not that
> B can't exit, the problem is that A doesn't know it should exit. All
> threads should exit and free ->mm. Even if B could exit, this is not
> enough. And, to some extent, it doesn't matter if it holds mmap_sem
> or not.
> 
> Don't get me wrong. Even if I don't understand oom_kill.c the patch
> looks obviously good to me, even from "common sense" pov. I am just
> curious.
> 
> So, my understanding is: we are going to kill the whole thread group
> but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> fail asap.
> 
> Is my understanding correct?
> 

[Adding Mel Gorman <mel@csn.ul.ie> to the cc]

The problem with this approach is that we could easily deplete all memory 
reserves if the oom killed task has an extremely large number of threads, 
there has always been only a single thread with TIF_MEMDIE set per cpuset 
or memcg; for systems that don't run with cpusets or memory controller,
this has been limited to one thread with TIF_MEMDIE for the entire system.

There's risk involved with suddenly allowing 1000 threads to have 
TIF_MEMDIE set and the chances of fully depleting all allowed zones is 
much higher if they allocate memory prior to exit, for example.

An alternative is to fail allocations if they are failable and the 
allocating task has a pending SIGKILL.  It's better to preempt the oom 
killer since current is going to be exiting anyway and this avoids a 
needless kill.

That's possible if it's guaranteed that __GFP_NOFAIL allocations with a 
pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from 
endlessly looping while making no progress.

Comments?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1610,13 +1610,21 @@ try_next_zone:
 }
 
 static inline int
-should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
 				unsigned long pages_reclaimed)
 {
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
 		return 0;
 
+	/* Loop if specifically requested */
+	if (gfp_mask & __GFP_NOFAIL)
+		return 1;
+
+	/* Task is killed, fail the allocation if possible */
+	if (fatal_signal_pending(p))
+		return 0;
+
 	/*
 	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
 	 * means __GFP_NOFAIL, but that may not be true in other
@@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
 		return 1;
 
-	/*
-	 * Don't let big-order allocations loop unless the caller
-	 * explicitly requests that.
-	 */
-	if (gfp_mask & __GFP_NOFAIL)
-		return 1;
-
 	return 0;
 }
 
@@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
 		if (!in_interrupt() &&
 		    ((p->flags & PF_MEMALLOC) ||
+		     (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
 		     unlikely(test_thread_flag(TIF_MEMDIE))))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
@@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
 	struct page *page = NULL;
 	int alloc_flags;
 	unsigned long pages_reclaimed = 0;
@@ -1876,7 +1879,7 @@ rebalance:
 		goto nopage;
 
 	/* Avoid allocations with no watermarks from looping endlessly */
-	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+	if (test_thread_flag(TIF_MEMDIE) && !nofail)
 		goto nopage;
 
 	/* Try direct reclaim and then allocating */
@@ -1888,6 +1891,10 @@ rebalance:
 	if (page)
 		goto got_pg;
 
+	/* Task is killed, fail the allocation if possible */
+	if (fatal_signal_pending(p) && !nofail)
+		goto nopage;
+
 	/*
 	 * If we failed to make any progress reclaiming, then we are
 	 * running out of options and have to consider going OOM
@@ -1909,8 +1916,7 @@ rebalance:
 			 * made, there are no other options and retrying is
 			 * unlikely to help.
 			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER &&
-						!(gfp_mask & __GFP_NOFAIL))
+			if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
 				goto nopage;
 
 			goto restart;
@@ -1919,7 +1925,7 @@ rebalance:
 
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
-	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+	if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
 		goto rebalance;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-28 21:21   ` David Rientjes
@ 2010-03-29 14:06     ` anfei
  2010-03-29 20:01       ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: anfei @ 2010-03-29 14:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
	KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel

On Sun, Mar 28, 2010 at 02:21:01PM -0700, David Rientjes wrote:
> On Sun, 28 Mar 2010, Oleg Nesterov wrote:
> 
> > I see. But still I can't understand. To me, the problem is not that
> > B can't exit, the problem is that A doesn't know it should exit. All
> > threads should exit and free ->mm. Even if B could exit, this is not
> > enough. And, to some extent, it doesn't matter if it holds mmap_sem
> > or not.
> > 
> > Don't get me wrong. Even if I don't understand oom_kill.c the patch
> > looks obviously good to me, even from "common sense" pov. I am just
> > curious.
> > 
> > So, my understanding is: we are going to kill the whole thread group
> > but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> > so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> > fail asap.
> > 
> > Is my understanding correct?
> > 
> 
> [Adding Mel Gorman <mel@csn.ul.ie> to the cc]
> 
> The problem with this approach is that we could easily deplete all memory 
> reserves if the oom killed task has an extremely large number of threads, 
> there has always been only a single thread with TIF_MEMDIE set per cpuset 
> or memcg; for systems that don't run with cpusets or memory controller,
> this has been limited to one thread with TIF_MEMDIE for the entire system.
> 
> There's risk involved with suddenly allowing 1000 threads to have 
> TIF_MEMDIE set and the chances of fully depleting all allowed zones is 
> much higher if they allocate memory prior to exit, for example.
> 
> An alternative is to fail allocations if they are failable and the 
> allocating task has a pending SIGKILL.  It's better to preempt the oom 
> killer since current is going to be exiting anyway and this avoids a 
> needless kill.
> 
I think this method is okay, but it's easy to trigger another bug of
oom.  See select_bad_process():
	if (!p->mm)
		continue;
!p->mm is not always an unaccepted condition.  e.g. "p" is killed and
doing exit, setting tsk->mm to NULL is before releasing the memory.
And in multi threading environment, this happens much more.
In __out_of_memory(), it panics if select_bad_process returns NULL.
The simple way to fix it is as mem_cgroup_out_of_memory() does.

So I think both of these 2 patches are needed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index afeab2a..9aae208 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -588,12 +588,8 @@ retry:
 	if (PTR_ERR(p) == -1UL)
 		return;
 
-	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!p) {
-		read_unlock(&tasklist_lock);
-		dump_header(NULL, gfp_mask, order, NULL);
-		panic("Out of memory and no killable processes...\n");
-	}
+	if (!p)
+		p = current;
 
 	if (oom_kill_process(p, gfp_mask, order, points, NULL,
 			     "Out of memory"))


> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a 
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from 
> endlessly looping while making no progress.
> 
> Comments?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
>  }
>  
>  static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
>  				unsigned long pages_reclaimed)
>  {
>  	/* Do not loop if specifically requested */
>  	if (gfp_mask & __GFP_NORETRY)
>  		return 0;
>  
> +	/* Loop if specifically requested */
> +	if (gfp_mask & __GFP_NOFAIL)
> +		return 1;
> +
> +	/* Task is killed, fail the allocation if possible */
> +	if (fatal_signal_pending(p))
> +		return 0;
> +
>  	/*
>  	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>  	 * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
>  		return 1;
>  
> -	/*
> -	 * Don't let big-order allocations loop unless the caller
> -	 * explicitly requests that.
> -	 */
> -	if (gfp_mask & __GFP_NOFAIL)
> -		return 1;
> -
>  	return 0;
>  }
>  
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
>  		if (!in_interrupt() &&
>  		    ((p->flags & PF_MEMALLOC) ||
> +		     (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
>  		     unlikely(test_thread_flag(TIF_MEMDIE))))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
>  	}
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int migratetype)
>  {
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
>  	struct page *page = NULL;
>  	int alloc_flags;
>  	unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
>  		goto nopage;
>  
>  	/* Avoid allocations with no watermarks from looping endlessly */
> -	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> +	if (test_thread_flag(TIF_MEMDIE) && !nofail)
>  		goto nopage;
>  
>  	/* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
>  	if (page)
>  		goto got_pg;
>  
> +	/* Task is killed, fail the allocation if possible */
> +	if (fatal_signal_pending(p) && !nofail)
> +		goto nopage;
> +
>  	/*
>  	 * If we failed to make any progress reclaiming, then we are
>  	 * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
>  			 * made, there are no other options and retrying is
>  			 * unlikely to help.
>  			 */
> -			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> -						!(gfp_mask & __GFP_NOFAIL))
> +			if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
>  				goto nopage;
>  
>  			goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>  
>  	/* Check if we should retry the allocation */
>  	pages_reclaimed += did_some_progress;
> -	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> +	if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
>  		/* Wait for some write requests to complete then retry */
>  		congestion_wait(BLK_RW_ASYNC, HZ/50);
>  		goto rebalance;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-29 14:06     ` anfei
@ 2010-03-29 20:01       ` David Rientjes
  2010-03-30 14:29         ` anfei
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-03-29 20:01 UTC (permalink / raw)
  To: anfei
  Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
	KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel

On Mon, 29 Mar 2010, anfei wrote:

> I think this method is okay, but it's easy to trigger another bug of
> oom.  See select_bad_process():
> 	if (!p->mm)
> 		continue;
> !p->mm is not always an unaccepted condition.  e.g. "p" is killed and
> doing exit, setting tsk->mm to NULL is before releasing the memory.
> And in multi threading environment, this happens much more.
> In __out_of_memory(), it panics if select_bad_process returns NULL.
> The simple way to fix it is as mem_cgroup_out_of_memory() does.
> 

This is fixed by 
oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch in 
the -mm tree.

See 
http://userweb.kernel.org/~akpm/mmotm/broken-out/oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index afeab2a..9aae208 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -588,12 +588,8 @@ retry:
>  	if (PTR_ERR(p) == -1UL)
>  		return;
>  
> -	/* Found nothing?!?! Either we hang forever, or we panic. */
> -	if (!p) {
> -		read_unlock(&tasklist_lock);
> -		dump_header(NULL, gfp_mask, order, NULL);
> -		panic("Out of memory and no killable processes...\n");
> -	}
> +	if (!p)
> +		p = current;
>  
>  	if (oom_kill_process(p, gfp_mask, order, points, NULL,
>  			     "Out of memory"))

The reason p wasn't selected is because it fails to meet the criteria for 
candidacy in select_bad_process(), not necessarily because of a race with 
the !p->mm check that the -mm patch cited above fixes.  It's quite 
possible that current has an oom_adj value of OOM_DISABLE, for example, 
where this would be wrong.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-29 20:01       ` David Rientjes
@ 2010-03-30 14:29         ` anfei
  2010-03-30 20:29           ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: anfei @ 2010-03-30 14:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
	KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel

On Mon, Mar 29, 2010 at 01:01:58PM -0700, David Rientjes wrote:
> On Mon, 29 Mar 2010, anfei wrote:
> 
> > I think this method is okay, but it's easy to trigger another bug of
> > oom.  See select_bad_process():
> > 	if (!p->mm)
> > 		continue;
> > !p->mm is not always an unaccepted condition.  e.g. "p" is killed and
> > doing exit, setting tsk->mm to NULL is before releasing the memory.
> > And in multi threading environment, this happens much more.
> > In __out_of_memory(), it panics if select_bad_process returns NULL.
> > The simple way to fix it is as mem_cgroup_out_of_memory() does.
> > 
> 
> This is fixed by 
> oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch in 
> the -mm tree.
> 
> See 
> http://userweb.kernel.org/~akpm/mmotm/broken-out/oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index afeab2a..9aae208 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -588,12 +588,8 @@ retry:
> >  	if (PTR_ERR(p) == -1UL)
> >  		return;
> >  
> > -	/* Found nothing?!?! Either we hang forever, or we panic. */
> > -	if (!p) {
> > -		read_unlock(&tasklist_lock);
> > -		dump_header(NULL, gfp_mask, order, NULL);
> > -		panic("Out of memory and no killable processes...\n");
> > -	}
> > +	if (!p)
> > +		p = current;
> >  
> >  	if (oom_kill_process(p, gfp_mask, order, points, NULL,
> >  			     "Out of memory"))
> 
> The reason p wasn't selected is because it fails to meet the criteria for 
> candidacy in select_bad_process(), not necessarily because of a race with 
> the !p->mm check that the -mm patch cited above fixes.  It's quite 
> possible that current has an oom_adj value of OOM_DISABLE, for example, 
> where this would be wrong.

I see.  And what about changing mem_cgroup_out_of_memory() too?


diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0cb1ca4..9e89a29 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -510,8 +510,10 @@ retry:
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (!p)
-		p = current;
+	if (!p) {
+		read_unlock(&tasklist_lock);
+		panic("Out of memory and no killable processes...\n");
+	}
 
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-30 14:29         ` anfei
@ 2010-03-30 20:29           ` David Rientjes
  2010-03-31  0:57             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-03-30 20:29 UTC (permalink / raw)
  To: anfei, KAMEZAWA Hiroyuki
  Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
	Mel Gorman, linux-mm, linux-kernel

On Tue, 30 Mar 2010, anfei wrote:

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index afeab2a..9aae208 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -588,12 +588,8 @@ retry:
> > >  	if (PTR_ERR(p) == -1UL)
> > >  		return;
> > >  
> > > -	/* Found nothing?!?! Either we hang forever, or we panic. */
> > > -	if (!p) {
> > > -		read_unlock(&tasklist_lock);
> > > -		dump_header(NULL, gfp_mask, order, NULL);
> > > -		panic("Out of memory and no killable processes...\n");
> > > -	}
> > > +	if (!p)
> > > +		p = current;
> > >  
> > >  	if (oom_kill_process(p, gfp_mask, order, points, NULL,
> > >  			     "Out of memory"))
> > 
> > The reason p wasn't selected is because it fails to meet the criteria for 
> > candidacy in select_bad_process(), not necessarily because of a race with 
> > the !p->mm check that the -mm patch cited above fixes.  It's quite 
> > possible that current has an oom_adj value of OOM_DISABLE, for example, 
> > where this would be wrong.
> 
> I see.  And what about changing mem_cgroup_out_of_memory() too?
> 

The memory controller is different because it must kill a task even if 
another task is exiting since the imposed limit has been reached.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0cb1ca4..9e89a29 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -510,8 +510,10 @@ retry:
>  	if (PTR_ERR(p) == -1UL)
>  		goto out;
>  
> -	if (!p)
> -		p = current;
> +	if (!p) {
> +		read_unlock(&tasklist_lock);
> +		panic("Out of memory and no killable processes...\n");
> +	}
>  
>  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
>  				"Memory cgroup out of memory"))
> 

This actually does appear to be necessary but for a different reason: if 
current is unkillable because it has OOM_DISABLE, for example, then 
oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory() 
will infinitely loop.

Kame-san?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-30 20:29           ` David Rientjes
@ 2010-03-31  0:57             ` KAMEZAWA Hiroyuki
  2010-03-31  6:07               ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31  0:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: anfei, Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
	Mel Gorman, linux-mm, linux-kernel

On Tue, 30 Mar 2010 13:29:29 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0cb1ca4..9e89a29 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -510,8 +510,10 @@ retry:
> >  	if (PTR_ERR(p) == -1UL)
> >  		goto out;
> >  
> > -	if (!p)
> > -		p = current;
> > +	if (!p) {
> > +		read_unlock(&tasklist_lock);
> > +		panic("Out of memory and no killable processes...\n");
> > +	}
> >  
> >  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> >  				"Memory cgroup out of memory"))
> > 
> 
> This actually does appear to be necessary but for a different reason: if 
> current is unkillable because it has OOM_DISABLE, for example, then 
> oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory() 
> will infinitely loop.
> 
> Kame-san?
> 

When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
we can do nothing. (we can't panic because container's death != system death.)

Because memcg itself has mutex+waitqueue for mutual execusion of OOM killer, 
I think infinite-loop will not be critical probelm for the whole system.

And, now, memcg has oom-kill-disable + oom-kill-notifier features.
So, If a memcg goes into OOM and there is no killable process, but oom-kill is
not disabled by memcg.....it means system admin's mis-configuraton.

He can stop inifite loop by hand, anyway.
# echo 1 > ..../group_A/memory.oom_control

Thanks,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-31  0:57             ` KAMEZAWA Hiroyuki
@ 2010-03-31  6:07               ` David Rientjes
  2010-03-31  6:13                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-03-31  6:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: anfei, Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
	Mel Gorman, linux-mm, linux-kernel

On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:

> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 0cb1ca4..9e89a29 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -510,8 +510,10 @@ retry:
> > >  	if (PTR_ERR(p) == -1UL)
> > >  		goto out;
> > >  
> > > -	if (!p)
> > > -		p = current;
> > > +	if (!p) {
> > > +		read_unlock(&tasklist_lock);
> > > +		panic("Out of memory and no killable processes...\n");
> > > +	}
> > >  
> > >  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > >  				"Memory cgroup out of memory"))
> > > 
> > 
> > This actually does appear to be necessary but for a different reason: if 
> > current is unkillable because it has OOM_DISABLE, for example, then 
> > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory() 
> > will infinitely loop.
> > 
> > Kame-san?
> > 
> 
> When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> we can do nothing. (we can't panic because container's death != system death.)
> 
> Because memcg itself has mutex+waitqueue for mutual execusion of OOM killer, 
> I think infinite-loop will not be critical probelm for the whole system.
> 
> And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> So, If a memcg goes into OOM and there is no killable process, but oom-kill is
> not disabled by memcg.....it means system admin's mis-configuraton.
> 
> He can stop inifite loop by hand, anyway.
> # echo 1 > ..../group_A/memory.oom_control
> 

Then we should be able to do this since current is by definition 
unkillable since it was not found in select_bad_process(), right?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
-	if (PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (!p)
-		p = current;
-
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-31  6:07               ` David Rientjes
@ 2010-03-31  6:13                 ` KAMEZAWA Hiroyuki
  2010-03-31  6:30                   ` Balbir Singh
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31  6:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: anfei, Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
	Mel Gorman, linux-mm, linux-kernel, balbir

On Tue, 30 Mar 2010 23:07:08 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 0cb1ca4..9e89a29 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -510,8 +510,10 @@ retry:
> > > >  	if (PTR_ERR(p) == -1UL)
> > > >  		goto out;
> > > >  
> > > > -	if (!p)
> > > > -		p = current;
> > > > +	if (!p) {
> > > > +		read_unlock(&tasklist_lock);
> > > > +		panic("Out of memory and no killable processes...\n");
> > > > +	}
> > > >  
> > > >  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > >  				"Memory cgroup out of memory"))
> > > > 
> > > 
> > > This actually does appear to be necessary but for a different reason: if 
> > > current is unkillable because it has OOM_DISABLE, for example, then 
> > > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory() 
> > > will infinitely loop.
> > > 
> > > Kame-san?
> > > 
> > 
> > When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> > we can do nothing. (we can't panic because container's death != system death.)
> > 
> > Because memcg itself has mutex+waitqueue for mutual execusion of OOM killer, 
> > I think infinite-loop will not be critical probelm for the whole system.
> > 
> > And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> > So, If a memcg goes into OOM and there is no killable process, but oom-kill is
> > not disabled by memcg.....it means system admin's mis-configuraton.
> > 
> > He can stop inifite loop by hand, anyway.
> > # echo 1 > ..../group_A/memory.oom_control
> > 
> 
> Then we should be able to do this since current is by definition 
> unkillable since it was not found in select_bad_process(), right?

To me, this patch is acceptable and seems reasnoable.

But I didn't joined to memcg development when this check was added
and don't know why kill current..

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e8176704bfac0729875fa62798037584d

Addinc Balbir to CC. Maybe situation is changed now.
Because we can stop inifinite loop (by hand) and there is no rushing oom-kill
callers, this change is acceptable.

Thanks,
-Kame



> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  	read_lock(&tasklist_lock);
>  retry:
>  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> -	if (PTR_ERR(p) == -1UL)
> +	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> -	if (!p)
> -		p = current;
> -
>  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
>  				"Memory cgroup out of memory"))
>  		goto retry;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-31  6:13                 ` KAMEZAWA Hiroyuki
@ 2010-03-31  6:30                   ` Balbir Singh
  2010-03-31  6:32                     ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2010-03-31  6:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, anfei, Oleg Nesterov, Andrew Morton,
	KOSAKI Motohiro, nishimura, Mel Gorman, linux-mm, linux-kernel

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-31 15:13:56]:

> On Tue, 30 Mar 2010 23:07:08 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
> > 
> > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > index 0cb1ca4..9e89a29 100644
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -510,8 +510,10 @@ retry:
> > > > >  	if (PTR_ERR(p) == -1UL)
> > > > >  		goto out;
> > > > >  
> > > > > -	if (!p)
> > > > > -		p = current;
> > > > > +	if (!p) {
> > > > > +		read_unlock(&tasklist_lock);
> > > > > +		panic("Out of memory and no killable processes...\n");
> > > > > +	}
> > > > >  
> > > > >  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > > >  				"Memory cgroup out of memory"))
> > > > > 
> > > > 
> > > > This actually does appear to be necessary but for a different reason: if 
> > > > current is unkillable because it has OOM_DISABLE, for example, then 
> > > > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory() 
> > > > will infinitely loop.
> > > > 
> > > > Kame-san?
> > > > 
> > > 
> > > When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> > > we can do nothing. (we can't panic because container's death != system death.)
> > > 
> > > Because memcg itself has mutex+waitqueue for mutual execusion of OOM killer, 
> > > I think infinite-loop will not be critical probelm for the whole system.
> > > 
> > > And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> > > So, If a memcg goes into OOM and there is no killable process, but oom-kill is
> > > not disabled by memcg.....it means system admin's mis-configuraton.
> > > 
> > > He can stop inifite loop by hand, anyway.
> > > # echo 1 > ..../group_A/memory.oom_control
> > > 
> > 
> > Then we should be able to do this since current is by definition 
> > unkillable since it was not found in select_bad_process(), right?
> 
> To me, this patch is acceptable and seems reasnoable.
> 
> But I didn't joined to memcg development when this check was added
> and don't know why kill current..
>

The reason for adding current was that we did not want to loop
forever, since it stops forward progress - no error/no forward
progress. It made sense to oom kill the current process, so that the
cgroup admin could look at what went wrong.
 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e8176704bfac0729875fa62798037584d
> 
> Addinc Balbir to CC. Maybe situation is changed now.
> Because we can stop inifinite loop (by hand) and there is no rushing oom-kill
> callers, this change is acceptable.
>

By hand is not always possible if we have a large number of cgroups
(I've seen a setup with 2000 cgroups on libcgroup ML). 2000 cgroups *
number of processes make the situation complex. I think using OOM
notifier is now another way of handling such a situation.
 
-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] oom killer: break from infinite loop
  2010-03-31  6:30                   ` Balbir Singh
@ 2010-03-31  6:32                     ` David Rientjes
  2010-03-31  7:08                       ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-03-31  6:32 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, anfei, Oleg Nesterov, Andrew Morton,
	KOSAKI Motohiro, nishimura, Mel Gorman, linux-mm, linux-kernel

On Wed, 31 Mar 2010, Balbir Singh wrote:

> > To me, this patch is acceptable and seems reasnoable.
> > 
> > But I didn't joined to memcg development when this check was added
> > and don't know why kill current..
> >
> 
> The reason for adding current was that we did not want to loop
> forever, since it stops forward progress - no error/no forward
> progress. It made sense to oom kill the current process, so that the
> cgroup admin could look at what went wrong.
>  

oom_kill_process() will fail on current since it wasn't selected as an 
eligible task to kill in select_bad_process() and we know it to be a 
member of the memcg, so there's no point in trying to kill it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-03-31  6:32                     ` David Rientjes
@ 2010-03-31  7:08                       ` David Rientjes
  2010-03-31  7:08                         ` KAMEZAWA Hiroyuki
                                           ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: David Rientjes @ 2010-03-31  7:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
	Balbir Singh, linux-mm

It's pointless to try to kill current if select_bad_process() did not
find an eligible task to kill in mem_cgroup_out_of_memory() since it's
guaranteed that current is a member of the memcg that is oom and it is,
by definition, unkillable.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
-	if (PTR_ERR(p) == -1UL)
+	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (!p)
-		p = current;
-
 	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-03-31  7:08                       ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
@ 2010-03-31  7:08                         ` KAMEZAWA Hiroyuki
  2010-03-31  8:04                         ` Balbir Singh
  2010-04-04 23:28                         ` David Rientjes
  2 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31  7:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura, Balbir Singh, linux-mm

On Wed, 31 Mar 2010 00:08:38 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> It's pointless to try to kill current if select_bad_process() did not
> find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> guaranteed that current is a member of the memcg that is oom and it is,
> by definition, unkillable.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Ah, okay. If current is killable, current should be found by select_bad_process.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> ---
>  mm/oom_kill.c |    5 +----
>  1 files changed, 1 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  	read_lock(&tasklist_lock);
>  retry:
>  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> -	if (PTR_ERR(p) == -1UL)
> +	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> -	if (!p)
> -		p = current;
> -
>  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
>  				"Memory cgroup out of memory"))
>  		goto retry;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-03-31  7:08                       ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
  2010-03-31  7:08                         ` KAMEZAWA Hiroyuki
@ 2010-03-31  8:04                         ` Balbir Singh
  2010-03-31 10:38                           ` David Rientjes
  2010-04-04 23:28                         ` David Rientjes
  2 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2010-03-31  8:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro,
	nishimura, linux-mm

* David Rientjes <rientjes@google.com> [2010-03-31 00:08:38]:

> It's pointless to try to kill current if select_bad_process() did not
> find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> guaranteed that current is a member of the memcg that is oom and it is,
> by definition, unkillable.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    5 +----
>  1 files changed, 1 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  	read_lock(&tasklist_lock);
>  retry:
>  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> -	if (PTR_ERR(p) == -1UL)
> +	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;

Should we have a bit fat WAR_ON_ONCE() here?

-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-03-31  8:04                         ` Balbir Singh
@ 2010-03-31 10:38                           ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2010-03-31 10:38 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro,
	nishimura, linux-mm

On Wed, 31 Mar 2010, Balbir Singh wrote:

> > It's pointless to try to kill current if select_bad_process() did not
> > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > guaranteed that current is a member of the memcg that is oom and it is,
> > by definition, unkillable.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/oom_kill.c |    5 +----
> >  1 files changed, 1 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >  	read_lock(&tasklist_lock);
> >  retry:
> >  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > -	if (PTR_ERR(p) == -1UL)
> > +	if (!p || PTR_ERR(p) == -1UL)
> >  		goto out;
> 
> Should we have a bit fat WAR_ON_ONCE() here?
> 

I'm not sure a WARN_ON_ONCE() is going to be too helpful to a sysadmin who 
has misconfigured the memcg here since all it will do is emit the stack 
trace and line number, it's not going to be immediately obvious that this 
is because all tasks in the cgroup are unkillable so he or she should do
echo 1 > /dev/cgroup/blah/memory.oom_control as a remedy.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-03-31  7:08                       ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
  2010-03-31  7:08                         ` KAMEZAWA Hiroyuki
  2010-03-31  8:04                         ` Balbir Singh
@ 2010-04-04 23:28                         ` David Rientjes
  2010-04-05 21:30                           ` Andrew Morton
  2 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-04-04 23:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
	Balbir Singh, linux-mm

On Wed, 31 Mar 2010, David Rientjes wrote:

> It's pointless to try to kill current if select_bad_process() did not
> find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> guaranteed that current is a member of the memcg that is oom and it is,
> by definition, unkillable.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    5 +----
>  1 files changed, 1 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  	read_lock(&tasklist_lock);
>  retry:
>  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> -	if (PTR_ERR(p) == -1UL)
> +	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> -	if (!p)
> -		p = current;
> -
>  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
>  				"Memory cgroup out of memory"))
>  		goto retry;
> 

Are there any objections to merging this?  It's pretty straight-forward 
given the fact that oom_kill_process() would fail if select_bad_process() 
returns NULL even if p is set to current since it was not found to be 
eligible during the tasklist scan.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-04 23:28                         ` David Rientjes
@ 2010-04-05 21:30                           ` Andrew Morton
  2010-04-05 22:40                             ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2010-04-05 21:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
	Balbir Singh, linux-mm

On Sun, 4 Apr 2010 16:28:01 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 31 Mar 2010, David Rientjes wrote:
> 
> > It's pointless to try to kill current if select_bad_process() did not
> > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > guaranteed that current is a member of the memcg that is oom and it is,
> > by definition, unkillable.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/oom_kill.c |    5 +----
> >  1 files changed, 1 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >  	read_lock(&tasklist_lock);
> >  retry:
> >  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > -	if (PTR_ERR(p) == -1UL)
> > +	if (!p || PTR_ERR(p) == -1UL)
> >  		goto out;
> >  
> > -	if (!p)
> > -		p = current;
> > -
> >  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> >  				"Memory cgroup out of memory"))
> >  		goto retry;
> > 
> 
> Are there any objections to merging this?  It's pretty straight-forward 
> given the fact that oom_kill_process() would fail if select_bad_process() 
> returns NULL even if p is set to current since it was not found to be 
> eligible during the tasklist scan.

I've lost the plot on the oom-killer patches.  Half the things I'm
seeing don't even apply.

Perhaps I should drop the lot and we start again.  We still haven't
resolved the procfs back-compat issue, either.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-05 21:30                           ` Andrew Morton
@ 2010-04-05 22:40                             ` David Rientjes
  2010-04-05 22:49                               ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-04-05 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
	Balbir Singh, linux-mm

On Mon, 5 Apr 2010, Andrew Morton wrote:

> > > It's pointless to try to kill current if select_bad_process() did not
> > > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > > guaranteed that current is a member of the memcg that is oom and it is,
> > > by definition, unkillable.
> > > 
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > ---
> > >  mm/oom_kill.c |    5 +----
> > >  1 files changed, 1 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> > >  	read_lock(&tasklist_lock);
> > >  retry:
> > >  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > > -	if (PTR_ERR(p) == -1UL)
> > > +	if (!p || PTR_ERR(p) == -1UL)
> > >  		goto out;
> > >  
> > > -	if (!p)
> > > -		p = current;
> > > -
> > >  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > >  				"Memory cgroup out of memory"))
> > >  		goto retry;
> > > 
> > 
> > Are there any objections to merging this?  It's pretty straight-forward 
> > given the fact that oom_kill_process() would fail if select_bad_process() 
> > returns NULL even if p is set to current since it was not found to be 
> > eligible during the tasklist scan.
> 
> I've lost the plot on the oom-killer patches.  Half the things I'm
> seeing don't even apply.
> 

This patch applies cleanly on mmotm-2010-03-24-14-48 and I don't see 
anything that has been added since then that touches 
mem_cgroup_out_of_memory().

> Perhaps I should drop the lot and we start again.  We still haven't
> resolved the procfs back-compat issue, either.

I haven't seen any outstanding compatibility issues raised.  The only 
thing that isn't backwards compatible is consolidating 
/proc/sys/vm/oom_kill_allocating_task and /proc/sys/vm/oom_dump_tasks into 
/proc/sys/vm/oom_kill_quick.  We can do that because we've enabled 
oom_dump_tasks by default so that systems that use both of these tunables 
need to now disable oom_dump_tasks to avoid the costly tasklist scan.  
Both tunables would then have the same audience, i.e. users would never 
want to enable one without the other, so it's possible to consolidate 
them.

Nobody, to my knowledge, has objected to that reasoning and removing 
dozens of patches from -mm isn't the answer for (yet to be raised) 
questions about a single change.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-05 22:40                             ` David Rientjes
@ 2010-04-05 22:49                               ` Andrew Morton
  2010-04-05 23:01                                 ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2010-04-05 22:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
	Balbir Singh, linux-mm

On Mon, 5 Apr 2010 15:40:27 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 5 Apr 2010, Andrew Morton wrote:
> 
> > > > It's pointless to try to kill current if select_bad_process() did not
> > > > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > > > guaranteed that current is a member of the memcg that is oom and it is,
> > > > by definition, unkillable.
> > > > 
> > > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > > ---
> > > >  mm/oom_kill.c |    5 +----
> > > >  1 files changed, 1 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> > > >  	read_lock(&tasklist_lock);
> > > >  retry:
> > > >  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > > > -	if (PTR_ERR(p) == -1UL)
> > > > +	if (!p || PTR_ERR(p) == -1UL)
> > > >  		goto out;
> > > >  
> > > > -	if (!p)
> > > > -		p = current;
> > > > -
> > > >  	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > >  				"Memory cgroup out of memory"))
> > > >  		goto retry;
> > > > 
> > > 
> > > Are there any objections to merging this?  It's pretty straight-forward 
> > > given the fact that oom_kill_process() would fail if select_bad_process() 
> > > returns NULL even if p is set to current since it was not found to be 
> > > eligible during the tasklist scan.
> > 
> > I've lost the plot on the oom-killer patches.  Half the things I'm
> > seeing don't even apply.
> > 
> 
> This patch applies cleanly on mmotm-2010-03-24-14-48 and I don't see 
> anything that has been added since then that touches 
> mem_cgroup_out_of_memory().

I'm working on another mmotm at present.

> > Perhaps I should drop the lot and we start again.  We still haven't
> > resolved the procfs back-compat issue, either.
> 
> I haven't seen any outstanding compatibility issues raised.  The only 
> thing that isn't backwards compatible is consolidating 
> /proc/sys/vm/oom_kill_allocating_task and /proc/sys/vm/oom_dump_tasks into 
> /proc/sys/vm/oom_kill_quick.  We can do that because we've enabled 
> oom_dump_tasks by default so that systems that use both of these tunables 
> need to now disable oom_dump_tasks to avoid the costly tasklist scan.  

This can break stuff, as I've already described - if a startup tool is
correctly checking its syscall return values and a /procfs file
vanishes, the app may bail out and not work.

Others had other objections, iirc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-05 22:49                               ` Andrew Morton
@ 2010-04-05 23:01                                 ` David Rientjes
  2010-04-06 12:08                                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-04-05 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
	Balbir Singh, linux-mm

On Mon, 5 Apr 2010, Andrew Morton wrote:

> > This patch applies cleanly on mmotm-2010-03-24-14-48 and I don't see 
> > anything that has been added since then that touches 
> > mem_cgroup_out_of_memory().
> 
> I'm working on another mmotm at present.
> 

Nothing else you've merged since mmotm-2010-03-24-14-48 has touched 
mem_cgroup_out_of_memory() that I've been cc'd on.  This patch should 
apply cleanly.

> > I haven't seen any outstanding compatibility issues raised.  The only 
> > thing that isn't backwards compatible is consolidating 
> > /proc/sys/vm/oom_kill_allocating_task and /proc/sys/vm/oom_dump_tasks into 
> > /proc/sys/vm/oom_kill_quick.  We can do that because we've enabled 
> > oom_dump_tasks by default so that systems that use both of these tunables 
> > need to now disable oom_dump_tasks to avoid the costly tasklist scan.  
> 
> This can break stuff, as I've already described - if a startup tool is
> correctly checking its syscall return values and a /procfs file
> vanishes, the app may bail out and not work.
> 

This is not the first time we have changed or obsoleted tunables in 
/proc/sys/vm.  If a startup tool really is really bailing out depending on 
whether echo 1 > /proc/sys/vm/oom_kill_allocating_task succeeds, it should 
be fixed regardless because you're not protecting anything by doing that 
since you can't predict what task is allocating memory at the time of oom.  
Those same startup tools will need to disable /proc/sys/vm/oom_dump_tasks 
if we are to remove the consolidation into oom_kill_quick and maintain two 
seperate VM sysctls that are always used together by the same users.

Nobody can even cite a single example of oom_kill_allocating_task being 
used in practice, yet we want to unnecessarily maintain these two seperate 
sysctls forever because it's possible that a buggy startup tool cares 
about the return value of enabling it?

> Others had other objections, iirc.
> 

I'm all ears.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-05 23:01                                 ` David Rientjes
@ 2010-04-06 12:08                                   ` KOSAKI Motohiro
  2010-04-06 21:47                                     ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2010-04-06 12:08 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
	Balbir Singh, linux-mm

> This is not the first time we have changed or obsoleted tunables in 
> /proc/sys/vm.  If a startup tool really is really bailing out depending on 
> whether echo 1 > /proc/sys/vm/oom_kill_allocating_task succeeds, it should 
> be fixed regardless because you're not protecting anything by doing that 
>
> since you can't predict what task is allocating memory at the time of oom.  
> Those same startup tools will need to disable /proc/sys/vm/oom_dump_tasks 
> if we are to remove the consolidation into oom_kill_quick and maintain two 
> seperate VM sysctls that are always used together by the same users.
>
> Nobody can even cite a single example of oom_kill_allocating_task being 
> used in practice, yet we want to unnecessarily maintain these two seperate 
> sysctls forever because it's possible that a buggy startup tool cares 
> about the return value of enabling it?
> 
> > Others had other objections, iirc.
> > 
> 
> I'm all ears.

Complain.

Many people reviewed these patches, but following four patches got no ack.

oom-badness-heuristic-rewrite.patch
oom-default-to-killing-current-for-pagefault-ooms.patch
oom-deprecate-oom_adj-tunable.patch
oom-replace-sysctls-with-quick-mode.patch

IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
We don't hope join loudly voice contest nor help to making flame. but it
doesn't mean explicit ack.

Andrew, If you really really really hope to merge these, I'm not againt
it anymore. but please put following remark explicitely into the patches.

	Nacked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujistu.com>




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-06 12:08                                   ` KOSAKI Motohiro
@ 2010-04-06 21:47                                     ` David Rientjes
  2010-04-07  0:20                                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-04-06 21:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, anfei, nishimura, Balbir Singh,
	linux-mm

On Tue, 6 Apr 2010, KOSAKI Motohiro wrote:

> Many people reviewed these patches, but following four patches got no ack.
> 
> oom-badness-heuristic-rewrite.patch

Do you have any specific feedback that you could offer on why you decided 
to nack this?

> oom-default-to-killing-current-for-pagefault-ooms.patch

Same, what is the specific concern that you have with this patch?

If you don't believe we should kill current first, could you please submit 
patches for all other architectures like powerpc that already do this as 
their only course of action for VM_FAULT_OOM and then make pagefault oom 
killing consistent amongst architectures?

> oom-deprecate-oom_adj-tunable.patch

Alan had a concern about removing /proc/pid/oom_adj, or redefining it with 
different semantics as I originally did, and then I updated the patchset 
to deprecate the old tunable as Andrew suggested.

My somewhat arbitrary time of removal was approximately 18 months from 
the date of deprecation which would give us 5-6 major kernel releases in 
between.  If you think that's too early of a deadline, then I'd happily 
extend it by 6 months or a year.

Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if 
there's a finer grained alternative available already unless you want 
/proc/pid/oom_adj to actually mean something in which case you'll never be 
able to seperate oom badness scores from bitshifts.  I believe everyone 
agrees that a more understood and finer grained tunable is necessary as 
compared to the current implementation that has very limited functionality 
other than polarizing tasks.

> oom-replace-sysctls-with-quick-mode.patch
> 
> IIRC, alan and nick and I NAKed such patch. everybody explained the reason.

Which patch of the four you listed are you referring to here?

> We don't hope join loudly voice contest nor help to making flame. but it
> doesn't mean explicit ack.
> 

If someone has a concern with a patch and then I reply to it and the reply 
goes unanswered, what exactly does that imply?  Do we want to stop 
development because discussion occurred on a patch yet no rebuttal was 
made that addressed specific points that I raised?

Arguing to keep /proc/pid/oom_kill_allocating_task means that we should 
also not enable /proc/pid/oom_dump_tasks by default since the same systems 
that use the former will need to now disable the latter to avoid costly 
tasklist scans.  So are you suggesting that we should not enable 
oom_dump_tasks like the rewrite does even though it provides very useful 
information to 99.9% (or perhaps 100%) of users to understand the memory 
usage of their tasks because you believe systems out there would flake out 
with the tasklist scan it requires, even though you can't cite a single 
example?

Now instead of not replying to these questions and insisting that your 
nack stand based solely on the fact that you nacked it, please get 
involved in the development process.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-06 21:47                                     ` David Rientjes
@ 2010-04-07  0:20                                       ` KAMEZAWA Hiroyuki
  2010-04-07 13:29                                         ` KOSAKI Motohiro
  2010-04-08 17:36                                         ` David Rientjes
  0 siblings, 2 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-07  0:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, anfei, nishimura, Balbir Singh, linux-mm

On Tue, 6 Apr 2010 14:47:58 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 6 Apr 2010, KOSAKI Motohiro wrote:
> 
> > Many people reviewed these patches, but following four patches got no ack.
> > 
> > oom-badness-heuristic-rewrite.patch
> 
> Do you have any specific feedback that you could offer on why you decided 
> to nack this?
> 

I like this patch. But I think no one can't Ack this because there is no
"correct" answer. At least, this show good behavior on my environment.

> > oom-default-to-killing-current-for-pagefault-ooms.patch
> 
> Same, what is the specific concern that you have with this patch?
> 

I'm not sure about this. Personally, I feel pagefault-out-of-memory only
happens drivers are corrupted. So, I have no much concern on this.

> If you don't believe we should kill current first, could you please submit 
> patches for all other architectures like powerpc that already do this as 
> their only course of action for VM_FAULT_OOM and then make pagefault oom 
> killing consistent amongst architectures?
> 
> > oom-deprecate-oom_adj-tunable.patch
> 
> Alan had a concern about removing /proc/pid/oom_adj, or redefining it with 
> different semantics as I originally did, and then I updated the patchset 
> to deprecate the old tunable as Andrew suggested.
> 
> My somewhat arbitrary time of removal was approximately 18 months from 
> the date of deprecation which would give us 5-6 major kernel releases in 
> between.  If you think that's too early of a deadline, then I'd happily 
> extend it by 6 months or a year.
> 
> Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if 
> there's a finer grained alternative available already unless you want 
> /proc/pid/oom_adj to actually mean something in which case you'll never be 
> able to seperate oom badness scores from bitshifts.  I believe everyone 
> agrees that a more understood and finer grained tunable is necessary as 
> compared to the current implementation that has very limited functionality 
> other than polarizing tasks.
> 

If oom-badness-heuristic-rewrite.patch will go ahead, this should go.
But my concern is administorator has to check all oom_score_adj and
tune it again if he adds more memory to the system.

Now, not-small amount of people use Virtual Machine or Contaienr. So, this
oom_score_adj's sensivity to the size of memory can put admins to hell.

 Assume a host A and B. A has 4G memory, B has 8G memory.
 Here, an applicaton which consumes 2G memory.
 Then, this application's oom_score will be 500 on A, 250 on B.
 To make oom_score 0 by oom_score_adj, admin should set -500 on A, -250 on B.

I think this kind of interface is _bad_. If admin is great and all machines
in the system has the same configuration, this oom_score_adj will work powerfully.
I admit it.
But usually, admin are not great and the system includes irregular hosts.
I hope you add one more magic knob to give admins to show importance of application
independent from system configuration, which can work cooperatively with oom_score_adj.

> > oom-replace-sysctls-with-quick-mode.patch
> > 
> > IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
> 
> Which patch of the four you listed are you referring to here?
> 
replacing used sysctl is bad idea, in general.

I have no _strong_ opinion. I welcome the patch series. But aboves are my concern.
Thank you for your work.

Regards,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-07  0:20                                       ` KAMEZAWA Hiroyuki
@ 2010-04-07 13:29                                         ` KOSAKI Motohiro
  2010-04-08 18:05                                           ` David Rientjes
  2010-04-08 17:36                                         ` David Rientjes
  1 sibling, 1 reply; 29+ messages in thread
From: KOSAKI Motohiro @ 2010-04-07 13:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, anfei, nishimura,
	Balbir Singh, linux-mm

> On Tue, 6 Apr 2010 14:47:58 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Tue, 6 Apr 2010, KOSAKI Motohiro wrote:
> > 
> > > Many people reviewed these patches, but following four patches got no ack.
> > > 
> > > oom-badness-heuristic-rewrite.patch
> > 
> > Do you have any specific feedback that you could offer on why you decided 
> > to nack this?
> > 
> 
> I like this patch. But I think no one can't Ack this because there is no
> "correct" answer. At least, this show good behavior on my environment.

see diffstat. that's perfectly crap, obviously need to make separate patches
individual one. Who can review it?

 Documentation/filesystems/proc.txt |   95 ++++----
 Documentation/sysctl/vm.txt        |   21 +
 fs/proc/base.c                     |   98 ++++++++
 include/linux/memcontrol.h         |    8
 include/linux/oom.h                |   17 +
 include/linux/sched.h              |    3
 kernel/fork.c                      |    1
 kernel/sysctl.c                    |    9
 mm/memcontrol.c                    |   18 +
 mm/oom_kill.c                      |  319 ++++++++++++++-------------
 10 files changed, 404 insertions(+), 185 deletions(-)

additional commets is in below.


> > > oom-default-to-killing-current-for-pagefault-ooms.patch
> > 
> > Same, what is the specific concern that you have with this patch?
> 
> I'm not sure about this. Personally, I feel pagefault-out-of-memory only
> happens drivers are corrupted. So, I have no much concern on this.

If you suggest to revert pagefault_oom itself, it is considerable. but
even though I don't think so.

quote nick's mail

	The thing I should explain is that user interfaces are most important
	for their intended semantics. We don't generally call bugs or oversights
	part of the interface, and they are to be fixed unless some program
	relies on them.

	Nowhere in the vm documentation does it say anything about "pagefault
	ooms", and even in the kernel code, even to mm developers (who mostly
	don't care about oom killer) probably wouldn't immediately think of
	pagefault versus any other type of oom.

	Given that, do you think it is reasonable, when panic_on_oom is set,
	to allow a process to be killed due to oom condition? Or do you think
	that was an oversight of the implementation?

	Regardless of what architectures currently do. Yes there is a
	consistency issue, and it should have been fixed earlier, but the
	consistency issue goes both ways now. Some (the most widely tested
	and used, if that matters) architectures, do it the right way.

So, this patch is purely backstep. it break panic_on_oom.
If anyone post "pagefault_out_of_memory() aware pagefault for ppc" or 
something else architecture, I'm glad and ack it.


> > If you don't believe we should kill current first, could you please submit 
> > patches for all other architectures like powerpc that already do this as 
> > their only course of action for VM_FAULT_OOM and then make pagefault oom 
> > killing consistent amongst architectures?
>
> > 
> > > oom-deprecate-oom_adj-tunable.patch
> > 
> > Alan had a concern about removing /proc/pid/oom_adj, or redefining it with 
> > different semantics as I originally did, and then I updated the patchset 
> > to deprecate the old tunable as Andrew suggested.
> > 
> > My somewhat arbitrary time of removal was approximately 18 months from 
> > the date of deprecation which would give us 5-6 major kernel releases in 
> > between.  If you think that's too early of a deadline, then I'd happily 
> > extend it by 6 months or a year.
> > 
> > Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if 
> > there's a finer grained alternative available already unless you want 
> > /proc/pid/oom_adj to actually mean something in which case you'll never be 
> > able to seperate oom badness scores from bitshifts.  I believe everyone 
> > agrees that a more understood and finer grained tunable is necessary as 
> > compared to the current implementation that has very limited functionality 
> > other than polarizing tasks.

The problem is, oom_adj is one of most widely used knob. it is not only used
admin, but also be used applications. in addition, oom_score_adj is bad interface
and no good to replace oom_adj. kamezawa-san, as following your mentioned.

> If oom-badness-heuristic-rewrite.patch will go ahead, this should go.
> But my concern is administorator has to check all oom_score_adj and
> tune it again if he adds more memory to the system.
> 
> Now, not-small amount of people use Virtual Machine or Contaienr. So, this
> oom_score_adj's sensivity to the size of memory can put admins to hell.
> 
>  Assume a host A and B. A has 4G memory, B has 8G memory.
>  Here, an applicaton which consumes 2G memory.
>  Then, this application's oom_score will be 500 on A, 250 on B.
>  To make oom_score 0 by oom_score_adj, admin should set -500 on A, -250 on B.
> 
> I think this kind of interface is _bad_. If admin is great and all machines
> in the system has the same configuration, this oom_score_adj will work powerfully.
> I admit it.
> But usually, admin are not great and the system includes irregular hosts.
> I hope you add one more magic knob to give admins to show importance of application
> independent from system configuration, which can work cooperatively with oom_score_adj.

agreed. oom_score_adj is completely crap. should gone.
but also following pseudo scaling adjustment is crap too. it don't consider
both page sharing and mlock pages. iow, it never works correctly.


+       points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+                       totalpages;


> 
> > > oom-replace-sysctls-with-quick-mode.patch
> > > 
> > > IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
> > 
> > Which patch of the four you listed are you referring to here?
> > 
> replacing used sysctl is bad idea, in general.
> 
> I have no _strong_ opinion. I welcome the patch series. But aboves are my concern.
> Thank you for your work.

I really hate "that is _inteltional_ regression" crap. now almost developers
ignore a bug report and don't join problem investigate works. I and very few
people does that. (ok, I agree you are in such few developers, thanks)

Why can't we discard it simplely? please don't make crap.


now, sadly, I can imagine why some active developers have prefered to
override ugly code immeditely rather than a code review and dialogue.
I'm feel down that I have to do it.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-07 13:29                                         ` KOSAKI Motohiro
@ 2010-04-08 18:05                                           ` David Rientjes
  2010-04-21 19:17                                             ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-04-08 18:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, anfei, nishimura, Balbir Singh,
	linux-mm

On Wed, 7 Apr 2010, KOSAKI Motohiro wrote:

> > > > oom-badness-heuristic-rewrite.patch
> > > 
> > > Do you have any specific feedback that you could offer on why you decided 
> > > to nack this?
> > > 
> > 
> > I like this patch. But I think no one can't Ack this because there is no
> > "correct" answer. At least, this show good behavior on my environment.
> 
> see diffstat. that's perfectly crap, obviously need to make separate patches
> individual one. Who can review it?
> 
>  Documentation/filesystems/proc.txt |   95 ++++----
>  Documentation/sysctl/vm.txt        |   21 +
>  fs/proc/base.c                     |   98 ++++++++
>  include/linux/memcontrol.h         |    8
>  include/linux/oom.h                |   17 +
>  include/linux/sched.h              |    3
>  kernel/fork.c                      |    1
>  kernel/sysctl.c                    |    9
>  mm/memcontrol.c                    |   18 +
>  mm/oom_kill.c                      |  319 ++++++++++++++-------------
>  10 files changed, 404 insertions(+), 185 deletions(-)
> 
> additional commets is in below.
> 

This specific change cannot be broken down into individual patches as much 
as I'd like to.  It's a complete rewrite of the badness() function and 
requires two new tunables to be introduced, determination of the amount of 
memory available to current, formals being changed around, and 
documentation.

A review tip: the change itself is in the rewrite of the function now 
called oom_badness(), so I recommend applying downloading mmotm and 
reading it there as well as the documentation change.  The remainder of 
the patch fixes up the various callers of that function and isn't 
interesting.

> If you suggest to revert pagefault_oom itself, it is considerable. but
> even though I don't think so.
> 
> quote nick's mail
> 
> 	The thing I should explain is that user interfaces are most important
> 	for their intended semantics. We don't generally call bugs or oversights
> 	part of the interface, and they are to be fixed unless some program
> 	relies on them.
> 

I disagree, I believe the long-standing semantics of user interfaces such 
as panic_on_oom are more important than what the name implies or what it 
was intended for when it was introduced.

> 	Nowhere in the vm documentation does it say anything about "pagefault
> 	ooms", and even in the kernel code, even to mm developers (who mostly
> 	don't care about oom killer) probably wouldn't immediately think of
> 	pagefault versus any other type of oom.
> 
> 	Given that, do you think it is reasonable, when panic_on_oom is set,
> 	to allow a process to be killed due to oom condition? Or do you think
> 	that was an oversight of the implementation?
> 

Users have a well-defined and long-standing method of protecting their 
applications from oom kill and that is OOM_DISABLE.  With my patch, if 
current is unkillable because it is OOM_DISABLE, then we fallback to a 
tasklist scan iff panic_on_oom is unset.

> 	Regardless of what architectures currently do. Yes there is a
> 	consistency issue, and it should have been fixed earlier, but the
> 	consistency issue goes both ways now. Some (the most widely tested
> 	and used, if that matters) architectures, do it the right way.
> 
> So, this patch is purely backstep. it break panic_on_oom.
> If anyone post "pagefault_out_of_memory() aware pagefault for ppc" or 
> something else architecture, I'm glad and ack it.
> 

It's not a backstep, it's making all architectures consistent as it sits 
right now in mmotm.  If someone would like to change all VM_FAULT_OOM 
handlers to do a tasklist scan and not default to killing current, that is 
an extension of this patchset.  Likewise, if we want to ensure 
panic_on_oom is respected even for pagefault ooms, then we need to do that 
on all architectures so that we don't have multiple definitions depending 
on machine type.  The semantics of a sysctl shouldn't depend on the 
architecture and right now it does, so this patch fixes that.  In other 
words: if you want to extend the definition of panic_on_oom, then do so 
completely for all architectures first and then add it to the 
documentation.

> > > > oom-deprecate-oom_adj-tunable.patch
> > > 
> > > Alan had a concern about removing /proc/pid/oom_adj, or redefining it with 
> > > different semantics as I originally did, and then I updated the patchset 
> > > to deprecate the old tunable as Andrew suggested.
> > > 
> > > My somewhat arbitrary time of removal was approximately 18 months from 
> > > the date of deprecation which would give us 5-6 major kernel releases in 
> > > between.  If you think that's too early of a deadline, then I'd happily 
> > > extend it by 6 months or a year.
> > > 
> > > Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if 
> > > there's a finer grained alternative available already unless you want 
> > > /proc/pid/oom_adj to actually mean something in which case you'll never be 
> > > able to seperate oom badness scores from bitshifts.  I believe everyone 
> > > agrees that a more understood and finer grained tunable is necessary as 
> > > compared to the current implementation that has very limited functionality 
> > > other than polarizing tasks.
> 
> The problem is, oom_adj is one of most widely used knob. it is not only used
> admin, but also be used applications. in addition, oom_score_adj is bad interface
> and no good to replace oom_adj. kamezawa-san, as following your mentioned.
> 

oom_adj is retained but deprecated, so I'm not sure what you're suggesting 
here.  Do you think we should instead keep oom_adj forever in parallel 
with oom_score_adj?  It's quite clear that a more powerful, finer-grained 
solution is necessary than what oom_adj provides.  I believe the 
deprecation for 5-6 major kernel releases is enough, but we can certainly 
talk about extending that by a year if you'd like.

Can you elaborate on why you believe oom_score_adj is a bad interface or 
have had problems with it in your personal use?

> agreed. oom_score_adj is completely crap. should gone.
> but also following pseudo scaling adjustment is crap too. it don't consider
> both page sharing and mlock pages. iow, it never works correctly.
> 
> 
> +       points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
> +                       totalpages;
> 

That baseline actually does work much better than total_vm as we've 
discussed multiple times on LKML leading up to the development of this 
series, but if you'd like to propose additional considerations into the 
heuristic, than please do so.

> > > > oom-replace-sysctls-with-quick-mode.patch
> > > > 
> > > > IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
> > > 
> > > Which patch of the four you listed are you referring to here?
> > > 
> > replacing used sysctl is bad idea, in general.
> > 
> > I have no _strong_ opinion. I welcome the patch series. But aboves are my concern.
> > Thank you for your work.
> 
> I really hate "that is _inteltional_ regression" crap. now almost developers
> ignore a bug report and don't join problem investigate works. I and very few
> people does that. (ok, I agree you are in such few developers, thanks)
> 
> Why can't we discard it simplely? please don't make crap.
> 

Perhaps you don't understand.  The users of oom_kill_allocating_task are 
those systems that have extremely large tasklists and so iterating through 
it comes at a substantial cost.  It was originally requested by SGI 
because they preferred an alternative to the tasklist scan used for 
cpuset-constrained ooms and were satisfied with simply killing something 
quickly instead of iterating the tasklist.

This patchset, however, enables oom_dump_tasks by default because it 
provides useful information to the user to understand the memory use of 
their applications so they can hopefully determine why the oom occurred.  
This requires a tasklist scan itself, so those same users of 
oom_kill_allocating_task are no longer protected from that cost by simply 
setting this sysctl.  They must also disable oom_dump_tasks or we're at 
the same efficiency that we were before oom_kill_allocating_task was 
introduced.

Since they must modify their startup scripts, and since the users of both 
of these sysctls are the same and nobody would use one without the other, 
it should be possible to consolidate them into a single sysctl.  If 
additional changes are made to the oom killer in the future, it would then 
be possible to test for this single sysctl, oom_kill_quick, instead 
without introducing additional sysctls and polluting procfs.

Thus, it's completely unnecessary to keep oom_kill_allocating_task and we 
can redefine it for those systems.  What alternatives do you have in mind 
or what part of this logic do you not agree with?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-08 18:05                                           ` David Rientjes
@ 2010-04-21 19:17                                             ` Andrew Morton
  2010-04-21 22:04                                               ` David Rientjes
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Andrew Morton @ 2010-04-21 19:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
	Balbir Singh, linux-mm


fyi, I still consider these patches to be in the "stuck" state.  So we
need to get them unstuck.


Hiroyuki (and anyone else): could you please summarise in the briefest
way possible what your objections are to Daivd's oom-killer changes?

I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
change it we don't change it without warning.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-21 19:17                                             ` Andrew Morton
@ 2010-04-21 22:04                                               ` David Rientjes
  2010-04-22  0:23                                                 ` KAMEZAWA Hiroyuki
  2010-04-22  7:23                                               ` Nick Piggin
  2010-05-04 23:55                                               ` David Rientjes
  2 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-04-21 22:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
	Balbir Singh, linux-mm

On Wed, 21 Apr 2010, Andrew Morton wrote:

> fyi, I still consider these patches to be in the "stuck" state.  So we
> need to get them unstuck.
> 
> 
> Hiroyuki (and anyone else): could you please summarise in the briefest
> way possible what your objections are to Daivd's oom-killer changes?
> 
> I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
> change it we don't change it without warning.
> 

I'm not going to allow a simple cleanup to jeopardize the entire patchset, 
so I can write a patch that readds /proc/sys/vm/oom_kill_allocating_task 
that simply mirrors the setting of /proc/sys/vm/oom_kill_quick and then 
warn about its deprecation.  I don't believe we need to do the same thing 
for the removal of /proc/sys/vm/oom_dump_tasks since that functionality is 
now enabled by default.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-21 22:04                                               ` David Rientjes
@ 2010-04-22  0:23                                                 ` KAMEZAWA Hiroyuki
  2010-04-22  8:34                                                   ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-22  0:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KOSAKI Motohiro, anfei, nishimura, Balbir Singh, linux-mm

On Wed, 21 Apr 2010 15:04:27 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 21 Apr 2010, Andrew Morton wrote:
> 
> > fyi, I still consider these patches to be in the "stuck" state.  So we
> > need to get them unstuck.
> > 
> > 
> > Hiroyuki (and anyone else): could you please summarise in the briefest
> > way possible what your objections are to Daivd's oom-killer changes?
> > 
> > I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
> > change it we don't change it without warning.
> > 
> 
> I'm not going to allow a simple cleanup to jeopardize the entire patchset, 
> so I can write a patch that readds /proc/sys/vm/oom_kill_allocating_task 
> that simply mirrors the setting of /proc/sys/vm/oom_kill_quick and then 
> warn about its deprecation. 

Yeah, I welcome it.

> I don't believe we need to do the same thing 
> for the removal of /proc/sys/vm/oom_dump_tasks since that functionality is 
> now enabled by default.
> 

But *warning* is always apprecieated and will not make the whole patches
too dirty. So, please write one.

BTW, I don't think there is an admin who turns off oom_dump_task..
So, just keeping interface and putting this one to feature-removal-list 
is okay for me if you want to cleanup sysctl possibly.

Talking about myself, I also want to remove/cleanup some interface under memcg
which is rarely used. But I don't do because we have users. And I'll not to
clean up as far as we can maintain it. Then, we have to be careful to add
interfaces.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22  0:23                                                 ` KAMEZAWA Hiroyuki
@ 2010-04-22  8:34                                                   ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2010-04-22  8:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, KOSAKI Motohiro, anfei, nishimura, Balbir Singh, linux-mm

On Thu, 22 Apr 2010, KAMEZAWA Hiroyuki wrote:

> > I'm not going to allow a simple cleanup to jeopardize the entire patchset, 
> > so I can write a patch that readds /proc/sys/vm/oom_kill_allocating_task 
> > that simply mirrors the setting of /proc/sys/vm/oom_kill_quick and then 
> > warn about its deprecation. 
> 
> Yeah, I welcome it.
> 

Ok, good.

> > I don't believe we need to do the same thing 
> > for the removal of /proc/sys/vm/oom_dump_tasks since that functionality is 
> > now enabled by default.
> > 
> 
> But *warning* is always apprecieated and will not make the whole patches
> too dirty. So, please write one.
> 
> BTW, I don't think there is an admin who turns off oom_dump_task..
> So, just keeping interface and putting this one to feature-removal-list 
> is okay for me if you want to cleanup sysctl possibly.
> 

Do we really need to keep oom_dump_tasks around since the result of this 
patchset is that we've enabled it by default?  It seems to me like users 
who now want to disable it (something that nobody is currently doing, it's 
the default in Linus' tree) can simply do

	echo 1 > /proc/sys/vm/oom_kill_quick

instead to both suppress the tasklist scan for the dump and for the target 
selection.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-21 19:17                                             ` Andrew Morton
  2010-04-21 22:04                                               ` David Rientjes
@ 2010-04-22  7:23                                               ` Nick Piggin
  2010-04-22  7:25                                                 ` KAMEZAWA Hiroyuki
  2010-05-04 23:55                                               ` David Rientjes
  2 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2010-04-22  7:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei,
	nishimura, Balbir Singh, linux-mm

On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
> 
> fyi, I still consider these patches to be in the "stuck" state.  So we
> need to get them unstuck.
> 
> 
> Hiroyuki (and anyone else): could you please summarise in the briefest
> way possible what your objections are to Daivd's oom-killer changes?
> 
> I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
> change it we don't change it without warning.

How is this turning into such a big issue? It is totally ridiculous.
It is not even a "cleanup".

Just drop the ABI-changing patches, and I think the rest of them looked
OK, didn't they?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22  7:23                                               ` Nick Piggin
@ 2010-04-22  7:25                                                 ` KAMEZAWA Hiroyuki
  2010-04-22 10:09                                                   ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-22  7:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, anfei, nishimura,
	Balbir Singh, linux-mm

On Thu, 22 Apr 2010 17:23:19 +1000
Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
> > 
> > fyi, I still consider these patches to be in the "stuck" state.  So we
> > need to get them unstuck.
> > 
> > 
> > Hiroyuki (and anyone else): could you please summarise in the briefest
> > way possible what your objections are to Daivd's oom-killer changes?
> > 
> > I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
> > change it we don't change it without warning.
> 
> How is this turning into such a big issue? It is totally ridiculous.
> It is not even a "cleanup".
> 
> Just drop the ABI-changing patches, and I think the rest of them looked
> OK, didn't they?
> 
I agree with you.

-Kame 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22  7:25                                                 ` KAMEZAWA Hiroyuki
@ 2010-04-22 10:09                                                   ` Nick Piggin
  2010-04-22 10:27                                                     ` KAMEZAWA Hiroyuki
  2010-04-22 10:28                                                     ` David Rientjes
  0 siblings, 2 replies; 29+ messages in thread
From: Nick Piggin @ 2010-04-22 10:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, anfei, nishimura,
	Balbir Singh, linux-mm

On Thu, Apr 22, 2010 at 04:25:36PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 Apr 2010 17:23:19 +1000
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
> > > 
> > > fyi, I still consider these patches to be in the "stuck" state.  So we
> > > need to get them unstuck.
> > > 
> > > 
> > > Hiroyuki (and anyone else): could you please summarise in the briefest
> > > way possible what your objections are to Daivd's oom-killer changes?
> > > 
> > > I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
> > > change it we don't change it without warning.
> > 
> > How is this turning into such a big issue? It is totally ridiculous.
> > It is not even a "cleanup".
> > 
> > Just drop the ABI-changing patches, and I think the rest of them looked
> > OK, didn't they?
> > 
> I agree with you.

Oh actually what happened with the pagefault OOM / panic on oom thing?
We were talking around in circles about that too.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22 10:09                                                   ` Nick Piggin
@ 2010-04-22 10:27                                                     ` KAMEZAWA Hiroyuki
  2010-04-22 21:11                                                       ` David Rientjes
  2010-04-22 10:28                                                     ` David Rientjes
  1 sibling, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-22 10:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, anfei, nishimura,
	Balbir Singh, linux-mm

On Thu, 22 Apr 2010 20:09:44 +1000
Nick Piggin <npiggin@suse.de> wrote:

> On Thu, Apr 22, 2010 at 04:25:36PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 22 Apr 2010 17:23:19 +1000
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
> > > > 
> > > > fyi, I still consider these patches to be in the "stuck" state.  So we
> > > > need to get them unstuck.
> > > > 
> > > > 
> > > > Hiroyuki (and anyone else): could you please summarise in the briefest
> > > > way possible what your objections are to Daivd's oom-killer changes?
> > > > 
> > > > I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
> > > > change it we don't change it without warning.
> > > 
> > > How is this turning into such a big issue? It is totally ridiculous.
> > > It is not even a "cleanup".
> > > 
> > > Just drop the ABI-changing patches, and I think the rest of them looked
> > > OK, didn't they?
> > > 
> > I agree with you.
> 
> Oh actually what happened with the pagefault OOM / panic on oom thing?
> We were talking around in circles about that too.
> 
Hmm...checking again.

Maybe related patches are:
 1: oom-remove-special-handling-for-pagefault-ooms.patch
 2: oom-default-to-killing-current-for-pagefault-ooms.patch

IIUC, (1) doesn't make change. But (2)...

Before(1)
 - pagefault-oom kills someone by out_of_memory().
After (1)
 - pagefault-oom calls out_of_memory() only when someone isn't being killed.

So, this patch helps to avoid double-kill and I like this change.

Before (2)
 At pagefault-out-of-memory
  - panic_on_oom==2, panic always.
  - panic_on_oom==1, panic when CONSITRAINT_NONE.
 
After (2)
  At pagefault-put-of-memory, if there is no running OOM-Kill,
  current is killed always. In this case, panic_on_oom doesn't work.

I think panic_on_oom==2 should work.. Hmm. why this behavior changes ?

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22 10:27                                                     ` KAMEZAWA Hiroyuki
@ 2010-04-22 21:11                                                       ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2010-04-22 21:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nick Piggin, Andrew Morton, KOSAKI Motohiro, anfei, nishimura,
	Balbir Singh, linux-mm

On Thu, 22 Apr 2010, KAMEZAWA Hiroyuki wrote:

> Hmm...checking again.
> 
> Maybe related patches are:
>  1: oom-remove-special-handling-for-pagefault-ooms.patch
>  2: oom-default-to-killing-current-for-pagefault-ooms.patch
> 
> IIUC, (1) doesn't make change. But (2)...
> 
> Before(1)
>  - pagefault-oom kills someone by out_of_memory().
> After (1)
>  - pagefault-oom calls out_of_memory() only when someone isn't being killed.
> 
> So, this patch helps to avoid double-kill and I like this change.
> 
> Before (2)
>  At pagefault-out-of-memory
>   - panic_on_oom==2, panic always.
>   - panic_on_oom==1, panic when CONSITRAINT_NONE.
>  
> After (2)
>   At pagefault-put-of-memory, if there is no running OOM-Kill,
>   current is killed always. In this case, panic_on_oom doesn't work.
> 
> I think panic_on_oom==2 should work.. Hmm. why this behavior changes ?
> 

We can readd the panic_on_oom code once Nick's patchset is merged that 
unifies all architectures in using pagefault_out_of_memory() for 
VM_FAULT_OOM.  Otherwise, some architectures would panic in this case and 
others would not (while they allow tasks to be SIGKILL'd even when 
panic_on_oom == 2 is set, including OOM_DISABLE tasks!) so I think it's 
better to be entirely consistent with sysctl semantics across 
architectures.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22 10:09                                                   ` Nick Piggin
  2010-04-22 10:27                                                     ` KAMEZAWA Hiroyuki
@ 2010-04-22 10:28                                                     ` David Rientjes
  2010-04-22 15:39                                                       ` Nick Piggin
  1 sibling, 1 reply; 29+ messages in thread
From: David Rientjes @ 2010-04-22 10:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, anfei,
	nishimura, Balbir Singh, linux-mm

On Thu, 22 Apr 2010, Nick Piggin wrote:

> Oh actually what happened with the pagefault OOM / panic on oom thing?
> We were talking around in circles about that too.
> 

The oom killer rewrite attempts to kill current first, if possible, and 
then will panic if panic_on_oom is set before falling back to selecting a 
victim.  This is consistent with all other architectures such as powerpc 
that currently do not use pagefault_out_of_memory().  If all architectures 
are eventually going to be converted to using pagefault_out_of_memory() 
with additional work on top of -mm, it would be possible to define 
consistent panic_on_oom semantics for this case.  I welcome such an 
addition since I believe it's a natural extension of panic_on_oom, but I 
believe it should be done consistently so the sysctl doesn't have 
different semantics depending on the underlying arch.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22 10:28                                                     ` David Rientjes
@ 2010-04-22 15:39                                                       ` Nick Piggin
  2010-04-22 21:09                                                         ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2010-04-22 15:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, anfei,
	nishimura, Balbir Singh, linux-mm

On Thu, Apr 22, 2010 at 03:28:38AM -0700, David Rientjes wrote:
> On Thu, 22 Apr 2010, Nick Piggin wrote:
> 
> > Oh actually what happened with the pagefault OOM / panic on oom thing?
> > We were talking around in circles about that too.
> > 
> 
> The oom killer rewrite attempts to kill current first, if possible, and 
> then will panic if panic_on_oom is set before falling back to selecting a 
> victim.

See, this is what we want to avoid. If the user sets panic_on_oom,
it is because they want the system to panic on oom. Not to kill
tasks and try to continue. The user does not know or care in the
slightest about "page fault oom". So I don't know why you think this
is a good idea.


>  This is consistent with all other architectures such as powerpc 
> that currently do not use pagefault_out_of_memory().  If all architectures 
> are eventually going to be converted to using pagefault_out_of_memory() 

Yes, architectures are going to be converted, it has already been
agreed, I dropped the ball and lazily hoped the arch people would do it.
But further work done should be to make it consistent in the right way,
not the wrong way.


> with additional work on top of -mm, it would be possible to define 
> consistent panic_on_oom semantics for this case.  I welcome such an 
> addition since I believe it's a natural extension of panic_on_oom, but I 
> believe it should be done consistently so the sysctl doesn't have 
> different semantics depending on the underlying arch.

It's simply a bug rather than intentional semantics. "pagefault oom"
is basically a meaningless semantic for the user.

Let's do a deal. I'll split up the below patch and send it to arch
maintainers, and you don't change the sysctl interface or "fix" the
pagefault oom path.

--
Index: linux-2.6/arch/alpha/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/alpha/mm/fault.c
+++ linux-2.6/arch/alpha/mm/fault.c
@@ -188,16 +188,10 @@ do_page_fault(unsigned long address, uns
 	/* We ran out of memory, or some other thing happened to us that
 	   made us unable to handle the page fault gracefully.  */
  out_of_memory:
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk(KERN_ALERT "VM: killing process %s(%d)\n",
-	       current->comm, task_pid_nr(current));
 	if (!user_mode(regs))
 		goto no_context;
-	do_group_exit(SIGKILL);
+	pagefault_out_of_memory();
+	return;
 
  do_sigbus:
 	/* Send a sigbus, regardless of whether we were in kernel
Index: linux-2.6/arch/avr32/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/avr32/mm/fault.c
+++ linux-2.6/arch/avr32/mm/fault.c
@@ -211,15 +211,10 @@ no_context:
 	 */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk("VM: Killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	pagefault_out_of_memory();
+	if (!user_mode(regs))
+		goto no_context;
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/cris/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/cris/mm/fault.c
+++ linux-2.6/arch/cris/mm/fault.c
@@ -245,10 +245,10 @@ do_page_fault(unsigned long address, str
 
  out_of_memory:
 	up_read(&mm->mmap_sem);
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
  do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/frv/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/frv/mm/fault.c
+++ linux-2.6/arch/frv/mm/fault.c
@@ -257,10 +257,10 @@ asmlinkage void do_page_fault(int datamm
  */
  out_of_memory:
 	up_read(&mm->mmap_sem);
-	printk("VM: killing process %s\n", current->comm);
-	if (user_mode(__frame))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(__frame))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
  do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/ia64/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/fault.c
+++ linux-2.6/arch/ia64/mm/fault.c
@@ -276,13 +276,7 @@ ia64_do_page_fault (unsigned long addres
 
   out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk(KERN_CRIT "VM: killing process %s\n", current->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
 }
Index: linux-2.6/arch/m32r/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/m32r/mm/fault.c
+++ linux-2.6/arch/m32r/mm/fault.c
@@ -271,15 +271,10 @@ no_context:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(tsk)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk("VM: killing process %s\n", tsk->comm);
 	if (error_code & ACE_USERMODE)
-		do_group_exit(SIGKILL);
-	goto no_context;
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/m68k/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/m68k/mm/fault.c
+++ linux-2.6/arch/m68k/mm/fault.c
@@ -180,15 +180,10 @@ good_area:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-
-	printk("VM: killing process %s\n", current->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 no_context:
 	current->thread.signo = SIGBUS;
Index: linux-2.6/arch/microblaze/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/microblaze/mm/fault.c
+++ linux-2.6/arch/microblaze/mm/fault.c
@@ -273,16 +273,11 @@ bad_area_nosemaphore:
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	if (current->pid == 1) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
 	up_read(&mm->mmap_sem);
-	printk(KERN_WARNING "VM: killing process %s\n", current->comm);
-	if (user_mode(regs))
-		do_exit(SIGKILL);
-	bad_page_fault(regs, address, SIGKILL);
+	if (!user_mode(regs))
+		bad_page_fault(regs, address, SIGKILL);
+	else
+		pagefault_out_of_memory();
 	return;
 
 do_sigbus:
Index: linux-2.6/arch/mn10300/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/mn10300/mm/fault.c
+++ linux-2.6/arch/mn10300/mm/fault.c
@@ -338,11 +338,10 @@ no_context:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	monitor_signal(regs);
-	printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
-	if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
-		do_exit(SIGKILL);
-	goto no_context;
+	if ((fault_code & MMUFCR_xFC_ACCESS) != MMUFCR_xFC_ACCESS_USR)
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/parisc/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/parisc/mm/fault.c
+++ linux-2.6/arch/parisc/mm/fault.c
@@ -264,8 +264,7 @@ no_context:
 
   out_of_memory:
 	up_read(&mm->mmap_sem);
-	printk(KERN_CRIT "VM: killing process %s\n", current->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
 }
Index: linux-2.6/arch/powerpc/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/fault.c
+++ linux-2.6/arch/powerpc/mm/fault.c
@@ -359,15 +359,10 @@ bad_area_nosemaphore:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk("VM: killing process %s\n", current->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	return SIGKILL;
+	if (!user_mode(regs))
+		return SIGKILL;
+	pagefault_out_of_memory();
+	return 0;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/score/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/score/mm/fault.c
+++ linux-2.6/arch/score/mm/fault.c
@@ -167,15 +167,10 @@ no_context:
 	*/
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(tsk)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/sh/mm/fault_32.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/fault_32.c
+++ linux-2.6/arch/sh/mm/fault_32.c
@@ -290,15 +290,10 @@ no_context:
  */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	up_read(&mm->mmap_sem);
Index: linux-2.6/arch/sh/mm/tlbflush_64.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/tlbflush_64.c
+++ linux-2.6/arch/sh/mm/tlbflush_64.c
@@ -294,22 +294,11 @@ no_context:
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	if (is_global_init(current)) {
-		panic("INIT out of memory\n");
-		yield();
-		goto survive;
-	}
-	printk("fault:Out of memory\n");
 	up_read(&mm->mmap_sem);
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk("VM: killing process %s\n", tsk->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	goto no_context;
+	if (!user_mode(regs))
+		goto no_context;
+	pagefault_out_of_memory();
+	return;
 
 do_sigbus:
 	printk("fault:Do sigbus\n");
Index: linux-2.6/arch/xtensa/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/xtensa/mm/fault.c
+++ linux-2.6/arch/xtensa/mm/fault.c
@@ -146,15 +146,10 @@ bad_area:
 	 */
 out_of_memory:
 	up_read(&mm->mmap_sem);
-	if (is_global_init(current)) {
-		yield();
-		down_read(&mm->mmap_sem);
-		goto survive;
-	}
-	printk("VM: killing process %s\n", current->comm);
-	if (user_mode(regs))
-		do_group_exit(SIGKILL);
-	bad_page_fault(regs, address, SIGKILL);
+	if (!user_mode(regs))
+		bad_page_fault(regs, address, SIGKILL);
+	else
+		pagefault_out_of_memory();
 	return;
 
 do_sigbus:

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-22 15:39                                                       ` Nick Piggin
@ 2010-04-22 21:09                                                         ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2010-04-22 21:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, anfei,
	nishimura, Balbir Singh, linux-mm

On Fri, 23 Apr 2010, Nick Piggin wrote:

> > The oom killer rewrite attempts to kill current first, if possible, and 
> > then will panic if panic_on_oom is set before falling back to selecting a 
> > victim.
> 
> See, this is what we want to avoid. If the user sets panic_on_oom,
> it is because they want the system to panic on oom. Not to kill
> tasks and try to continue. The user does not know or care in the
> slightest about "page fault oom". So I don't know why you think this
> is a good idea.
> 

Unless we unify the behavior of panic_on_oom, it would be possible for the 
architectures that are not converted to using pagefault_out_of_memory() 
yet using your patch series to kill tasks even if they have OOM_DISABLE 
set.  So, as it sits this second in -mm, the system will still try to kill 
current first so that all architectures are consistent.  Once all 
architectures use pagefault_out_of_memory(), we can simply add

	if (sysctl_panic_on_oom) {
		read_lock(&tasklist_lock);
		dump_header(NULL, 0, 0, NULL);
		read_unlock(&tasklist_lock);
		panic("Out of memory: panic_on_oom is enabled\n");
	}

to pagefault_out_of_memory().  I simply opted for consistency across all 
architectures before that was done.

> >  This is consistent with all other architectures such as powerpc 
> > that currently do not use pagefault_out_of_memory().  If all architectures 
> > are eventually going to be converted to using pagefault_out_of_memory() 
> 
> Yes, architectures are going to be converted, it has already been
> agreed, I dropped the ball and lazily hoped the arch people would do it.
> But further work done should be to make it consistent in the right way,
> not the wrong way.
> 

Thanks for doing the work and proposing the patchset, there were a couple 
of patches that looked like it needed a v2, but overall it looked very 
good.  Once they're merged in upstream, I think we can add the panic to 
pagefault_out_of_memory() in -mm since they'll probably make it to Linus 
before the oom killer rewrite at the speed we're going here.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-21 19:17                                             ` Andrew Morton
  2010-04-21 22:04                                               ` David Rientjes
  2010-04-22  7:23                                               ` Nick Piggin
@ 2010-05-04 23:55                                               ` David Rientjes
  2 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2010-05-04 23:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
	Balbir Singh, linux-mm

On Wed, 21 Apr 2010, Andrew Morton wrote:

> 
> fyi, I still consider these patches to be in the "stuck" state.  So we
> need to get them unstuck.
> 
> 
> Hiroyuki (and anyone else): could you please summarise in the briefest
> way possible what your objections are to Daivd's oom-killer changes?
> 
> I'll start: we don't change the kernel ABI.  Ever.  And when we _do_
> change it we don't change it without warning.
> 

Have we resolved all of the outstanding discussion concerning the oom 
killer rewrite?  I'm not aware of any pending issues.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
  2010-04-07  0:20                                       ` KAMEZAWA Hiroyuki
  2010-04-07 13:29                                         ` KOSAKI Motohiro
@ 2010-04-08 17:36                                         ` David Rientjes
  1 sibling, 0 replies; 29+ messages in thread
From: David Rientjes @ 2010-04-08 17:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Andrew Morton, anfei, nishimura, Balbir Singh, linux-mm

On Wed, 7 Apr 2010, KAMEZAWA Hiroyuki wrote:

> > > oom-badness-heuristic-rewrite.patch
> > 
> > Do you have any specific feedback that you could offer on why you decided 
> > to nack this?
> > 
> 
> I like this patch. But I think no one can't Ack this because there is no
> "correct" answer. At least, this show good behavior on my environment.
> 

Agreed.  I think the new oom_badness() function is much better than the 
current heuristic and should prevent X from being killed as we've 
discussed fairly often on LKML over the past six months.

> > Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if 
> > there's a finer grained alternative available already unless you want 
> > /proc/pid/oom_adj to actually mean something in which case you'll never be 
> > able to seperate oom badness scores from bitshifts.  I believe everyone 
> > agrees that a more understood and finer grained tunable is necessary as 
> > compared to the current implementation that has very limited functionality 
> > other than polarizing tasks.
> > 
> 
> If oom-badness-heuristic-rewrite.patch will go ahead, this should go.
> But my concern is administorator has to check all oom_score_adj and
> tune it again if he adds more memory to the system.
> 
> Now, not-small amount of people use Virtual Machine or Contaienr. So, this
> oom_score_adj's sensivity to the size of memory can put admins to hell.
> 

Would you necessarily want to change oom_score_adj when you add or remove 
memory?  I see the currently available pool of memory available (whether 
it is system-wide, constrained to a cpuset mems, mempolicy nodes, or memcg 
limits) as a shared resource so if you want to bias a task by 25% of 
available memory by using an oom_score_adj of 250, that doesn't change if 
we add or remove memory.  It still means that the task should be biased by 
that amount in comparison to other tasks.

My perspective is that we should define oom killing priorities is terms of 
how much memory tasks are using compared to others and that the actual 
capacity itself is irrelevant if its a shared resource.  So when tasks are 
moved into a memcg, for example, that becomes a "virtualized system" with 
a more limited shared memory resource and has the same bias (or 
preference) that it did when it was in the root cgroup.

In other words, I think it would be more inconvenient to update 
oom_score_adj anytime a task changes memcg, is attached to a different 
cpuset, or is bound to nodes by way of a mempolicy.  In these scenarios, I 
see them as simply having a restricted set of allowed memory yet the bias 
can remain the same.

Users who do actually want to bias a task by a memory quantity can easily 
do so, but I think they would be in the minority and we hope to avoid 
adding unnecessary tunables when a conversion to the appropriate 
oom_score_adj value is possible with a simple divide.

> > > oom-replace-sysctls-with-quick-mode.patch
> > > 
> > > IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
> > 
> > Which patch of the four you listed are you referring to here?
> > 
> replacing used sysctl is bad idea, in general.
> 

I agree, but since the audience for both of these sysctls will need to do 
echo 0 > /proc/sys/vm/oom_dump_tasks as the result of this patchset since 
it is now enabled by default, do you think we can take this as an 
opportunity to consolidate them down into one?  Otherwise, we're obliged 
to continue to support them indefinitely even though their only users are 
the exact same systems.

> I have no _strong_ opinion. I welcome the patch series. But aboves are my concern.
> Thank you for your work.
> 

Thanks, Kame, I appreciate that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2010-05-04 23:56 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-04 23:51 [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
  -- strict thread matches above, loose matches on Subject: below --
2010-03-28 14:55 [PATCH] oom killer: break from infinite loop anfei
2010-03-28 16:28 ` Oleg Nesterov
2010-03-28 21:21   ` David Rientjes
2010-03-29 14:06     ` anfei
2010-03-29 20:01       ` David Rientjes
2010-03-30 14:29         ` anfei
2010-03-30 20:29           ` David Rientjes
2010-03-31  0:57             ` KAMEZAWA Hiroyuki
2010-03-31  6:07               ` David Rientjes
2010-03-31  6:13                 ` KAMEZAWA Hiroyuki
2010-03-31  6:30                   ` Balbir Singh
2010-03-31  6:32                     ` David Rientjes
2010-03-31  7:08                       ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
2010-03-31  7:08                         ` KAMEZAWA Hiroyuki
2010-03-31  8:04                         ` Balbir Singh
2010-03-31 10:38                           ` David Rientjes
2010-04-04 23:28                         ` David Rientjes
2010-04-05 21:30                           ` Andrew Morton
2010-04-05 22:40                             ` David Rientjes
2010-04-05 22:49                               ` Andrew Morton
2010-04-05 23:01                                 ` David Rientjes
2010-04-06 12:08                                   ` KOSAKI Motohiro
2010-04-06 21:47                                     ` David Rientjes
2010-04-07  0:20                                       ` KAMEZAWA Hiroyuki
2010-04-07 13:29                                         ` KOSAKI Motohiro
2010-04-08 18:05                                           ` David Rientjes
2010-04-21 19:17                                             ` Andrew Morton
2010-04-21 22:04                                               ` David Rientjes
2010-04-22  0:23                                                 ` KAMEZAWA Hiroyuki
2010-04-22  8:34                                                   ` David Rientjes
2010-04-22  7:23                                               ` Nick Piggin
2010-04-22  7:25                                                 ` KAMEZAWA Hiroyuki
2010-04-22 10:09                                                   ` Nick Piggin
2010-04-22 10:27                                                     ` KAMEZAWA Hiroyuki
2010-04-22 21:11                                                       ` David Rientjes
2010-04-22 10:28                                                     ` David Rientjes
2010-04-22 15:39                                                       ` Nick Piggin
2010-04-22 21:09                                                         ` David Rientjes
2010-05-04 23:55                                               ` David Rientjes
2010-04-08 17:36                                         ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox