linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm,oom: remove shortcuts for SIGKILL and PF_EXITING cases
@ 2016-02-21  7:14 Tetsuo Handa
  2016-02-23  0:47 ` David Rientjes
  0 siblings, 1 reply; 4+ messages in thread
From: Tetsuo Handa @ 2016-02-21  7:14 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, hannes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel, Tetsuo Handa

Currently, sending SIGKILL to all user processes sharing the same memory
is omitted by three locations. But they should be removed due to possible
OOM livelock sequence shown below.

About setting TIF_MEMIE on current->mm && fatal_signal_pending(current)
at out_of_memory():

  There are two thread groups named P1 and P2 that are created using
  clone(!CLONE_SIGHAND && CLONE_VM) and one independent thread group
  named P3. A sequence shown below is possible.

  ----------
  P1             P2             P3
  Do something that invokes a __GFP_FS memory allocation (e.g. page fault).
                 Calls mmap().
                                Calls kill(P1, SIGKILL).
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
                                Sends SIGKILL to P1.
  fatal_signal_pending(P1) becomes true.
                 Calls do_mmap_pgoff().
  Calls out_of_memory().
  Gets TIF_MEMDIE.
                 Calls out_of_memory().
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  Arrives at do_exit().
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 oom_scan_process_thread() still returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.
  This race condition can be avoided if P1 sent SIGKILL to P2 when
  P1 called out_of_memory().

About setting TIF_MEMIE on current->mm && task_will_free_mem(current)
at out_of_memory():

  There are two thread groups named P1 and P2 that are created using
  clone(!CLONE_SIGHAND && CLONE_VM). A sequence shown below is possible.

  ----------
  P1             P2
  Calls _exit(0).
                 Calls mmap().
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
  Arrives at do_exit().
  Gets PF_EXITING via exit_signals().
  Calls tty_audit_exit().
  Do a GFP_KERNEL allocation from tty_audit_log().
  Calls out_of_memory().
  Gets TIF_MEMDIE.
                 Calls out_of_memory().
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 oom_scan_process_thread() still returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.
  This race condition can be avoided if P1 sent SIGKILL to P2 when
  P1 called out_of_memory().

About setting TIF_MEMIE on p->mm && task_will_free_mem(p)
at oom_kill_process():

  There are two thread groups named P1 and P2 that are created using
  clone(!CLONE_SIGHAND && CLONE_VM) and one independent thread group
  named P3. A sequence shown below is possible.

  ----------
  P1             P2
  Calls _exit(0).
                 Calls mmap().
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
  Arrives at do_exit().
  Gets PF_EXITING via exit_signals().
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 Calls do_mmap_pgoff().
                 Calls out_of_memory().
                 select_bad_process() returns P1.
                 oom_kill_process() sets TIF_MEMDIE on P1.
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.
  This race condition can be avoided if P2 sent SIGKILL to P2 when
  P2 called out_of_memory().

About setting TIF_MEMIE on fatal_signal_pending(current)
at mem_cgroup_out_of_memory():

  mem_cgroup_out_of_memory() is called from a lockless context via
  mem_cgroup_oom_synchronize() called from pagefault_out_of_memory()
  is talking about only current thread. If global OOM condition follows
  before memcg OOM condition is solved, the same problem will occur.

About setting TIF_MEMIE on task_will_free_mem(current)
at mem_cgroup_out_of_memory():

  I don't know whether it is possible to call mem_cgroup_out_of_memory()
  between getting PF_EXITING and doing current->mm = NULL. But if it is
  possible to call, then the same problem will occur.

And since removing these shortcuts breaks a wrong and optimistic
assumption in oom_kill_process()

  /*
   * Kill all user processes sharing victim->mm in other thread groups, if
   * any.  They don't get access to memory reserves, though, to avoid
   * depletion of all memory.  This prevents mm->mmap_sem livelock when an
   * oom killed thread cannot exit because it requires the semaphore and
   * its contended by another thread trying to allocate memory itself.
   * That thread will now get access to memory reserves since it has a
   * pending fatal signal.
   */

set TIF_MEMDIE on all threads in thread groups which got SIGKILL by
the OOM killer. This will help getting threads doing !__GFP_FS allocations
which are not allowed to call out_of_memory() (which will set TIF_MEMDIE
via current->mm && fatal_signal_pending(current) case).

Ideally p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN thread groups
need to get SIGKILL and TIF_MEMDIE as well because it is possible that
one of threads in such thread groups are holding mm->mmap_sem lock for
write.

Of course, situations described in this patch are corner cases. But
since we are not going to add memory allocation watchdog mechanism nor
timeout based next OOM victim selection logic, we need to eliminate
all possible corner cases. Silent OOM livelock is really annoying.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 10 ----------
 mm/oom_kill.c   | 46 ++++++++++------------------------------------
 2 files changed, 10 insertions(+), 46 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae8b81c..390ec2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1253,16 +1253,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	mutex_lock(&oom_lock);
 
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		goto unlock;
-	}
-
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7bb9c1..5e8563a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -684,19 +684,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
 
@@ -759,20 +746,15 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_unlock(victim);
 
 	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
+	 * Kill all user processes sharing victim->mm. This reduces possibility
+	 * of hitting mm->mmap_sem livelock when an oom killed thread cannot
+	 * exit because it requires the semaphore and its contended by another
+	 * thread trying to allocate memory itself.
 	 */
 	rcu_read_lock();
 	for_each_process(p) {
 		if (!process_shares_mm(p, mm))
 			continue;
-		if (same_thread_group(p, victim))
-			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
 		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 			/*
@@ -784,6 +766,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		for_each_thread(p, t) {
+			task_lock(t);
+			if (t->mm)
+				mark_oom_victim(t);
+			task_unlock(t);
+		}
 	}
 	rcu_read_unlock();
 
@@ -860,20 +848,6 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
-	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		mark_oom_victim(current);
-		return true;
-	}
-
-	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm,oom: remove shortcuts for SIGKILL and PF_EXITING cases
  2016-02-21  7:14 [PATCH] mm,oom: remove shortcuts for SIGKILL and PF_EXITING cases Tetsuo Handa
@ 2016-02-23  0:47 ` David Rientjes
  2016-02-23 10:38   ` Tetsuo Handa
  0 siblings, 1 reply; 4+ messages in thread
From: David Rientjes @ 2016-02-23  0:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, hannes, mgorman, oleg, torvalds, hughd, andrea,
	riel, linux-mm, linux-kernel

On Sun, 21 Feb 2016, Tetsuo Handa wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae8b81c..390ec2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1253,16 +1253,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	mutex_lock(&oom_lock);
>  
> -	/*
> -	 * If current has a pending SIGKILL or is exiting, then automatically
> -	 * select it.  The goal is to allow it to allocate so that it may
> -	 * quickly exit and free its memory.
> -	 */
> -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> -		mark_oom_victim(current);
> -		goto unlock;
> -	}
> -
>  	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
>  	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
>  	for_each_mem_cgroup_tree(iter, memcg) {
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d7bb9c1..5e8563a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -684,19 +684,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  					      DEFAULT_RATELIMIT_BURST);
>  	bool can_oom_reap = true;
>  
> -	/*
> -	 * If the task is already exiting, don't alarm the sysadmin or kill
> -	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> -	 */
> -	task_lock(p);
> -	if (p->mm && task_will_free_mem(p)) {
> -		mark_oom_victim(p);
> -		task_unlock(p);
> -		put_task_struct(p);
> -		return;
> -	}
> -	task_unlock(p);
> -
>  	if (__ratelimit(&oom_rs))
>  		dump_header(oc, p, memcg);
>  
> @@ -759,20 +746,15 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	task_unlock(victim);
>  
>  	/*
> -	 * Kill all user processes sharing victim->mm in other thread groups, if
> -	 * any.  They don't get access to memory reserves, though, to avoid
> -	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
> -	 * oom killed thread cannot exit because it requires the semaphore and
> -	 * its contended by another thread trying to allocate memory itself.
> -	 * That thread will now get access to memory reserves since it has a
> -	 * pending fatal signal.
> +	 * Kill all user processes sharing victim->mm. This reduces possibility
> +	 * of hitting mm->mmap_sem livelock when an oom killed thread cannot
> +	 * exit because it requires the semaphore and its contended by another
> +	 * thread trying to allocate memory itself.
>  	 */
>  	rcu_read_lock();
>  	for_each_process(p) {
>  		if (!process_shares_mm(p, mm))
>  			continue;
> -		if (same_thread_group(p, victim))
> -			continue;
>  		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
>  		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>  			/*
> @@ -784,6 +766,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> +		for_each_thread(p, t) {
> +			task_lock(t);
> +			if (t->mm)
> +				mark_oom_victim(t);
> +			task_unlock(t);
> +		}
>  	}
>  	rcu_read_unlock();
>  
> @@ -860,20 +848,6 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  
>  	/*
> -	 * If current has a pending SIGKILL or is exiting, then automatically
> -	 * select it.  The goal is to allow it to allocate so that it may
> -	 * quickly exit and free its memory.
> -	 *
> -	 * But don't select if current has already released its mm and cleared
> -	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> -	 */
> -	if (current->mm &&
> -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> -		mark_oom_victim(current);
> -		return true;
> -	}
> -
> -	/*
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */

No, NACK.  You cannot prohibit an exiting process from gaining access to 
memory reserves and randomly killing another process without additional 
chances of a livelock.  The goal is for an exiting or killed process to 
be able to exit so it can free its memory, not kill additional processes.

Please start trimming your cc list, I seriously doubt all these people are 
interested in this thread.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm,oom: remove shortcuts for SIGKILL and PF_EXITING cases
  2016-02-23  0:47 ` David Rientjes
@ 2016-02-23 10:38   ` Tetsuo Handa
  2016-02-23 13:24     ` Tetsuo Handa
  0 siblings, 1 reply; 4+ messages in thread
From: Tetsuo Handa @ 2016-02-23 10:38 UTC (permalink / raw)
  To: rientjes; +Cc: mhocko, hannes, linux-mm

David Rientjes wrote:
> On Sun, 21 Feb 2016, Tetsuo Handa wrote:
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae8b81c..390ec2c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1253,16 +1253,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  
> >  	mutex_lock(&oom_lock);
> >  
> > -	/*
> > -	 * If current has a pending SIGKILL or is exiting, then automatically
> > -	 * select it.  The goal is to allow it to allocate so that it may
> > -	 * quickly exit and free its memory.
> > -	 */
> > -	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > -		mark_oom_victim(current);
> > -		goto unlock;
> > -	}
> > -
> >  	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
> >  	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
> >  	for_each_mem_cgroup_tree(iter, memcg) {
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index d7bb9c1..5e8563a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -684,19 +684,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  					      DEFAULT_RATELIMIT_BURST);
> >  	bool can_oom_reap = true;
> >  
> > -	/*
> > -	 * If the task is already exiting, don't alarm the sysadmin or kill
> > -	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> > -	 */
> > -	task_lock(p);
> > -	if (p->mm && task_will_free_mem(p)) {
> > -		mark_oom_victim(p);
> > -		task_unlock(p);
> > -		put_task_struct(p);
> > -		return;
> > -	}
> > -	task_unlock(p);
> > -
> >  	if (__ratelimit(&oom_rs))
> >  		dump_header(oc, p, memcg);
> >  
> > @@ -759,20 +746,15 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	task_unlock(victim);
> >  
> >  	/*
> > -	 * Kill all user processes sharing victim->mm in other thread groups, if
> > -	 * any.  They don't get access to memory reserves, though, to avoid
> > -	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
> > -	 * oom killed thread cannot exit because it requires the semaphore and
> > -	 * its contended by another thread trying to allocate memory itself.
> > -	 * That thread will now get access to memory reserves since it has a
> > -	 * pending fatal signal.
> > +	 * Kill all user processes sharing victim->mm. This reduces possibility
> > +	 * of hitting mm->mmap_sem livelock when an oom killed thread cannot
> > +	 * exit because it requires the semaphore and its contended by another
> > +	 * thread trying to allocate memory itself.
> >  	 */
> >  	rcu_read_lock();
> >  	for_each_process(p) {
> >  		if (!process_shares_mm(p, mm))
> >  			continue;
> > -		if (same_thread_group(p, victim))
> > -			continue;
> >  		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> >  		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> >  			/*
> > @@ -784,6 +766,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		}
> >  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> > +		for_each_thread(p, t) {
> > +			task_lock(t);
> > +			if (t->mm)
> > +				mark_oom_victim(t);
> > +			task_unlock(t);
> > +		}
> >  	}
> >  	rcu_read_unlock();
> >  
> > @@ -860,20 +848,6 @@ bool out_of_memory(struct oom_control *oc)
> >  		return true;
> >  
> >  	/*
> > -	 * If current has a pending SIGKILL or is exiting, then automatically
> > -	 * select it.  The goal is to allow it to allocate so that it may
> > -	 * quickly exit and free its memory.
> > -	 *
> > -	 * But don't select if current has already released its mm and cleared
> > -	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> > -	 */
> > -	if (current->mm &&
> > -	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > -		mark_oom_victim(current);
> > -		return true;
> > -	}
> > -
> > -	/*
> >  	 * Check if there were limitations on the allocation (only relevant for
> >  	 * NUMA) that may require different handling.
> >  	 */
> 
> No, NACK.  You cannot prohibit an exiting process from gaining access to 
> memory reserves and randomly killing another process without additional 
> chances of a livelock.  The goal is for an exiting or killed process to 
> be able to exit so it can free its memory, not kill additional processes.

I know what these shortcuts are trying to do. I'm pointing out that these
shortcuts have a chance of silent OOM livelock. If we preserve these shortcuts,
we had better not to wait forever. We need to kill additional processes if
exiting or killed process seems to got stuck.

Same with http://lkml.kernel.org/r/20160217143917.GP29196@dhcp22.suse.cz .

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm,oom: remove shortcuts for SIGKILL and PF_EXITING cases
  2016-02-23 10:38   ` Tetsuo Handa
@ 2016-02-23 13:24     ` Tetsuo Handa
  0 siblings, 0 replies; 4+ messages in thread
From: Tetsuo Handa @ 2016-02-23 13:24 UTC (permalink / raw)
  To: rientjes; +Cc: mhocko, hannes, linux-mm

Tetsuo Handa wrote:
> > No, NACK.  You cannot prohibit an exiting process from gaining access to 
> > memory reserves and randomly killing another process without additional 
> > chances of a livelock.  The goal is for an exiting or killed process to 
> > be able to exit so it can free its memory, not kill additional processes.
> 
> I know what these shortcuts are trying to do. I'm pointing out that these
> shortcuts have a chance of silent OOM livelock. If we preserve these shortcuts,
> we had better not to wait forever. We need to kill additional processes if
> exiting or killed process seems to got stuck.
> 
> Same with http://lkml.kernel.org/r/20160217143917.GP29196@dhcp22.suse.cz .
> 
Or, can we accept something like below?
--------------------------------------------------------------------------------
>From 3a9231486624ad34bbf84f9798523c05f5d401d5 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 23 Feb 2016 22:04:49 +0900
Subject: [PATCH] mm,oom: check mmap_sem lockability when using shortcuts for
 SIGKILL and PF_EXITING cases

Currently, oom_scan_process_thread() returns OOM_SCAN_ABORT when there is
a thread which is exiting. But it is possible that that thread is blocked
at down_read(&mm->mmap_sem) in exit_mm() called from do_exit() whereas
one of threads sharing that memory is doing a GFP_KERNEL allocation
between down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
(e.g. mmap()).

----------
T1                  T2
                    Calls mmap()
Calls _exit(0)
                    Arrives at vm_mmap_pgoff()
Arrives at do_exit()
Gets PF_EXITING via exit_signals()
                    Calls down_write(&mm->mmap_sem)
                    Calls do_mmap_pgoff()
Calls down_read(&mm->mmap_sem) from exit_mm()
                    Calls out of memory via a GFP_KERNEL allocation but
                    oom_scan_process_thread(T1) returns OOM_SCAN_ABORT
----------

down_read(&mm->mmap_sem) by T1 is waiting for up_write(&mm->mmap_sem) by
T2 while oom_scan_process_thread() by T2 is waiting for T1 to set
T1->mm = NULL. Under such situation, the OOM killer does not choose
a victim, which results in silent OOM livelock problem.

Also, sending SIGKILL to all user processes sharing the same memory is
omitted by three locations.

About setting TIF_MEMIE on current->mm && fatal_signal_pending(current)
at out_of_memory():

  There are two thread groups named P1 and P2 that are created using
  clone(!CLONE_SIGHAND && CLONE_VM) and one independent thread group
  named P3. A sequence shown below is possible.

  ----------
  P1             P2             P3
  Do something that invokes a __GFP_FS memory allocation (e.g. page fault).
                 Calls mmap().
                                Calls kill(P1, SIGKILL).
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
                                Sends SIGKILL to P1.
  fatal_signal_pending(P1) becomes true.
                 Calls do_mmap_pgoff().
  Calls out_of_memory().
  Gets TIF_MEMDIE.
                 Calls out_of_memory().
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  Arrives at do_exit().
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 oom_scan_process_thread() still returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.

About setting TIF_MEMIE on current->mm && task_will_free_mem(current)
at out_of_memory():

  There are two thread groups named P1 and P2 that are created using
  clone(CLONE_VM). A sequence shown below is possible.

  ----------
  P1             P2
  Calls _exit(0).
                 Calls mmap().
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
  Arrives at do_exit().
  Gets PF_EXITING via exit_signals().
  Calls tty_audit_exit().
  Do a GFP_KERNEL allocation from tty_audit_log().
  Calls out_of_memory().
  Gets TIF_MEMDIE.
                 Calls out_of_memory().
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 oom_scan_process_thread() still returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.

About setting TIF_MEMIE on p->mm && task_will_free_mem(p)
at oom_kill_process():

  There are two thread groups named P1 and P2 that are created using
  clone(CLONE_VM). A sequence shown below is possible.

  ----------
  P1             P2
  Calls _exit(0).
                 Calls mmap().
                 Arrives at vm_mmap_pgoff().
                 Calls down_write(&mm->mmap_sem).
  Arrives at do_exit().
  Gets PF_EXITING via exit_signals().
  Calls down_read(&mm->mmap_sem) at exit_mm().
                 Calls do_mmap_pgoff().
                 Calls out_of_memory().
                 select_bad_process() returns P1.
                 oom_kill_process() sets TIF_MEMDIE on P1.
                 oom_scan_process_thread() returns OOM_SCAN_ABORT.
  ----------

  P1 is waiting for P2 to call up_write(&mm->mmap_sem) but P2 won't
  give up memory allocation because fatal_signal_pending(P2) is false.

About setting TIF_MEMIE on fatal_signal_pending(current)
at mem_cgroup_out_of_memory():

  mem_cgroup_out_of_memory() is called from a lockless context via
  mem_cgroup_oom_synchronize() called from pagefault_out_of_memory()
  is talking about only current thread. If global OOM condition follows
  before memcg OOM condition is solved, the same problem will occur.

About setting TIF_MEMIE on task_will_free_mem(current)
at mem_cgroup_out_of_memory():

  I don't know whether it is possible to call mem_cgroup_out_of_memory()
  between getting PF_EXITING and doing current->mm = NULL. But if it is
  possible to call, then the same problem will occur.

This patch checks whether the mm of a thread which the caller of
out_of_memory() is trying to wait for termination can be locked for read,
in order to avoid silent OOM livelock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |  1 +
 mm/memcontrol.c     |  3 ++-
 mm/oom_kill.c       | 39 ++++++++++++++++++++++++++++++---------
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..efd7aa5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -86,6 +86,7 @@ extern void check_panic_on_oom(struct oom_control *oc,
 			       enum oom_constraint constraint,
 			       struct mem_cgroup *memcg);
 
+extern bool task_can_read_lock_mm(struct task_struct *tsk);
 extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 		struct task_struct *task, unsigned long totalpages);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae8b81c..c0dca1b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1258,7 +1258,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    task_can_read_lock_mm(current)) {
 		mark_oom_victim(current);
 		goto unlock;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7bb9c1..8a27967 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,6 +268,22 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+bool task_can_read_lock_mm(struct task_struct *tsk)
+{
+	struct mm_struct *mm;
+	bool ret = false;
+
+	task_lock(tsk);
+	mm = tsk->mm;
+	if (mm && down_read_trylock(&mm->mmap_sem)) {
+		up_read(&mm->mmap_sem);
+		ret = true;
+	}
+	task_unlock(tsk);
+	return ret;
+}
+
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
@@ -278,7 +294,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves.
 	 */
-	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
+	if (test_tsk_thread_flag(task, TIF_MEMDIE) &&
+	    task_can_read_lock_mm(task)) {
 		if (!is_sysrq_oom(oc))
 			return OOM_SCAN_ABORT;
 	}
@@ -292,7 +309,8 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task_will_free_mem(task) && !is_sysrq_oom(oc))
+	if (task_will_free_mem(task) && !is_sysrq_oom(oc) &&
+	    task_can_read_lock_mm(task))
 		return OOM_SCAN_ABORT;
 
 	return OOM_SCAN_OK;
@@ -688,14 +706,16 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
+	if (task_can_read_lock_mm(p)) {
+		task_lock(p);
+		if (p->mm && task_will_free_mem(p)) {
+			mark_oom_victim(p);
+			task_unlock(p);
+			put_task_struct(p);
+			return;
+		}
 		task_unlock(p);
-		put_task_struct(p);
-		return;
 	}
-	task_unlock(p);
 
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
@@ -868,7 +888,8 @@ bool out_of_memory(struct oom_control *oc)
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
 	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	    (fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    task_can_read_lock_mm(current)) {
 		mark_oom_victim(current);
 		return true;
 	}
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-02-23 13:25 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-21  7:14 [PATCH] mm,oom: remove shortcuts for SIGKILL and PF_EXITING cases Tetsuo Handa
2016-02-23  0:47 ` David Rientjes
2016-02-23 10:38   ` Tetsuo Handa
2016-02-23 13:24     ` Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox