[RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
@ 2016-05-26 15:27 Michal Hocko
  2016-05-27 10:31 ` Tetsuo Handa
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2016-05-26 15:27 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

oom_reaper relies on the mmap_sem for read to do its job. Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot. Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails. If the flag is present already
after the failure then we set MMF_OOM_REAPED to hide this mm from the
oom killer completely so it can go and chose another victim.

As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to do perform all
reclaim attempts and stay predictable in the behavior.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
I believe that after [1] and this patch we can reasonably expect that
the risk of the oom lockups is so low that we do not need to employ
timeout based solutions. I am sending this as an RFC because there still
might be better ways to accomplish the similar effect. I just like this
one because it is nicely grafted into the oom reaper which will now be
invoked for basically all oom victims.

[1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org

 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..12a4c8e04e6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..2f82f2a558e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -531,8 +531,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 
-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-26 15:27 [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
@ 2016-05-27 10:31 ` Tetsuo Handa
  2016-05-27 12:23   ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Tetsuo Handa @ 2016-05-27 10:31 UTC (permalink / raw)
  To: mhocko, linux-mm; +Cc: rientjes, akpm, oleg, vdavydov, mhocko

Michal Hocko wrote:
> Hi,
> I believe that after [1] and this patch we can reasonably expect that
> the risk of the oom lockups is so low that we do not need to employ
> timeout based solutions. I am sending this as an RFC because there still
> might be better ways to accomplish the similar effect. I just like this
> one because it is nicely grafted into the oom reaper which will now be
> invoked for basically all oom victims.
> 
> [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org

I still cannot agree with "we do not need to employ timeout based solutions".

While it is true that OOM-reap is per "struct mm_struct" action, we don't
need to change user visible oom_score_adj interface by [1] in order to
enforce OOM-kill being per "struct mm_struct" action.

It is possible that a victim thread releases a lot of memory by closing
pipe's file descriptors at exit_files() (before exit_task_work() which is
between exit_mm() and exit_notify()). The problem is that there is no
trigger for giving up (e.g. timeout) when that optimistic expectation
failed. As long as we wake up the OOM reaper, we can use the OOM reaper
as a trigger for giving up, and we can perfectly avoid OOM lockups
as long as the OOM killer is invoked.

We were too much focused on making sure that TIF_MEMDIE thread does not get stuck
at down_read() in exit_mm(). We forgot about "tsk->mm = NULL -> exit_aio() etc.
by mmput() -> exit_oom_victim()" sequence where exit_aio() can be blocked on I/O
which involves memory allocation and oom_scan_process_thread() returns
OOM_SCAN_ABORT due to task->signal->oom_victims > 0 and __oom_reap_task() cannot
reap due to find_lock_task_mm() returning NULL and/or mm->mm_users is already 0.

Yes, commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
queued for oom_reaper") which went to Linux 4.7-rc1 will clear TIF_MEMDIE and
decrement task->signal->oom_victims even if __oom_reap_task() cannot reap
so that oom_scan_process_thread() will not return OOM_SCAN_ABORT forever.
But still, such unlocking depends on an assumption that wake_oom_reaper() is
always called.

What we need to have is "always call wake_oom_reaper() in order to let the
OOM reaper clear TIF_MEMDIE and mark as no longer OOM-killable" or "ignore
TIF_MEMDIE after some timeout". As you hate timeout, I propose below patch
instead of [1] and your "[RFC PATCH] mm, oom_reaper: do not attempt to reap
a task more than twice".
----------
>From a6b9f155e99971ef6144583a9ca1f427b0a85df8 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 27 May 2016 15:59:29 +0900
Subject: [PATCH] mm,oom: Make sure OOM killer always make forward progress

This patch is for handling problems listed below.

  (1) Incomplete TIF_MEMDIE shortcuts which may cause OOM livelock
      due to failing to wake up the OOM reaper.

  (2) Infinite retry loop which may cause OOM livelock because
      the OOM killer continues selecting the same victim forever.

  (3) MM shared by unkillable tasks which may cause OOM livelock
      because the OOM killer cannot wake up the OOM reaper.

The core of this patch is mm_is_reapable() which examines whether all
threads using a given mm is dying. While mm_is_reapable() is costly and
slow operation, it is true only until that mm gets MMF_OOM_REAPABLE.
It is likely that once oom_kill_process() selects an OOM victim,
subsequent mm_is_reapable() calls (from __oom_reap_task() by the OOM
reaper and/or out_of_memory()/mem_cgroup_out_of_memory() shortcuts
by any threads sharing that OOM victim's mm) find MMF_OOM_REAPABLE.

There are two exceptions which this patch does not address.
One is that it is theoretically possible that sending SIGKILL to
oom_task_origin() task can get stuck, as explained as
http://lkml.kernel.org/r/1463796090-7948-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
The other is that it is theoretically possible that current thread which
got TIF_MEMDIE using out_of_memory()/mem_cgroup_out_of_memory() shortcuts
gets stuck due to unable to satisfy __GFP_NOFAIL allocation request (and
therefore needs to select next OOM victim). We will be able to avoid such
infinite retry loop using per "struct task_struct" atomic flags.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h   |  32 +----------
 include/linux/sched.h |   1 +
 mm/memcontrol.c       |   7 +--
 mm/oom_kill.c         | 153 ++++++++++++++++++++++++++------------------------
 4 files changed, 84 insertions(+), 109 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8346952..3f4453f 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,15 +67,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
-
-#ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
-#else
-static inline void try_oom_reaper(struct task_struct *tsk)
-{
-}
-#endif
+extern bool task_is_reapable(struct task_struct *tsk);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -105,28 +97,6 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
-{
-	struct signal_struct *sig = task->signal;
-
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-		return false;
-
-	if (!(task->flags & PF_EXITING))
-		return false;
-
-	/* Make sure that the whole thread group is going down */
-	if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
-		return false;
-
-	return true;
-}
-
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3e84ac..d748f16 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -523,6 +523,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_REAPABLE	22      /* mm is ready to be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37ba604..aed588e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1271,15 +1271,12 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	mutex_lock(&oom_lock);
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * If current's memory is ready to be OOM-reaped, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		try_oom_reaper(current);
+	if (task_is_reapable(current))
 		goto unlock;
-	}
 
 	check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg);
 	totalpages = mem_cgroup_get_limit(memcg) ? : 1;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1685890..95fce47 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -428,6 +428,50 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
+bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *p;
+
+	if (!mm)
+		return false;
+	if (test_bit(MMF_OOM_REAPABLE, &mm->flags))
+		return true;
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
+	up_read(&mm->mmap_sem);
+	/*
+	 * There might be other threads/processes which are either not
+	 * dying or even not killable.
+	 */
+	if (atomic_read(&mm->mm_users) > 1) {
+		rcu_read_lock();
+		for_each_process(p) {
+			bool exiting;
+
+			if (!process_shares_mm(p, mm))
+				continue;
+			if (fatal_signal_pending(p))
+				continue;
+
+			/*
+			 * If the task is exiting make sure the whole thread
+			 * group is exiting and cannot access mm anymore.
+			 */
+			spin_lock_irq(&p->sighand->siglock);
+			exiting = signal_group_exit(p->signal);
+			spin_unlock_irq(&p->sighand->siglock);
+			if (exiting)
+				continue;
+
+			/* Give up */
+			rcu_read_unlock();
+			return false;
+		}
+		rcu_read_unlock();
+	}
+	set_bit(MMF_OOM_REAPABLE, &mm->flags);
+	return true;
+}
 
 #ifdef CONFIG_MMU
 /*
@@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto unlock_oom;
 	}
@@ -553,6 +597,9 @@ static void oom_reap_task(struct task_struct *tsk)
 		debug_show_all_locks();
 	}
 
+	/* Do not allow the OOM killer to select this thread group again. */
+	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+
 	/*
 	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
 	 * reasonably reclaimable memory anymore or it is not a good candidate
@@ -606,51 +653,6 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	wake_up(&oom_reaper_wait);
 }
 
-/* Check if we can reap the given task. This has to be called with stable
- * tsk->mm
- */
-void try_oom_reaper(struct task_struct *tsk)
-{
-	struct mm_struct *mm = tsk->mm;
-	struct task_struct *p;
-
-	if (!mm)
-		return;
-
-	/*
-	 * There might be other threads/processes which are either not
-	 * dying or even not killable.
-	 */
-	if (atomic_read(&mm->mm_users) > 1) {
-		rcu_read_lock();
-		for_each_process(p) {
-			bool exiting;
-
-			if (!process_shares_mm(p, mm))
-				continue;
-			if (fatal_signal_pending(p))
-				continue;
-
-			/*
-			 * If the task is exiting make sure the whole thread group
-			 * is exiting and cannot acces mm anymore.
-			 */
-			spin_lock_irq(&p->sighand->siglock);
-			exiting = signal_group_exit(p->signal);
-			spin_unlock_irq(&p->sighand->siglock);
-			if (exiting)
-				continue;
-
-			/* Give up */
-			rcu_read_unlock();
-			return;
-		}
-		rcu_read_unlock();
-	}
-
-	wake_oom_reaper(tsk);
-}
-
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -675,7 +677,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
  * Has to be called with oom_lock held and never after
  * oom has been disabled already.
  */
-void mark_oom_victim(struct task_struct *tsk)
+static void mark_oom_victim(struct task_struct *tsk)
 {
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
@@ -743,6 +745,24 @@ void oom_killer_enable(void)
 }
 
 /*
+ * Try to mark the given task as an OOM victim.
+ *
+ * @tsk: Task to check.
+ *
+ * Needs task_lock(@tsk)/task_unlock(@tsk) unless @tsk == current.
+ */
+bool task_is_reapable(struct task_struct *tsk)
+{
+	if ((fatal_signal_pending(tsk) || (tsk->flags & PF_EXITING)) &&
+	    mm_is_reapable(tsk->mm)) {
+		mark_oom_victim(tsk);
+		wake_oom_reaper(tsk);
+		return true;
+	}
+	return false;
+}
+
+/*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
@@ -757,16 +777,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 * If the task's memory is ready to be OOM-reaped, then don't alarm
+	 * the sysadmin or kill its children or threads, just set TIF_MEMDIE
+	 * so it can die quickly.
 	 */
 	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		try_oom_reaper(p);
+	if (task_is_reapable(p)) {
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -849,22 +867,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		}
+		if (is_global_init(p))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	wake_oom_reaper(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -936,19 +950,12 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * If current's memory is ready to be OOM-reaped, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		mark_oom_victim(current);
-		try_oom_reaper(current);
+	if (task_is_reapable(current))
 		return true;
-	}
 
 	/*
 	 * The OOM killer does not compensate for IO-less reclaim.
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 10:31 ` Tetsuo Handa
@ 2016-05-27 12:23   ` Michal Hocko
  2016-05-27 13:18     ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice Tetsuo Handa
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2016-05-27 12:23 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Fri 27-05-16 19:31:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Hi,
> > I believe that after [1] and this patch we can reasonably expect that
> > the risk of the oom lockups is so low that we do not need to employ
> > timeout based solutions. I am sending this as an RFC because there still
> > might be better ways to accomplish the similar effect. I just like this
> > one because it is nicely grafted into the oom reaper which will now be
> > invoked for basically all oom victims.
> > 
> > [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org
> 
> I still cannot agree with "we do not need to employ timeout based solutions".
> 
> While it is true that OOM-reap is per "struct mm_struct" action, we don't
> need to change user visible oom_score_adj interface by [1] in order to
> enforce OOM-kill being per "struct mm_struct" action.

We want to change the oom_score_adj behavior for the pure consistency I
believe.

[...]

> Yes, commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
> queued for oom_reaper") which went to Linux 4.7-rc1 will clear TIF_MEMDIE and
> decrement task->signal->oom_victims even if __oom_reap_task() cannot reap
> so that oom_scan_process_thread() will not return OOM_SCAN_ABORT forever.
> But still, such unlocking depends on an assumption that wake_oom_reaper() is
> always called.

which is practically the case. The only real exception are use_mm()
users. I want to look at those but I guess they need a special handling.

> What we need to have is "always call wake_oom_reaper() in order to let the
> OOM reaper clear TIF_MEMDIE and mark as no longer OOM-killable" or "ignore
> TIF_MEMDIE after some timeout". As you hate timeout, I propose below patch
> instead of [1] and your "[RFC PATCH] mm, oom_reaper: do not attempt to reap
> a task more than twice".
[...]
> @@ -849,22 +867,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> -			/*
> -			 * We cannot use oom_reaper for the mm shared by this
> -			 * process because it wouldn't get killed and so the
> -			 * memory might be still used.
> -			 */
> -			can_oom_reap = false;
> +		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
> -		}
> +		if (is_global_init(p))
> +			continue;
> +		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +			continue;
> +
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	}
>  	rcu_read_unlock();
>  
> -	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> +	wake_oom_reaper(victim);
>  
>  	mmdrop(mm);
>  	put_task_struct(victim);

So this is the biggest change to my approach. And I think it is
incorrect because you cannot simply reap the memory when you have active
users of that memory potentially. Shared with global init is just non
existant problem. Such a system would be crippled enough to not bother.
But use_mm is potentially real and I believe we should find some way
around it and even not consider such tasks. Fortunately we do not have
many users of use_mm in the kernel and most users will not use them.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice
  2016-05-27 12:23   ` Michal Hocko
@ 2016-05-27 13:18     ` Tetsuo Handa
  2016-05-27 13:35       ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Tetsuo Handa @ 2016-05-27 13:18 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

Michal Hocko wrote:
> On Fri 27-05-16 19:31:19, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Hi,
> > > I believe that after [1] and this patch we can reasonably expect that
> > > the risk of the oom lockups is so low that we do not need to employ
> > > timeout based solutions. I am sending this as an RFC because there still
> > > might be better ways to accomplish the similar effect. I just like this
> > > one because it is nicely grafted into the oom reaper which will now be
> > > invoked for basically all oom victims.
> > > 
> > > [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org
> > 
> > I still cannot agree with "we do not need to employ timeout based solutions".
> > 
> > While it is true that OOM-reap is per "struct mm_struct" action, we don't
> > need to change user visible oom_score_adj interface by [1] in order to
> > enforce OOM-kill being per "struct mm_struct" action.
> 
> We want to change the oom_score_adj behavior for the pure consistency I
> believe.

Is it an agreed conclusion rather than your will? Did userspace developers ack?

> 
> [...]
> 
> > Yes, commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
> > queued for oom_reaper") which went to Linux 4.7-rc1 will clear TIF_MEMDIE and
> > decrement task->signal->oom_victims even if __oom_reap_task() cannot reap
> > so that oom_scan_process_thread() will not return OOM_SCAN_ABORT forever.
> > But still, such unlocking depends on an assumption that wake_oom_reaper() is
> > always called.
> 
> which is practically the case. The only real exception are use_mm()
> users. I want to look at those but I guess they need a special handling.
> 
> > What we need to have is "always call wake_oom_reaper() in order to let the
> > OOM reaper clear TIF_MEMDIE and mark as no longer OOM-killable" or "ignore
> > TIF_MEMDIE after some timeout". As you hate timeout, I propose below patch
> > instead of [1] and your "[RFC PATCH] mm, oom_reaper: do not attempt to reap
> > a task more than twice".
> [...]
> > @@ -849,22 +867,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> > -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > -			/*
> > -			 * We cannot use oom_reaper for the mm shared by this
> > -			 * process because it wouldn't get killed and so the
> > -			 * memory might be still used.
> > -			 */
> > -			can_oom_reap = false;
> > +		if (unlikely(p->flags & PF_KTHREAD))
> >  			continue;
> > -		}
> > +		if (is_global_init(p))
> > +			continue;
> > +		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +			continue;
> > +
> >  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> >  	}
> >  	rcu_read_unlock();
> >  
> > -	if (can_oom_reap)
> > -		wake_oom_reaper(victim);
> > +	wake_oom_reaper(victim);
> >  
> >  	mmdrop(mm);
> >  	put_task_struct(victim);
> 
> So this is the biggest change to my approach. And I think it is
> incorrect because you cannot simply reap the memory when you have active
> users of that memory potentially.

I don't reap the memory when I have active users of that memory potentially.
I do below check. I'm calling wake_oom_reaper() in order to guarantee that
oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims.

@@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto unlock_oom;
 	}

>                                   Shared with global init is just non
> existant problem. Such a system would be crippled enough to not bother.

See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill init
sharing same memory"). My patch simply rolled back to that commit, and
hands over the duty of clearing TIF_MEMDIE and dropping oom_victims to
the OOM reaper's code provided by commit 449d777d7ad6d7f9 ("mm, oom_reaper:
clear TIF_MEMDIE for all tasks queued for oom_reaper").

> But use_mm is potentially real and I believe we should find some way
> around it and even not consider such tasks. Fortunately we do not have
> many users of use_mm in the kernel and most users will not use them.

I don't know why use_mm() becomes a problem because I do above check
just before trying to reap that memory.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice
  2016-05-27 13:18     ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice Tetsuo Handa
@ 2016-05-27 13:35       ` Michal Hocko
  2016-05-27 16:24         ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Tetsuo Handa
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2016-05-27 13:35 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Fri 27-05-16 22:18:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 27-05-16 19:31:19, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Hi,
> > > > I believe that after [1] and this patch we can reasonably expect that
> > > > the risk of the oom lockups is so low that we do not need to employ
> > > > timeout based solutions. I am sending this as an RFC because there still
> > > > might be better ways to accomplish the similar effect. I just like this
> > > > one because it is nicely grafted into the oom reaper which will now be
> > > > invoked for basically all oom victims.
> > > > 
> > > > [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org
> > > 
> > > I still cannot agree with "we do not need to employ timeout based solutions".
> > > 
> > > While it is true that OOM-reap is per "struct mm_struct" action, we don't
> > > need to change user visible oom_score_adj interface by [1] in order to
> > > enforce OOM-kill being per "struct mm_struct" action.
> > 
> > We want to change the oom_score_adj behavior for the pure consistency I
> > believe.
> 
> Is it an agreed conclusion rather than your will? Did userspace developers ack?

If you think this is not the right approach then please comment as a
reply to the patch.

[...]

> > So this is the biggest change to my approach. And I think it is
> > incorrect because you cannot simply reap the memory when you have active
> > users of that memory potentially.
> 
> I don't reap the memory when I have active users of that memory potentially.
> I do below check. I'm calling wake_oom_reaper() in order to guarantee that
> oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims.
> 
> @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  
>  	task_unlock(p);
>  
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
>  		goto unlock_oom;
>  	}

OK, I've missed this part. So you just defer the decision to the oom
reaper while I am trying to solve that at oom_kill_process level.
We could very well do 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bcb6d3b26c94..d9017b8c7300 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * memory might be still used.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, mm->flags);
 			continue;
 		}
 		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)

with the same result. If you _really_ think that this would make a
difference I could live with that. But I am highly skeptical this
matters all that much.

> 
> >                                   Shared with global init is just non
> > existant problem. Such a system would be crippled enough to not bother.
> 
> See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill init
> sharing same memory").

Don't you think that a system where the largest memory consumer is the
global init is crippled terribly?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 13:35       ` Michal Hocko
@ 2016-05-27 16:24         ` Tetsuo Handa
  2016-05-28 12:22           ` Tetsuo Handa
  2016-05-30 11:55           ` Michal Hocko
  0 siblings, 2 replies; 9+ messages in thread
From: Tetsuo Handa @ 2016-05-27 16:24 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

Michal Hocko wrote:
> On Fri 27-05-16 22:18:42, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > So this is the biggest change to my approach. And I think it is
> > > incorrect because you cannot simply reap the memory when you have active
> > > users of that memory potentially.
> > 
> > I don't reap the memory when I have active users of that memory potentially.
> > I do below check. I'm calling wake_oom_reaper() in order to guarantee that
> > oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims.
> > 
> > @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
> >  
> >  	task_unlock(p);
> >  
> > -	if (!down_read_trylock(&mm->mmap_sem)) {
> > +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
> >  		ret = false;
> >  		goto unlock_oom;
> >  	}
> 
> OK, I've missed this part. So you just defer the decision to the oom
> reaper while I am trying to solve that at oom_kill_process level.

Right.

> We could very well do 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index bcb6d3b26c94..d9017b8c7300 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			 * memory might be still used.
>  			 */
>  			can_oom_reap = false;
> +			set_bit(MMF_OOM_REAPED, mm->flags);
>  			continue;
>  		}
>  		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> 
> with the same result. If you _really_ think that this would make a
> difference I could live with that. But I am highly skeptical this
> matters all that much.

It matters a lot. There is a "guarantee" difference.

Maybe this is something like top half and bottom half relationship?
The OOM killer context is the top half and the OOM reaper context is
the bottom half. "The top half always hands over to the bottom half"
can allow the bottom half to recover when something went wrong.
In your approach, the top half might not hand over to the bottom half.

I think the lines needed for the guarantee are something like

	rcu_read_lock();
	for_each_process(p) {
		if (!process_shares_mm(p, mm))
			continue;
		if (same_thread_group(p, victim))
			continue;
		/*
		 * It is not safe to reap memory used by global init or
		 * kernel threads.
		 */
		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
			set_bit(MMF_OOM_REAPED, mm->flags);
			continue;
		}
		/*
		 * Memory used by OOM_SCORE_ADJ_MIN is still OOM reapable
		 * if they are already killed or exiting. Just don't
		 * send SIGKILL.
		 */
		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;

		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
	}
	rcu_read_unlock();

	wake_oom_reaper(victim);

but doing set_bit(MMF_OOM_REAPED, mm->flags) here makes sense?

> > >                                   Shared with global init is just non
> > > existant problem. Such a system would be crippled enough to not bother.
> > 
> > See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill init
> > sharing same memory").
> 
> Don't you think that a system where the largest memory consumer is the
> global init is crippled terribly?

Why not?

PID=1 name=init RSS=100MB
  \
  +-- PID=102 name=child RSS=10MB (completed execve("child") 10 minutes ago)
  \
  +-- PID=103 name=child RSS=10MB (completed execve("child") 7 minutes ago)
  \
  +-- PID=104 name=child RSS=10MB (completed execve("child") 3 minutes ago)
  \
  +-- PID=105 name=init RSS=100MB (about to start execve("child") from vfork())

should be allowed, doesn't it?
There is no reason to exclude vfork()'ed child from OOM victim candidates.
We can't control how people run their userspace programs.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 16:24         ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Tetsuo Handa
@ 2016-05-28 12:22           ` Tetsuo Handa
  2016-05-30 11:57             ` Michal Hocko
  2016-05-30 11:55           ` Michal Hocko
  1 sibling, 1 reply; 9+ messages in thread
From: Tetsuo Handa @ 2016-05-28 12:22 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > We could very well do 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index bcb6d3b26c94..d9017b8c7300 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			 * memory might be still used.
> >  			 */
> >  			can_oom_reap = false;
> > +			set_bit(MMF_OOM_REAPED, mm->flags);
> >  			continue;
> >  		}
> >  		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> > 
> > with the same result. If you _really_ think that this would make a
> > difference I could live with that. But I am highly skeptical this
> > matters all that much.

Usage of set_bit() above and below are both wrong. The mm used by
kernel thread via use_mm() will become OOM reapable after unuse_mm().
Thus, setting MMF_OOM_REAPED is a mistake as with MMF_OOM_KILLED
( http://lkml.kernel.org/r/201603152015.JAE86937.VFOLtQFOFJOSHM@I-love.SAKURA.ne.jp ).

> I think the lines needed for the guarantee are something like
> 
> 	rcu_read_lock();
> 	for_each_process(p) {
> 		if (!process_shares_mm(p, mm))
> 			continue;
> 		if (same_thread_group(p, victim))
> 			continue;
> 		/*
> 		 * It is not safe to reap memory used by global init or
> 		 * kernel threads.
> 		 */
> 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
> 			set_bit(MMF_OOM_REAPED, mm->flags);
> 			continue;
> 		}
> 		/*
> 		 * Memory used by OOM_SCORE_ADJ_MIN is still OOM reapable
> 		 * if they are already killed or exiting. Just don't
> 		 * send SIGKILL.
> 		 */
> 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> 			continue;
> 
> 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 	}
> 	rcu_read_unlock();
> 
> 	wake_oom_reaper(victim);
> 
> but doing set_bit(MMF_OOM_REAPED, mm->flags) here makes sense?



I also realized that my

	if (task_is_reapable(current))
		return true;

is wrong. task_is_reapable() depends on all threads using current->mm are
dying or exiting, but select_bad_process() (which is needed for calling
mark_oom_victim() from oom_kill_process() after oom_badness() > 0 by
oom_scan_process_thread() returning OOM_SCAN_OK) depends on there is no
TIF_MEMDIE thread.

If there is a TIF_MEMDIE thread, current thread which will (as of Linux 4.6)
be able to get TIF_MEMDIE by

  fatal_signal_pending(current) || ((current->flags & PF_EXITING) && !(current->signal->flags & SIGNAL_GROUP_COREDUMP))

condition will fail to get TIF_MEMDIE because oom_scan_process_thread() will
return OOM_SCAN_ABORT. The logic of setting TIF_MEMDIE to only one thread

	/*
	 * Kill all user processes sharing victim->mm in other thread groups, if
	 * any.  They don't get access to memory reserves, though, to avoid
	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
	 * oom killed thread cannot exit because it requires the semaphore and
	 * its contended by another thread trying to allocate memory itself.
	 * That thread will now get access to memory reserves since it has a
	 * pending fatal signal.
	 */

does not allow the shortcuts to require that current->mm is reapable.

It seems to me that your "[PATCH 6/6] mm, oom: fortify task_will_free_mem"
expects that current->mm is reapable as well as my patch.
If so, [PATCH 6/6] will not work.

+static inline bool task_will_free_mem(struct task_struct *task)
+{
(...snipped...)
+		rcu_read_lock();
+		for_each_process(p) {
+			bool vfork;
+
+			/*
+			 * skip over vforked tasks because they are mostly
+			 * independent and will drop the mm soon
+			 */
+			task_lock(p);
+			vfork = p->vfork_done;
+			task_unlock(p);
+			if (vfork)
+				continue;
+
+			ret = __task_will_free_mem(p);
+			if (!ret)
+				break;
+		}
+		rcu_read_unlock();
(...snipped...)
+}

@@ -945,14 +894,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-28 12:22           ` Tetsuo Handa
@ 2016-05-30 11:57             ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2016-05-30 11:57 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Sat 28-05-16 21:22:08, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > We could very well do 
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index bcb6d3b26c94..d9017b8c7300 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >  			 * memory might be still used.
> > >  			 */
> > >  			can_oom_reap = false;
> > > +			set_bit(MMF_OOM_REAPED, mm->flags);
> > >  			continue;
> > >  		}
> > >  		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> > > 
> > > with the same result. If you _really_ think that this would make a
> > > difference I could live with that. But I am highly skeptical this
> > > matters all that much.
> 
> Usage of set_bit() above and below are both wrong. The mm used by
> kernel thread via use_mm() will become OOM reapable after unuse_mm().

Please note that all other holders of that mm are gone by that time. So
unuse_mm will simply drop the last reference of the mm and do the
remaining clean up. There is no real reason this mm should be around and
visible by the oom killer.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 16:24         ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Tetsuo Handa
  2016-05-28 12:22           ` Tetsuo Handa
@ 2016-05-30 11:55           ` Michal Hocko
  1 sibling, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2016-05-30 11:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Sat 28-05-16 01:24:51, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 27-05-16 22:18:42, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > So this is the biggest change to my approach. And I think it is
> > > > incorrect because you cannot simply reap the memory when you have active
> > > > users of that memory potentially.
> > > 
> > > I don't reap the memory when I have active users of that memory potentially.
> > > I do below check. I'm calling wake_oom_reaper() in order to guarantee that
> > > oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims.
> > > 
> > > @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
> > >  
> > >  	task_unlock(p);
> > >  
> > > -	if (!down_read_trylock(&mm->mmap_sem)) {
> > > +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
> > >  		ret = false;
> > >  		goto unlock_oom;
> > >  	}
> > 
> > OK, I've missed this part. So you just defer the decision to the oom
> > reaper while I am trying to solve that at oom_kill_process level.
> 
> Right.
> 
> > We could very well do 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index bcb6d3b26c94..d9017b8c7300 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			 * memory might be still used.
> >  			 */
> >  			can_oom_reap = false;
> > +			set_bit(MMF_OOM_REAPED, mm->flags);
> >  			continue;
> >  		}
> >  		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> > 
> > with the same result. If you _really_ think that this would make a
> > difference I could live with that. But I am highly skeptical this
> > matters all that much.
> 
> It matters a lot. There is a "guarantee" difference.
> 
> Maybe this is something like top half and bottom half relationship?
> The OOM killer context is the top half and the OOM reaper context is
> the bottom half. "The top half always hands over to the bottom half"
> can allow the bottom half to recover when something went wrong.
> In your approach, the top half might not hand over to the bottom half.

But your bottom half would just decide to back off the same way I do
here. And as for the bonus your bottom half would have to do the rather
costly process iteration to find that out.

[...]
> > > >                                   Shared with global init is just non
> > > > existant problem. Such a system would be crippled enough to not bother.
> > > 
> > > See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill init
> > > sharing same memory").
> > 
> > Don't you think that a system where the largest memory consumer is the
> > global init is crippled terribly?
> 
> Why not?
> 
> PID=1 name=init RSS=100MB
>   \
>   +-- PID=102 name=child RSS=10MB (completed execve("child") 10 minutes ago)
>   \
>   +-- PID=103 name=child RSS=10MB (completed execve("child") 7 minutes ago)
>   \
>   +-- PID=104 name=child RSS=10MB (completed execve("child") 3 minutes ago)
>   \
>   +-- PID=105 name=init RSS=100MB (about to start execve("child") from vfork())
> 
> should be allowed, doesn't it?

Killing the vforked task doesn't make any sense because it will be quite
unlikely to free any memory. We should select from the any other tasks
which have completed their vfork.

> There is no reason to exclude vfork()'ed child from OOM victim candidates.
> We can't control how people run their userspace programs.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-05-30 11:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-26 15:27 [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
2016-05-27 10:31 ` Tetsuo Handa
2016-05-27 12:23   ` Michal Hocko
2016-05-27 13:18     ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice Tetsuo Handa
2016-05-27 13:35       ` Michal Hocko
2016-05-27 16:24         ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Tetsuo Handa
2016-05-28 12:22           ` Tetsuo Handa
2016-05-30 11:57             ` Michal Hocko
2016-05-30 11:55           ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox