* [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
@ 2016-05-26 15:27 Michal Hocko
  2016-05-27 10:31 ` Tetsuo Handa
  0 siblings, 1 reply; 9+ messages in thread

From: Michal Hocko @ 2016-05-26 15:27 UTC (permalink / raw)
To: linux-mm; +Cc: Tetsuo Handa, David Rientjes, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The oom_reaper relies on taking the mmap_sem for read to do its job. Many
places which might block readers have been converted to use
down_write_killable, and that has reduced the chances of contention a lot.
Still, some paths where the mmap_sem is held for write can take other locks,
and they might either not be prepared to fail due to a pending fatal signal
or be too impractical to change.

This patch introduces the MMF_OOM_NOT_REAPABLE flag, which gets set after
the first attempt to reap a task's mm fails. If the flag is already present
after a failure, we set MMF_OOM_REAPED to hide this mm from the oom killer
completely, so it can go and choose another victim. As a result, the risk
of an OOM deadlock, where the oom victim is blocked indefinitely and the
oom killer cannot make any progress, should be mitigated considerably,
while we still try really hard to perform all reclaim attempts and stay
predictable in behavior.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
I believe that after [1] and this patch we can reasonably expect that
the risk of oom lockups is so low that we do not need to employ
timeout based solutions. I am sending this as an RFC because there still
might be better ways to accomplish a similar effect. I just like this
one because it is nicely grafted into the oom reaper, which will now be
invoked for basically all oom victims.
[1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org

 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec636400669f..12a4c8e04e6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
+#define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..2f82f2a558e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -531,8 +531,27 @@ static void oom_reap_task(struct task_struct *tsk)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+
 		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 				task_pid_nr(tsk), tsk->comm);
+
+		/*
+		 * If we've already tried to reap this task in the past and
+		 * failed it probably doesn't make much sense to try yet again
+		 * so hide the mm from the oom killer so that it can move on
+		 * to another task with a different mm struct.
+		 */
+		p = find_lock_task_mm(tsk);
+		if (p) {
+			if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &p->mm->flags)) {
+				pr_info("oom_reaper: giving up pid:%d (%s)\n",
+						task_pid_nr(tsk), tsk->comm);
+				set_bit(MMF_OOM_REAPED, &p->mm->flags);
+			}
+			task_unlock(p);
+		}
+
 		debug_show_all_locks();
 	}
 
-- 
2.8.1
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-26 15:27 [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
@ 2016-05-27 10:31 ` Tetsuo Handa
  2016-05-27 12:23   ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread

From: Tetsuo Handa @ 2016-05-27 10:31 UTC (permalink / raw)
To: mhocko, linux-mm; +Cc: rientjes, akpm, oleg, vdavydov, mhocko

Michal Hocko wrote:
> Hi,
> I believe that after [1] and this patch we can reasonably expect that
> the risk of oom lockups is so low that we do not need to employ
> timeout based solutions. I am sending this as an RFC because there still
> might be better ways to accomplish a similar effect. I just like this
> one because it is nicely grafted into the oom reaper, which will now be
> invoked for basically all oom victims.
>
> [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org

I still cannot agree with "we do not need to employ timeout based solutions".

While it is true that OOM-reap is per "struct mm_struct" action, we don't
need to change the user visible oom_score_adj interface by [1] in order to
enforce OOM-kill being per "struct mm_struct" action.

It is possible that a victim thread releases a lot of memory by closing pipe
file descriptors at exit_files() (before exit_task_work(), which is between
exit_mm() and exit_notify()). The problem is that there is no trigger for
giving up (e.g. a timeout) when that optimistic expectation fails. As long
as we wake up the OOM reaper, we can use the OOM reaper as a trigger for
giving up, and we can perfectly avoid OOM lockups as long as the OOM killer
is invoked.

We were too focused on making sure that a TIF_MEMDIE thread does not get
stuck at down_read() in exit_mm(). We forgot about the "tsk->mm = NULL ->
exit_aio() etc. by mmput() -> exit_oom_victim()" sequence, where exit_aio()
can be blocked on I/O which involves memory allocation,
oom_scan_process_thread() returns OOM_SCAN_ABORT due to
task->signal->oom_victims > 0, and __oom_reap_task() cannot reap due to
find_lock_task_mm() returning NULL and/or mm->mm_users already being 0.

Yes, commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
queued for oom_reaper") which went into Linux 4.7-rc1 will clear TIF_MEMDIE
and decrement task->signal->oom_victims even if __oom_reap_task() cannot
reap, so that oom_scan_process_thread() will not return OOM_SCAN_ABORT
forever. But still, such unlocking depends on the assumption that
wake_oom_reaper() is always called.

What we need to have is "always call wake_oom_reaper() in order to let the
OOM reaper clear TIF_MEMDIE and mark the task as no longer OOM-killable" or
"ignore TIF_MEMDIE after some timeout". As you hate timeouts, I propose the
patch below instead of [1] and your "[RFC PATCH] mm, oom_reaper: do not
attempt to reap a task more than twice".
----------
>From a6b9f155e99971ef6144583a9ca1f427b0a85df8 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 27 May 2016 15:59:29 +0900
Subject: [PATCH] mm,oom: Make sure the OOM killer always makes forward progress

This patch handles the problems listed below.

(1) Incomplete TIF_MEMDIE shortcuts which may cause OOM livelock due to
    failing to wake up the OOM reaper.

(2) Infinite retry loop which may cause OOM livelock because the OOM
    killer continues selecting the same victim forever.

(3) MM shared by unkillable tasks which may cause OOM livelock because
    the OOM killer cannot wake up the OOM reaper.

The core of this patch is mm_is_reapable(), which examines whether all
threads using a given mm are dying. While mm_is_reapable() is a costly and
slow operation, that is true only until the mm gets MMF_OOM_REAPABLE.
It is likely that once oom_kill_process() selects an OOM victim, subsequent mm_is_reapable() calls (from __oom_reap_task() by the OOM reaper and/or out_of_memory()/mem_cgroup_out_of_memory() shortcuts by any threads sharing that OOM victim's mm) find MMF_OOM_REAPABLE. There are two exceptions which this patch does not address. One is that it is theoretically possible that sending SIGKILL to oom_task_origin() task can get stuck, as explained as http://lkml.kernel.org/r/1463796090-7948-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp . The other is that it is theoretically possible that current thread which got TIF_MEMDIE using out_of_memory()/mem_cgroup_out_of_memory() shortcuts gets stuck due to unable to satisfy __GFP_NOFAIL allocation request (and therefore needs to select next OOM victim). We will be able to avoid such infinite retry loop using per "struct task_struct" atomic flags. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> --- include/linux/oom.h | 32 +---------- include/linux/sched.h | 1 + mm/memcontrol.c | 7 +-- mm/oom_kill.c | 153 ++++++++++++++++++++++++++------------------------ 4 files changed, 84 insertions(+), 109 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 8346952..3f4453f 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -67,15 +67,7 @@ static inline bool oom_task_origin(const struct task_struct *p) return p->signal->oom_flag_origin; } -extern void mark_oom_victim(struct task_struct *tsk); - -#ifdef CONFIG_MMU -extern void try_oom_reaper(struct task_struct *tsk); -#else -static inline void try_oom_reaper(struct task_struct *tsk) -{ -} -#endif +extern bool task_is_reapable(struct task_struct *tsk); extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, @@ -105,28 +97,6 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); -static inline bool task_will_free_mem(struct 
task_struct *task) -{ - struct signal_struct *sig = task->signal; - - /* - * A coredumping process may sleep for an extended period in exit_mm(), - * so the oom killer cannot assume that the process will promptly exit - * and release memory. - */ - if (sig->flags & SIGNAL_GROUP_COREDUMP) - return false; - - if (!(task->flags & PF_EXITING)) - return false; - - /* Make sure that the whole thread group is going down */ - if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT)) - return false; - - return true; -} - /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/include/linux/sched.h b/include/linux/sched.h index b3e84ac..d748f16 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -523,6 +523,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_HAS_UPROBES 19 /* has uprobes */ #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */ #define MMF_OOM_REAPED 21 /* mm has been already reaped */ +#define MMF_OOM_REAPABLE 22 /* mm is ready to be reaped */ #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 37ba604..aed588e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1271,15 +1271,12 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, mutex_lock(&oom_lock); /* - * If current has a pending SIGKILL or is exiting, then automatically + * If current's memory is ready to be OOM-reaped, then automatically * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. */ - if (fatal_signal_pending(current) || task_will_free_mem(current)) { - mark_oom_victim(current); - try_oom_reaper(current); + if (task_is_reapable(current)) goto unlock; - } check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg); totalpages = mem_cgroup_get_limit(memcg) ? 
: 1; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 1685890..95fce47 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -428,6 +428,50 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) return false; } +bool mm_is_reapable(struct mm_struct *mm) +{ + struct task_struct *p; + + if (!mm) + return false; + if (test_bit(MMF_OOM_REAPABLE, &mm->flags)) + return true; + if (!down_read_trylock(&mm->mmap_sem)) + return false; + up_read(&mm->mmap_sem); + /* + * There might be other threads/processes which are either not + * dying or even not killable. + */ + if (atomic_read(&mm->mm_users) > 1) { + rcu_read_lock(); + for_each_process(p) { + bool exiting; + + if (!process_shares_mm(p, mm)) + continue; + if (fatal_signal_pending(p)) + continue; + + /* + * If the task is exiting make sure the whole thread + * group is exiting and cannot access mm anymore. + */ + spin_lock_irq(&p->sighand->siglock); + exiting = signal_group_exit(p->signal); + spin_unlock_irq(&p->sighand->siglock); + if (exiting) + continue; + + /* Give up */ + rcu_read_unlock(); + return false; + } + rcu_read_unlock(); + } + set_bit(MMF_OOM_REAPABLE, &mm->flags); + return true; +} #ifdef CONFIG_MMU /* @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk) task_unlock(p); - if (!down_read_trylock(&mm->mmap_sem)) { + if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) { ret = false; goto unlock_oom; } @@ -553,6 +597,9 @@ static void oom_reap_task(struct task_struct *tsk) debug_show_all_locks(); } + /* Do not allow the OOM killer to select this thread group again. */ + tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN; + /* * Clear TIF_MEMDIE because the task shouldn't be sitting on a * reasonably reclaimable memory anymore or it is not a good candidate @@ -606,51 +653,6 @@ static void wake_oom_reaper(struct task_struct *tsk) wake_up(&oom_reaper_wait); } -/* Check if we can reap the given task. 
This has to be called with stable - * tsk->mm - */ -void try_oom_reaper(struct task_struct *tsk) -{ - struct mm_struct *mm = tsk->mm; - struct task_struct *p; - - if (!mm) - return; - - /* - * There might be other threads/processes which are either not - * dying or even not killable. - */ - if (atomic_read(&mm->mm_users) > 1) { - rcu_read_lock(); - for_each_process(p) { - bool exiting; - - if (!process_shares_mm(p, mm)) - continue; - if (fatal_signal_pending(p)) - continue; - - /* - * If the task is exiting make sure the whole thread group - * is exiting and cannot acces mm anymore. - */ - spin_lock_irq(&p->sighand->siglock); - exiting = signal_group_exit(p->signal); - spin_unlock_irq(&p->sighand->siglock); - if (exiting) - continue; - - /* Give up */ - rcu_read_unlock(); - return; - } - rcu_read_unlock(); - } - - wake_oom_reaper(tsk); -} - static int __init oom_init(void) { oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper"); @@ -675,7 +677,7 @@ static void wake_oom_reaper(struct task_struct *tsk) * Has to be called with oom_lock held and never after * oom has been disabled already. */ -void mark_oom_victim(struct task_struct *tsk) +static void mark_oom_victim(struct task_struct *tsk) { WARN_ON(oom_killer_disabled); /* OOM killer might race with memcg OOM */ @@ -743,6 +745,24 @@ void oom_killer_enable(void) } /* + * Try to mark the given task as an OOM victim. + * + * @tsk: Task to check. + * + * Needs task_lock(@tsk)/task_unlock(@tsk) unless @tsk == current. + */ +bool task_is_reapable(struct task_struct *tsk) +{ + if ((fatal_signal_pending(tsk) || (tsk->flags & PF_EXITING)) && + mm_is_reapable(tsk->mm)) { + mark_oom_victim(tsk); + wake_oom_reaper(tsk); + return true; + } + return false; +} + +/* * Must be called while holding a reference to p, which will be released upon * returning. 
*/ @@ -757,16 +777,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, unsigned int victim_points = 0; static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); - bool can_oom_reap = true; /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just set TIF_MEMDIE so it can die quickly + * If the task's memory is ready to be OOM-reaped, then don't alarm + * the sysadmin or kill its children or threads, just set TIF_MEMDIE + * so it can die quickly. */ task_lock(p); - if (p->mm && task_will_free_mem(p)) { - mark_oom_victim(p); - try_oom_reaper(p); + if (task_is_reapable(p)) { task_unlock(p); put_task_struct(p); return; @@ -849,22 +867,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p, continue; if (same_thread_group(p, victim)) continue; - if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) || - p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { - /* - * We cannot use oom_reaper for the mm shared by this - * process because it wouldn't get killed and so the - * memory might be still used. - */ - can_oom_reap = false; + if (unlikely(p->flags & PF_KTHREAD)) continue; - } + if (is_global_init(p)) + continue; + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) + continue; + do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); } rcu_read_unlock(); - if (can_oom_reap) - wake_oom_reaper(victim); + wake_oom_reaper(victim); mmdrop(mm); put_task_struct(victim); @@ -936,19 +950,12 @@ bool out_of_memory(struct oom_control *oc) return true; /* - * If current has a pending SIGKILL or is exiting, then automatically + * If current's memory is ready to be OOM-reaped, then automatically * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. - * - * But don't select if current has already released its mm and cleared - * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. 
*/ - if (current->mm && - (fatal_signal_pending(current) || task_will_free_mem(current))) { - mark_oom_victim(current); - try_oom_reaper(current); + if (task_is_reapable(current)) return true; - } /* * The OOM killer does not compensate for IO-less reclaim. -- 1.8.3.1
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 10:31 ` Tetsuo Handa
@ 2016-05-27 12:23   ` Michal Hocko
  2016-05-27 13:18     ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice Tetsuo Handa
  0 siblings, 1 reply; 9+ messages in thread

From: Michal Hocko @ 2016-05-27 12:23 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Fri 27-05-16 19:31:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Hi,
> > I believe that after [1] and this patch we can reasonably expect that
> > the risk of oom lockups is so low that we do not need to employ
> > timeout based solutions. I am sending this as an RFC because there still
> > might be better ways to accomplish a similar effect. I just like this
> > one because it is nicely grafted into the oom reaper, which will now be
> > invoked for basically all oom victims.
> >
> > [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org
>
> I still cannot agree with "we do not need to employ timeout based solutions".
>
> While it is true that OOM-reap is per "struct mm_struct" action, we don't
> need to change the user visible oom_score_adj interface by [1] in order to
> enforce OOM-kill being per "struct mm_struct" action.

We want to change the oom_score_adj behavior for pure consistency, I
believe.

[...]

> Yes, commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
> queued for oom_reaper") which went into Linux 4.7-rc1 will clear TIF_MEMDIE
> and decrement task->signal->oom_victims even if __oom_reap_task() cannot
> reap, so that oom_scan_process_thread() will not return OOM_SCAN_ABORT
> forever. But still, such unlocking depends on the assumption that
> wake_oom_reaper() is always called.

That is practically the case. The only real exception is use_mm() users.
I want to look at those, but I guess they need special handling.
> What we need to have is "always call wake_oom_reaper() in order to let the
> OOM reaper clear TIF_MEMDIE and mark the task as no longer OOM-killable" or
> "ignore TIF_MEMDIE after some timeout". As you hate timeouts, I propose the
> patch below instead of [1] and your "[RFC PATCH] mm, oom_reaper: do not
> attempt to reap a task more than twice".

[...]

> @@ -849,22 +867,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> 			continue;
> 		if (same_thread_group(p, victim))
> 			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> -			/*
> -			 * We cannot use oom_reaper for the mm shared by this
> -			 * process because it wouldn't get killed and so the
> -			 * memory might be still used.
> -			 */
> -			can_oom_reap = false;
> +		if (unlikely(p->flags & PF_KTHREAD))
> 			continue;
> -		}
> +		if (is_global_init(p))
> +			continue;
> +		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +			continue;
> +
> 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 	}
> 	rcu_read_unlock();
>
> -	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> +	wake_oom_reaper(victim);
>
> 	mmdrop(mm);
> 	put_task_struct(victim);

So this is the biggest change to my approach. And I think it is incorrect,
because you cannot simply reap the memory when you potentially have active
users of that memory. Shared with global init is just a nonexistent problem;
such a system would be crippled enough to not bother. But use_mm is
potentially real and I believe we should find some way around it, perhaps
by not considering such tasks at all. Fortunately we do not have many users
of use_mm in the kernel and most users will not use them.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice
  2016-05-27 12:23 ` Michal Hocko
@ 2016-05-27 13:18   ` Tetsuo Handa
  2016-05-27 13:35     ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread

From: Tetsuo Handa @ 2016-05-27 13:18 UTC (permalink / raw)
To: mhocko; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

Michal Hocko wrote:
> On Fri 27-05-16 19:31:19, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Hi,
> > > I believe that after [1] and this patch we can reasonably expect that
> > > the risk of oom lockups is so low that we do not need to employ
> > > timeout based solutions. I am sending this as an RFC because there still
> > > might be better ways to accomplish a similar effect. I just like this
> > > one because it is nicely grafted into the oom reaper, which will now be
> > > invoked for basically all oom victims.
> > >
> > > [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org
> >
> > I still cannot agree with "we do not need to employ timeout based solutions".
> >
> > While it is true that OOM-reap is per "struct mm_struct" action, we don't
> > need to change the user visible oom_score_adj interface by [1] in order to
> > enforce OOM-kill being per "struct mm_struct" action.
>
> We want to change the oom_score_adj behavior for pure consistency, I
> believe.

Is it an agreed conclusion rather than your will? Did userspace developers
ack?

> [...]
>
> > Yes, commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks
> > queued for oom_reaper") which went into Linux 4.7-rc1 will clear TIF_MEMDIE
> > and decrement task->signal->oom_victims even if __oom_reap_task() cannot
> > reap, so that oom_scan_process_thread() will not return OOM_SCAN_ABORT
> > forever. But still, such unlocking depends on the assumption that
> > wake_oom_reaper() is always called.
>
> That is practically the case. The only real exception is use_mm() users.
> I want to look at those, but I guess they need special handling.
>
> > What we need to have is "always call wake_oom_reaper() in order to let the
> > OOM reaper clear TIF_MEMDIE and mark the task as no longer OOM-killable" or
> > "ignore TIF_MEMDIE after some timeout". As you hate timeouts, I propose the
> > patch below instead of [1] and your "[RFC PATCH] mm, oom_reaper: do not
> > attempt to reap a task more than twice".
> [...]
> > @@ -849,22 +867,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > 			continue;
> > 		if (same_thread_group(p, victim))
> > 			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> > -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > -			/*
> > -			 * We cannot use oom_reaper for the mm shared by this
> > -			 * process because it wouldn't get killed and so the
> > -			 * memory might be still used.
> > -			 */
> > -			can_oom_reap = false;
> > +		if (unlikely(p->flags & PF_KTHREAD))
> > 			continue;
> > -		}
> > +		if (is_global_init(p))
> > +			continue;
> > +		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +			continue;
> > +
> > 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> > 	}
> > 	rcu_read_unlock();
> >
> > -	if (can_oom_reap)
> > -		wake_oom_reaper(victim);
> > +	wake_oom_reaper(victim);
> >
> > 	mmdrop(mm);
> > 	put_task_struct(victim);
>
> So this is the biggest change to my approach. And I think it is incorrect,
> because you cannot simply reap the memory when you potentially have active
> users of that memory.

I don't reap the memory when I potentially have active users of that memory.
I do the check below. I'm calling wake_oom_reaper() in order to guarantee
that oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims.

@@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto unlock_oom;
 	}

> Shared with global init is just a nonexistent problem. Such a system
> would be crippled enough to not bother.

See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill init
sharing same memory"). My patch simply rolls back to that commit and hands
over the duty of clearing TIF_MEMDIE and dropping oom_victims to the OOM
reaper's code provided by commit 449d777d7ad6d7f9 ("mm, oom_reaper: clear
TIF_MEMDIE for all tasks queued for oom_reaper").

> But use_mm is potentially real and I believe we should find some way
> around it, perhaps by not considering such tasks at all. Fortunately we
> do not have many users of use_mm in the kernel and most users will not
> use them.

I don't know why use_mm() becomes a problem, because I do the above check
just before trying to reap that memory.
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice
  2016-05-27 13:18 ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task morethan twice Tetsuo Handa
@ 2016-05-27 13:35 ` Michal Hocko
  2016-05-27 16:24   ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Tetsuo Handa
  0 siblings, 1 reply; 9+ messages in thread

From: Michal Hocko @ 2016-05-27 13:35 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Fri 27-05-16 22:18:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 27-05-16 19:31:19, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > Hi,
> > > > I believe that after [1] and this patch we can reasonably expect that
> > > > the risk of oom lockups is so low that we do not need to employ
> > > > timeout based solutions. I am sending this as an RFC because there still
> > > > might be better ways to accomplish a similar effect. I just like this
> > > > one because it is nicely grafted into the oom reaper, which will now be
> > > > invoked for basically all oom victims.
> > > >
> > > > [1] http://lkml.kernel.org/r/1464266415-15558-1-git-send-email-mhocko@kernel.org
> > >
> > > I still cannot agree with "we do not need to employ timeout based solutions".
> > >
> > > While it is true that OOM-reap is per "struct mm_struct" action, we don't
> > > need to change the user visible oom_score_adj interface by [1] in order to
> > > enforce OOM-kill being per "struct mm_struct" action.
> >
> > We want to change the oom_score_adj behavior for pure consistency, I
> > believe.
>
> Is it an agreed conclusion rather than your will? Did userspace developers
> ack?

If you think this is not the right approach then please comment as a reply
to the patch.

[...]

> > So this is the biggest change to my approach. And I think it is incorrect,
> > because you cannot simply reap the memory when you potentially have active
> > users of that memory.
> I don't reap the memory when I potentially have active users of that memory.
> I do the check below. I'm calling wake_oom_reaper() in order to guarantee
> that oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims.
>
> @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>
> 	task_unlock(p);
>
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
> 		ret = false;
> 		goto unlock_oom;
> 	}

OK, I've missed this part. So you just defer the decision to the oom
reaper, while I am trying to solve that at the oom_kill_process level. We
could very well do

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bcb6d3b26c94..d9017b8c7300 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * memory might be still used.
 			 */
 			can_oom_reap = false;
+			set_bit(MMF_OOM_REAPED, &mm->flags);
 			continue;
 		}
 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)

with the same result. If you _really_ think that this would make a
difference I could live with that. But I am highly skeptical it matters
all that much.

> > Shared with global init is just a nonexistent problem. Such a system
> > would be crippled enough to not bother.
>
> See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill init
> sharing same memory").

Don't you think that a system where the largest memory consumer is the
global init is crippled terribly?
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 13:35 ` Michal Hocko
@ 2016-05-27 16:24   ` Tetsuo Handa
  2016-05-28 12:22     ` Tetsuo Handa
  2016-05-30 11:55     ` Michal Hocko
  0 siblings, 2 replies; 9+ messages in thread

From: Tetsuo Handa @ 2016-05-27 16:24 UTC (permalink / raw)
To: mhocko; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

Michal Hocko wrote:
> On Fri 27-05-16 22:18:42, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > So this is the biggest change to my approach. And I think it is incorrect,
> > > because you cannot simply reap the memory when you potentially have active
> > > users of that memory.
> >
> > I don't reap the memory when I potentially have active users of that memory.
> > I do the check below. I'm calling wake_oom_reaper() in order to guarantee
> > that oom_reap_task() shall clear TIF_MEMDIE and drop oom_victims.
> >
> > @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
> >
> > 	task_unlock(p);
> >
> > -	if (!down_read_trylock(&mm->mmap_sem)) {
> > +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
> > 		ret = false;
> > 		goto unlock_oom;
> > 	}
>
> OK, I've missed this part. So you just defer the decision to the oom
> reaper, while I am trying to solve that at the oom_kill_process level.

Right.

> We could very well do
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index bcb6d3b26c94..d9017b8c7300 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			 * memory might be still used.
>  			 */
>  			can_oom_reap = false;
> +			set_bit(MMF_OOM_REAPED, &mm->flags);
>  			continue;
>  		}
>  		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
>
> with the same result. If you _really_ think that this would make a
> difference I could live with that. But I am highly skeptical it matters
> all that much.

It matters a lot. There is a "guarantee" difference. Maybe this is
something like a top-half and bottom-half relationship?
The OOM killer context is the top half and the OOM reaper context is
the bottom half. "The top half always hands over to the bottom half"
allows the bottom half to recover when something goes wrong.
In your approach, the top half might not hand over to the bottom half.

I think the lines needed for the guarantee are something like

	rcu_read_lock();
	for_each_process(p) {
		if (!process_shares_mm(p, mm))
			continue;
		if (same_thread_group(p, victim))
			continue;
		/*
		 * It is not safe to reap memory used by global init or
		 * kernel threads.
		 */
		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
			set_bit(MMF_OOM_REAPED, mm->flags);
			continue;
		}
		/*
		 * Memory used by OOM_SCORE_ADJ_MIN is still OOM reapable
		 * if they are already killed or exiting. Just don't
		 * send SIGKILL.
		 */
		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;

		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
	}
	rcu_read_unlock();

	wake_oom_reaper(victim);

but does doing set_bit(MMF_OOM_REAPED, mm->flags) here make sense?

> > > Shared with global init is just a non-existent problem. Such a
> > > system would be crippled enough to not bother.
> >
> > See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill
> > init sharing same memory").
>
> Don't you think that a system where the largest memory consumer is the
> global init is crippled terribly?

Why not?

  PID=1 name=init RSS=100MB
   \
    +-- PID=102 name=child RSS=10MB (completed execve("child") 10 minutes ago)
   \
    +-- PID=103 name=child RSS=10MB (completed execve("child") 7 minutes ago)
   \
    +-- PID=104 name=child RSS=10MB (completed execve("child") 3 minutes ago)
   \
    +-- PID=105 name=init RSS=100MB (about to start execve("child") from vfork())

should be allowed, shouldn't it? There is no reason to exclude a
vfork()'ed child from the OOM victim candidates. We can't control how
people run their userspace programs.
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 16:24         ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Tetsuo Handa
@ 2016-05-28 12:22           ` Tetsuo Handa
  2016-05-30 11:57             ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Tetsuo Handa @ 2016-05-28 12:22 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > We could very well do
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index bcb6d3b26c94..d9017b8c7300 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			 * memory might be still used.
> >  			 */
> >  			can_oom_reap = false;
> > +			set_bit(MMF_OOM_REAPED, mm->flags);
> >  			continue;
> >  		}
> >  		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> >
> > with the same result. If you _really_ think that this would make a
> > difference I could live with that. But I am highly skeptical this
> > matters all that much.

The set_bit() usages above and below are both wrong. An mm used by a
kernel thread via use_mm() becomes OOM reapable again after unuse_mm().
Thus, setting MMF_OOM_REAPED there is the same mistake as MMF_OOM_KILLED
( http://lkml.kernel.org/r/201603152015.JAE86937.VFOLtQFOFJOSHM@I-love.SAKURA.ne.jp ).

> I think the lines needed for the guarantee are something like
>
> 	rcu_read_lock();
> 	for_each_process(p) {
> 		if (!process_shares_mm(p, mm))
> 			continue;
> 		if (same_thread_group(p, victim))
> 			continue;
> 		/*
> 		 * It is not safe to reap memory used by global init or
> 		 * kernel threads.
> 		 */
> 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
> 			set_bit(MMF_OOM_REAPED, mm->flags);
> 			continue;
> 		}
> 		/*
> 		 * Memory used by OOM_SCORE_ADJ_MIN is still OOM reapable
> 		 * if they are already killed or exiting. Just don't
> 		 * send SIGKILL.
> 		 */
> 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> 			continue;
>
> 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 	}
> 	rcu_read_unlock();
>
> 	wake_oom_reaper(victim);
>
> but does doing set_bit(MMF_OOM_REAPED, mm->flags) here make sense?

I also realized that my

	if (task_is_reapable(current))
		return true;

is wrong. task_is_reapable() depends on all threads using current->mm
being dying or exiting, but select_bad_process() (which is needed for
calling mark_oom_victim() from oom_kill_process() after oom_badness() > 0
via oom_scan_process_thread() returning OOM_SCAN_OK) depends on there
being no TIF_MEMDIE thread. If there is a TIF_MEMDIE thread, a current
thread which would (as of Linux 4.6) be able to get TIF_MEMDIE via the

	fatal_signal_pending(current) ||
	((current->flags & PF_EXITING) &&
	 !(current->signal->flags & SIGNAL_GROUP_COREDUMP))

condition will fail to get TIF_MEMDIE because oom_scan_process_thread()
will return OOM_SCAN_ABORT.

The logic of setting TIF_MEMDIE on only one thread

	/*
	 * Kill all user processes sharing victim->mm in other thread groups, if
	 * any. They don't get access to memory reserves, though, to avoid
	 * depletion of all memory. This prevents mm->mmap_sem livelock when an
	 * oom killed thread cannot exit because it requires the semaphore and
	 * its contended by another thread trying to allocate memory itself.
	 * That thread will now get access to memory reserves since it has a
	 * pending fatal signal.
	 */

does not allow the shortcuts to require that current->mm is reapable.
It seems to me that your "[PATCH 6/6] mm, oom: fortify task_will_free_mem"
expects current->mm to be reapable, just as my patch does. If so,
[PATCH 6/6] will not work.

+static inline bool task_will_free_mem(struct task_struct *task)
+{
(...snipped...)
+	rcu_read_lock();
+	for_each_process(p) {
+		bool vfork;
+
+		/*
+		 * skip over vforked tasks because they are mostly
+		 * independent and will drop the mm soon
+		 */
+		task_lock(p);
+		vfork = p->vfork_done;
+		task_unlock(p);
+		if (vfork)
+			continue;
+
+		ret = __task_will_free_mem(p);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
(...snipped...)
+}

@@ -945,14 +894,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it. The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
 	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		try_oom_reaper(current);
+		wake_oom_reaper(current);
 		return true;
 	}
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-28 12:22           ` Tetsuo Handa
@ 2016-05-30 11:57             ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2016-05-30 11:57 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Sat 28-05-16 21:22:08, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > We could very well do
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index bcb6d3b26c94..d9017b8c7300 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> > >  			 * memory might be still used.
> > >  			 */
> > >  			can_oom_reap = false;
> > > +			set_bit(MMF_OOM_REAPED, mm->flags);
> > >  			continue;
> > >  		}
> > >  		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> > >
> > > with the same result. If you _really_ think that this would make a
> > > difference I could live with that. But I am highly skeptical this
> > > matters all that much.
>
> Usage of set_bit() above and below are both wrong. The mm used by
> kernel thread via use_mm() will become OOM reapable after unuse_mm().

Please note that all other holders of that mm are gone by that time.
So unuse_mm will simply drop the last reference of the mm and do the
remaining clean up. There is no real reason this mm should be around
and visible by the oom killer.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice
  2016-05-27 16:24         ` [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Tetsuo Handa
  2016-05-28 12:22           ` Tetsuo Handa
@ 2016-05-30 11:55           ` Michal Hocko
  1 sibling, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2016-05-30 11:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, rientjes, akpm, oleg, vdavydov

On Sat 28-05-16 01:24:51, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 27-05-16 22:18:42, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > So this is the biggest change to my approach. And I think it is
> > > > incorrect because you cannot simply reap the memory when you have
> > > > active users of that memory potentially.
> > >
> > > I don't reap the memory when I have active users of that memory
> > > potentially. I do below check. I'm calling wake_oom_reaper() in order
> > > to guarantee that oom_reap_task() shall clear TIF_MEMDIE and drop
> > > oom_victims.
> > >
> > > @@ -483,7 +527,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
> > >
> > >  	task_unlock(p);
> > >
> > > -	if (!down_read_trylock(&mm->mmap_sem)) {
> > > +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
> > >  		ret = false;
> > >  		goto unlock_oom;
> > >  	}
> >
> > OK, I've missed this part. So you just defer the decision to the oom
> > reaper while I am trying to solve that at oom_kill_process level.
>
> Right.
>
> > We could very well do
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index bcb6d3b26c94..d9017b8c7300 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -813,6 +813,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			 * memory might be still used.
> >  			 */
> >  			can_oom_reap = false;
> > +			set_bit(MMF_OOM_REAPED, mm->flags);
> >  			continue;
> >  		}
> >  		if (p->signal->oom_score_adj == OOM_ADJUST_MIN)
> >
> > with the same result. If you _really_ think that this would make a
> > difference I could live with that.
> > But I am highly skeptical this
> > matters all that much.
>
> It matters a lot. There is a "guarantee" difference.
>
> Maybe this is something like a top half and bottom half relationship?
> The OOM killer context is the top half and the OOM reaper context is
> the bottom half. "The top half always hands over to the bottom half"
> allows the bottom half to recover when something goes wrong.
> In your approach, the top half might not hand over to the bottom half.

But your bottom half would just decide to back off the same way I do
here. And on top of that, your bottom half would have to do the rather
costly process iteration to find that out.

[...]

> > > > Shared with global init is just a non-existent problem. Such a
> > > > system would be crippled enough to not bother.
> > >
> > > See commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting to kill
> > > init sharing same memory").
> >
> > Don't you think that a system where the largest memory consumer is the
> > global init is crippled terribly?
>
> Why not?
>
>   PID=1 name=init RSS=100MB
>    \
>     +-- PID=102 name=child RSS=10MB (completed execve("child") 10 minutes ago)
>    \
>     +-- PID=103 name=child RSS=10MB (completed execve("child") 7 minutes ago)
>    \
>     +-- PID=104 name=child RSS=10MB (completed execve("child") 3 minutes ago)
>    \
>     +-- PID=105 name=init RSS=100MB (about to start execve("child") from vfork())
>
> should be allowed, shouldn't it?

Killing the vforked task doesn't make any sense because it is quite
unlikely to free any memory. We should select from any of the other
tasks which have completed their vfork.

> There is no reason to exclude a vfork()'ed child from the OOM victim
> candidates. We can't control how people run their userspace programs.
-- 
Michal Hocko
SUSE Labs
end of thread, other threads: [~2016-05-30 11:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-26 15:27 [RFC PATCH] mm, oom_reaper: do not attempt to reap a task more than twice Michal Hocko
2016-05-27 10:31 ` Tetsuo Handa
2016-05-27 12:23   ` Michal Hocko
2016-05-27 13:18     ` Tetsuo Handa
2016-05-27 13:35       ` Michal Hocko
2016-05-27 16:24         ` Tetsuo Handa
2016-05-28 12:22           ` Tetsuo Handa
2016-05-30 11:57             ` Michal Hocko
2016-05-30 11:55           ` Michal Hocko