From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: mhocko@kernel.org
Cc: linux-mm@kvack.org, rientjes@google.com, akpm@linux-foundation.org
Subject: Re: [PATCH 3/3] mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
Date: Mon, 18 Apr 2016 20:59:51 +0900
Message-ID: <201604182059.JFB76917.OFJMHFLSOtQVFO@I-love.SAKURA.ne.jp>
In-Reply-To: <20160417115422.GA21757@dhcp22.suse.cz>
Michal Hocko wrote:
> On Sat 16-04-16 11:51:11, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 07-04-16 20:55:34, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > The first obvious one is when the oom victim clears its mm and gets
> > > > > stuck later on. oom_reaper would back off on find_lock_task_mm returning
> > > > > NULL. We can safely try to clear TIF_MEMDIE in this case because such a
> > > > > task would be ignored by the oom killer anyway. The flag would be
> > > > > cleared by that time already most of the time anyway.
> > > >
> > > > I didn't understand what this wants to tell. The OOM victim will clear
> > > > TIF_MEMDIE as soon as it sets current->mm = NULL.
> > >
> > > No it clears the flag _after_ it returns from mmput. There is no
> > > guarantee it won't get stuck somewhere on the way there - e.g. exit_aio
> > > waits for completion and who knows what else might get stuck.
> >
> > OK. Then, I think an OOM livelock scenario shown below is possible.
> >
> > (1) First OOM victim (where mm->mm_users == 1) is selected by the first
> > round of out_of_memory() call.
> >
> > (2) The OOM reaper calls atomic_inc_not_zero(&mm->mm_users).
> >
> > (3) The OOM victim calls mmput() from exit_mm() from do_exit().
> > mmput() returns immediately because atomic_dec_and_test(&mm->mm_users)
> > returns false because of (2).
> >
> > (4) The OOM reaper reaps memory and then calls mmput().
> > mmput() calls exit_aio() etc. and waits for completion because
> > atomic_dec_and_test(&mm->mm_users) is now true.
> >
> > (5) Second OOM victim (which is the parent of the first OOM victim)
> > is selected by the next round of out_of_memory() call.
> >
> > (6) The OOM reaper is waiting for completion of the first OOM victim's
> > memory while the second OOM victim is waiting for the OOM reaper to
> > reap memory.
> >
> > Where is the guarantee that exit_aio() etc. called from mmput() by the
> > OOM reaper does not depend on memory allocation (i.e. the OOM reaper is
> > not blocked forever inside __oom_reap_task())?
>
> You should realize that the mmput is called _after_ we have reclaimed
> victim's address space. So there should be some memory freed by that
> time, which reduces the likelihood of a lockup due to a memory allocation
> request, if one is really needed by exit_aio.
Not always true. mmput() is also called when down_read_trylock(&mm->mmap_sem)
fails. It is possible that the OOM victim was about to call
up_write(&mm->mmap_sem) when down_read_trylock(&mm->mmap_sem) failed, and that
the victim then ran all the way to returning from mmput() (from exit_mm() from
do_exit()) while the OOM reaper was preempted between
down_read_trylock(&mm->mmap_sem) and mmput(). Under such a race, the OOM
reaper will call mmput() without having reclaimed the OOM victim's address
space.
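Either way, the crux is the mm_users handoff: once the OOM reaper has pinned
the mm via atomic_inc_not_zero(), whoever drops the last reference inherits
the teardown (exit_aio() and friends). Below is a minimal userspace sketch of
that handoff, with C11 stdatomic standing in for the kernel's atomic_t; all
names here (mm_model, pin_mm, put_mm) are invented for illustration, not
kernel API:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the mm->mm_users handoff; illustrative only,
 * the real kernel types and helpers differ. */
struct mm_model {
	atomic_int mm_users;
	bool torn_down_by_reaper;	/* did the reaper run the teardown? */
};

/* Analogue of atomic_inc_not_zero(): pin the mm unless it is already dead. */
static bool pin_mm(struct mm_model *mm)
{
	int old = atomic_load(&mm->mm_users);

	while (old != 0)
		if (atomic_compare_exchange_weak(&mm->mm_users, &old, old + 1))
			return true;
	return false;
}

/* Analogue of mmput(): whoever drops the last reference pays for the
 * teardown (exit_aio() and friends in the kernel). */
static void put_mm(struct mm_model *mm, bool caller_is_reaper)
{
	if (atomic_fetch_sub(&mm->mm_users, 1) == 1 && caller_is_reaper)
		mm->torn_down_by_reaper = true;
}

/* Replay steps (2)-(4) of the scenario quoted above. */
static bool reaper_pays_for_teardown(void)
{
	struct mm_model mm = { .mm_users = 1, .torn_down_by_reaper = false };

	pin_mm(&mm);		/* (2) reaper: atomic_inc_not_zero(), 1 -> 2 */
	put_mm(&mm, false);	/* (3) victim: mmput() returns early, 2 -> 1 */
	put_mm(&mm, true);	/* (4) reaper: final mmput(), 1 -> 0 */
	return mm.torn_down_by_reaper;
}
```

Under the race described above, pin_mm() succeeds before the victim's
mmput(), so the reaper ends up holding the final reference and pays for the
teardown even when it reclaimed nothing.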
>
> But you have a good point here. We want to strive for robustness of
> oom_reaper as much as possible. We have dropped the munlock patch because
> of the robustness so I guess we want this to be fixed as well. The
> reason for blocking might be different from memory pressure I guess.
The reality of races and dependencies is more complicated than we can imagine.
>
> Here is what should work - I have only compile tested it. I will prepare
> the proper patch later this week with other oom reaper patches or after
> I come back from LSF/MM.
Excuse me, but is system_wq suitable for queuing operations which may take
an unpredictably long time to complete? The comment in
include/linux/workqueue.h says:

    system_wq is the one used by schedule[_delayed]_work[_on]().
    Multi-CPU multi-threaded. There are users which expect relatively
    short queue flush time. Don't queue works which can run for too
    long.
Many users, including SysRq-f, depend on system_wq being flushed shortly. We
have never guaranteed that SysRq-f can always fire and select a different OOM
victim, but you proposed always clearing TIF_MEMDIE without considering the
possibility that an OOM victim holding mmap_sem for write gets stuck in an
unkillable wait.
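To illustrate the flush-time concern, here is a toy single-worker FIFO model
of a work queue (pure userspace C; all names are invented, and it is grossly
simplified compared to the kernel's concurrency-managed per-CPU worker
pools): one item of unpredictable duration delays everything queued behind
it, including a SysRq-f analogue.

```c
#include <stddef.h>

/* Toy single-worker FIFO "workqueue": items run strictly in order, so an
 * item of unpredictable duration delays everything queued behind it. */
#define MAX_ITEMS 8

struct toy_wq {
	void (*fn[MAX_ITEMS])(int *clock, int *done_at);
	int *done_at[MAX_ITEMS];
	size_t n;
};

static void queue_work_item(struct toy_wq *wq,
			    void (*fn)(int *, int *), int *done_at)
{
	wq->fn[wq->n] = fn;
	wq->done_at[wq->n] = done_at;
	wq->n++;
}

/* The single worker drains the queue in FIFO order; *clock counts
 * abstract time units consumed so far. */
static void flush_queue(struct toy_wq *wq, int *clock)
{
	for (size_t i = 0; i < wq->n; i++)
		wq->fn[i](clock, wq->done_at[i]);
}

/* An OOM-reap analogue stuck for a long, unpredictable time. */
static void slow_reap(int *clock, int *done_at)
{
	*clock += 1000;		/* e.g. blocked in exit_aio() */
	*done_at = *clock;
}

/* A SysRq-f analogue that would normally finish immediately. */
static void sysrq_oom(int *clock, int *done_at)
{
	*clock += 1;
	*done_at = *clock;
}

/* When does the SysRq-f analogue complete? Only after the slow item. */
static int sysrq_completion_time(void)
{
	struct toy_wq wq = { .n = 0 };
	int clock = 0, reap_done = 0, sysrq_done = 0;

	queue_work_item(&wq, slow_reap, &reap_done);
	queue_work_item(&wq, sysrq_oom, &sysrq_done);
	flush_queue(&wq, &clock);
	(void)reap_done;	/* the slow item finished at t=1000 */
	return sysrq_done;	/* t=1001: delayed behind the slow reap */
}
```

The point is not this toy, but that a shared queue gives no upper bound on
how long items queued behind a stuck reap have to wait; a dedicated
workqueue or kernel thread for the reaper would keep unrelated users out of
its shadow.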
I wonder about your definition of "robustness". You are almost always missing
the worst-case scenario. You are trying to manage OOM handling like a switch
statement with no default: label. I don't think your approach is robust.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .