From: Michal Hocko <mhocko@kernel.org>
To: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
Date: Thu, 31 May 2018 08:32:12 +0200	[thread overview]
Message-ID: <20180531063212.GF15278@dhcp22.suse.cz> (raw)
In-Reply-To: <alpine.DEB.2.21.1805301357100.150424@chino.kir.corp.google.com>

On Wed 30-05-18 14:06:51, David Rientjes wrote:
> On Mon, 28 May 2018, Michal Hocko wrote:
> 
> > > That's not sufficient since the oom reaper is also not able to oom reap if
> > > the mm has blockable mmu notifiers or all memory is shared file-backed
> > > memory, so it immediately sets MMF_OOM_SKIP and additional processes are
> > > oom killed.
> > 
> > Could you be more specific with a real world example where that is the
> > case? I mean the full address space of non-reclaimable file backed
> > memory where waiting some more would help? Blockable mmu notifiers are
> > a PITA for sure. I wish we could have a better way to deal with them.
> > Maybe we can tell them we are in the non-blockable context and have them
> > release as much as possible. Still, something that a random timeout
> > wouldn't help with, I am afraid.
> > 
> 
> It's not a random timeout; it's sufficiently long that we don't oom
> kill several processes needlessly in the very rare case where oom livelock
> would actually prevent the original victim from exiting.  The oom reaper
> processing an mm, finding everything to be mlocked, and immediately
> setting MMF_OOM_SKIP is inappropriate.  This is rather trivial to reproduce with a
> large memory-hogging process that mlocks all of its memory; we
> consistently see spurious and unnecessary oom kills simply because the oom
> reaper has set MMF_OOM_SKIP very early.

It takes quite a few additional steps for an admin to allow a large amount
of mlocked memory, and such an application should be really careful not
to consume too much memory. So how come you see this so
consistently? Is this some sort of bug or an unfortunate workload side
effect? I am asking because I really want to see how relevant this
really is.
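
For context, the immediate MMF_OOM_SKIP on a fully mlocked mm falls out
of the reaper's vma walk, which simply skips anything it cannot zap. A
minimal sketch of that logic, simplified from __oom_reap_task_mm() and
can_madv_dontneed_vma() as of ~v4.17 (not the verbatim kernel code;
locking and mmu notifier handling omitted):

	/* Sketch: which vmas the oom reaper can actually zap. */
	static bool vma_is_reapable(struct vm_area_struct *vma)
	{
		/* mlocked, hugetlb and PFN mappings cannot be unmapped
		 * with the MADV_DONTNEED-style zapping the reaper uses */
		return !(vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP));
	}

	static void reap_mm_sketch(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;

		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			if (!vma_is_reapable(vma))
				continue;	/* fully mlocked mm: nothing freed */
			/* zap the range here, as unmap_page_range() does */
		}
		/* the skip bit gets set regardless of how much was freed,
		 * so the oom killer immediately moves to the next victim */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}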

> This patch introduces a "give up" period such that the oom reaper is still
> allowed to do its good work, but it only gives up after a substantial
> period of time, in the hope that the victim can make forward progress.  I
> would understand the objection if oom livelock, where the victim cannot
> make forward progress, were commonplace, but in the interest of not killing
> several processes needlessly every time a large mlocked process is
> targeted, I think it compels a waiting period.

But waiting periods just turn out to be really poor design. There is no
single timeout that fits everybody. We can do better, and as long as
that is the case the timeout-based solution should really be rejected.
It is a shortcut that doesn't solve the underlying problem.
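
For reference, the patch under discussion boils down to deferring that
skip bit rather than setting it after the first unsuccessful pass.
Roughly (a sketch only; oom_reap_start and the timeout value are
illustrative, not lifted from the actual patch):

	#define OOM_FREE_TIMEOUT	(10 * HZ)	/* illustrative value */

	static void oom_reap_task_sketch(struct task_struct *tsk)
	{
		struct mm_struct *mm = tsk->signal->oom_mm;

		if (!oom_reap_task_mm(tsk, mm) &&
		    !time_after(jiffies, tsk->oom_reap_start + OOM_FREE_TIMEOUT)) {
			/* reaping failed but the timeout has not expired:
			 * keep the victim queued and retry later instead
			 * of setting MMF_OOM_SKIP straight away */
			return;
		}
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}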

> > Trying to reap a different oom victim when the current one is not making
> > progress during the lock contention is certainly something that makes
> > sense. It has been proposed in the past and we just gave it up because
> > it was more complex. Do you have any specific example when this would
> > help to justify the additional complexity?
> > 
> 
> I'm not sure how you're defining complexity; the patch adds ~30 lines of
> code, prevents processes from needlessly being oom killed when oom
> reaping is largely unsuccessful and before the victim finishes
> free_pgtables(), and also allows the oom reaper to operate on multiple
> mm's instead of processing one at a time.  Obviously, if there is a delay
> before MMF_OOM_SKIP is set, it requires that the oom reaper be able to
> process other mm's, otherwise we stall needlessly for 10s.  Operating on
> multiple mm's in a linked list while waiting for victims to exit during a
> timeout period is thus very much needed; the patch wouldn't make sense
> without it.

It needs to keep track of the current retry state of the reaped victim,
and that is additional complexity, isn't it? And I am asking how often
we have to handle that. Please note that the primary objective here is
to unclutter a locked-up situation. The oom reaper doesn't prevent the
victim from going away on its own while we keep retrying. So slow
progress on the reaper side is not an issue IMHO.
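
Concretely, the extra state is per-victim bookkeeping along these lines
(a hypothetical structure, shown only to make the cost of the retry
tracking visible; no posted patch uses these exact names):

	/* Hypothetical per-victim reap state for a multi-mm reaper. */
	struct oom_reap_state {
		struct list_head	reap_list;	/* all queued victims */
		unsigned long		reap_start;	/* jiffies of first attempt */
		int			reap_retries;	/* attempts so far */
	};

plus a walk over that list in the reaper thread, where today a single
oom_reaper_list entry is processed at a time.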

> > > But also note that even if oom reaping is possible, in the presence of an
> > > antagonist that continues to allocate memory, it is possible to oom
> > > kill additional victims unnecessarily if we aren't able to complete
> > > free_pgtables() in exit_mmap() of the original victim.
> > 
> > If there is unbound source of allocations then we are screwed no matter
> > what. We just hope that the allocator will get noticed by the oom killer
> > and it will be stopped.
> > 
> 
> It's not unbounded; it's just an allocator that acts as an antagonist.  At
> the risk of being overly verbose, for system or memcg oom conditions: a
> large mlocked process is oom killed, other processes continue to
> allocate/charge, the oom reaper almost immediately sets MMF_OOM_SKIP
> without having been able to free any memory, and other important processes
> are needlessly oom killed before the original victim can reach
> exit_mmap().  This happens a _lot_.
>
> I'm open to hearing any other suggestions you have, besides waiting some
> period of time before MMF_OOM_SKIP is set, to solve this problem.

I've already offered one: make mlocked pages reapable. This is something
that has been on the todo list for quite some time; I just didn't have
time to work on it. The priority was not at the top because most sane
workloads simply do not mlock a large portion of their memory. But if
you see that happening regularly then this should be the first thing to
try. The main obstacle back then was the page lock currently taken in
the munlock path. I've discussed that with Hugh and, IIRC, he said it is
mainly there for accounting purposes and mostly a relic from the past,
so this should be fixable and a general improvement as well.
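
The page lock obstacle is visible in the munlock path itself, which per
page does roughly the following (simplified from
munlock_vma_pages_range() in mm/mlock.c; not the verbatim code):

	while (start < end) {
		struct page *page = follow_page(vma, start, FOLL_GET);

		if (page) {
			lock_page(page);	/* a sleeping lock, which the
						 * oom reaper must never
						 * block on */
			munlock_vma_page(page);	/* clear PageMlocked and put
						 * the page back on the LRU */
			unlock_page(page);
			put_page(page);
		}
		start += PAGE_SIZE;
	}

If that page lock can indeed be dropped, the reaper could treat
VM_LOCKED vmas like any others and the mlock case above disappears.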

-- 
Michal Hocko
SUSE Labs
