From: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org
Subject: Re: [patch] mm, memcg: add oom killer delay
Date: Tue, 25 Jun 2013 10:39:05 +0900
Message-ID: <51C8F4B9.9060604@jp.fujitsu.com>
In-Reply-To: <alpine.DEB.2.02.1306140254590.8780@chino.kir.corp.google.com>

(2013/06/14 19:12), David Rientjes wrote:
> On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:
>
>> Reading your discussion, I think I understand your requirements.
>> The problem is that I don't think you took all options into
>> account and concluded that this new oom_delay is the best way. IOW, I'm
>> not convinced oom-delay is the best way to handle your issue.
>>
>
> Ok, let's talk about it.
>

I'm sorry that my RTT is long these days.

>> Your requirement is
>>   - Allowing a userland oom-handler within the local memcg.
>>
>
> Another requirement:
>
>   - Allow userland oom handler for global oom conditions.
>
> Hopefully that's hooked into memcg because the functionality is already
> there; we can simply duplicate all of the oom functionality that we'll be
> adding for the root memcg.
>

At the mm-summit, this was discussed, and people seemed to think a userland
oom-handler is impossible. Hm, and in-kernel scripting was discussed, as far
as I remember.



>> Put straightforwardly, the answer should be
>>   - Allowing the oom-handler daemon to escape its memcg's limit.
>>     (For example, a flag/capability for a task could achieve this.)
>>     Or attaching some *fixed* resource to the task rather than to the cgroup.
>>
>>     e.g. allow setting task->secret_saving=20M.
>>
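The per-task reserve idea above can be modeled as a charge check in which an ordinary task fails at the memcg limit, while a task carrying a fixed reserve may overshoot the limit, but only by what remains of its own reserve. This is a userspace sketch under stated assumptions: `struct memcg_model`, `struct task_model`, `secret_saving`, `reserve_used` and `try_charge()` are hypothetical names, not kernel API, and uncharging is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memcg: not the kernel's struct mem_cgroup. */
struct memcg_model {
    size_t usage;          /* bytes currently charged */
    size_t limit;          /* memory.limit_in_bytes */
};

/* Hypothetical per-task state for the "secret_saving" idea. */
struct task_model {
    size_t secret_saving;  /* fixed reserve granted to this task (0 = none) */
    size_t reserve_used;   /* part of the reserve consumed so far */
};

/*
 * Charge nr bytes to cg on behalf of t.  Within the limit every task
 * succeeds.  Above the limit only a task with unused reserve succeeds,
 * and the whole request is taken from its reserve (conservative).
 */
static bool try_charge(struct memcg_model *cg, struct task_model *t, size_t nr)
{
    if (cg->usage + nr <= cg->limit) {
        cg->usage += nr;
        return true;
    }
    /* Invariant: reserve_used <= secret_saving, so this cannot underflow. */
    if (nr <= t->secret_saving - t->reserve_used) {
        t->reserve_used += nr;
        cg->usage += nr;
        return true;
    }
    return false;          /* would trigger memcg oom handling */
}
```

With a 100-byte limit and a 20-byte reserve, the handler's charges keep succeeding until 20 bytes above the limit have been used, while an ordinary task fails at the limit. Charging the whole request against the reserve, rather than only the part above the limit, is a deliberately conservative simplification.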
>
> Exactly!
>
> First of all, thanks very much for taking an interest in our usecase and
> discussing it with us.
>
> I didn't propose what I referred to earlier in the thread as "memcg
> reserves" because I thought it was going to be a more difficult battle.
> The fact that you brought it up first actually makes me think it's less
> insane :)
>
> We do indeed want memcg reserves and I have patches to add it if you'd
> like to see that first.  It ensures that this userspace oom handler can
> actually do some work in determining which process to kill.  The reserve
> is a fraction of true memory reserves (the space below the per-zone min
> watermarks) which is dependent on min_free_kbytes.  This does indeed
> become more difficult with true and complete kmem charging.  That "work"
> could be opening the tasks file (which allocates the pidlist within the
> kernel), checking /proc/pid/status for rss, checking for how long a
> process has been running, checking for tid, sending a signal to drop
> caches, etc.
>
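As an example of the "work" listed above, a userspace handler could read a candidate's RSS from `/proc/<pid>/status`. The sketch below assumes the usual `VmRSS:` line format (value in kB) and reads with `open()`/`read()` into a fixed-size stack buffer, on the assumption that a handler should avoid unnecessary allocations (stdio buffers, for instance) while its memcg is oom. `parse_vmrss_kb()` and `read_vmrss_kb()` are illustrative helpers.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Scan the text of a /proc/<pid>/status file line by line and return the
 * VmRSS value in kB, or -1 if no VmRSS line is present.
 */
static long parse_vmrss_kb(const char *status)
{
    const char *line = status;

    while (line != NULL && *line != '\0') {
        long kb;

        /* The space in the format also matches the tab after "VmRSS:". */
        if (sscanf(line, "VmRSS: %ld", &kb) == 1)
            return kb;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;        /* move to the start of the next line */
    }
    return -1;
}

/* Read /proc/<pid>/status into a stack buffer and extract VmRSS. */
static long read_vmrss_kb(int pid)
{
    char path[64];
    char buf[8192];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;         /* process gone, or no /proc */
    /* One read() is normally enough for a file this small. */
    n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return parse_vmrss_kb(buf);
}
```

Kernel threads have no `VmRSS:` line, so both helpers return -1 for them as well as for a pid that has already exited.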


Considering only memcg, bypassing all charge-limit checks will work.
But as you say, that will not work for global oom.
Then, in-kernel scripting was discussed.


> We'd also like to do this for global oom conditions, which makes it even
> more interesting.  I was thinking of using a fraction of memory reserves
> as the oom killer currently does (that memory below the min watermark) for
> these purposes.
>
> Memory charging is simply bypassed for these oom handlers (we only grant
> access to those waiting on the memory.oom_control eventfd) up to
> memory.limit_in_bytes + (min_free_kbytes / 4), for example.  I don't think
> this is entirely insane because these oom handlers should lead to future
> memory freeing, just like TIF_MEMDIE processes.
>
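The ceiling in the example above mixes units: `memory.limit_in_bytes` is in bytes while `min_free_kbytes` is in kilobytes. Assuming the intent is a quarter of the `min_free_kbytes` reserve, the bypass ceiling could be computed as below; `oom_handler_ceiling()`, `may_charge()` and the `waits_on_oom_eventfd` flag are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Limit plus a quarter of min_free_kbytes, converted from kB to bytes. */
static uint64_t oom_handler_ceiling(uint64_t limit_in_bytes,
                                    uint64_t min_free_kbytes)
{
    return limit_in_bytes + (min_free_kbytes / 4) * 1024;
}

/*
 * Would charging nr more bytes be allowed?  Ordinary tasks stop at the
 * limit; tasks waiting on the memory.oom_control eventfd may go up to the
 * extended ceiling.
 */
static bool may_charge(uint64_t usage, uint64_t nr, uint64_t limit_in_bytes,
                       uint64_t min_free_kbytes, bool waits_on_oom_eventfd)
{
    uint64_t ceiling = waits_on_oom_eventfd
        ? oom_handler_ceiling(limit_in_bytes, min_free_kbytes)
        : limit_in_bytes;

    return usage + nr <= ceiling;
}
```

For example, with a 100 MiB limit and min_free_kbytes=4096, an oom handler may charge up to 1 MiB (4096 / 4 = 1024 kB) past the limit. Because min_free_kbytes is a global value, this formula gives every memcg the same extra allowance regardless of its size.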

I think that kind of bypassing is acceptable.


>> Going back to your patch, what's confusing is your approach.
>> Why should a problem caused by the amount of memory be solved by
>> some delay, i.e. an amount of time?
>>
>> Substituting time for memory sounds confusing to me.
>>
>
> Even with all of the above (which is not actually that invasive of a
> patch), I still think we need memory.oom_delay_millisecs.  I probably made
> a mistake in describing what that is addressing if it seems like it's
> trying to address any of the above.
>
> If a userspace oom handler fails to respond even with access to those
> "memcg reserves",

How can this happen?

>  the kernel needs to kill within that memcg.  Do we do
> that above a set time period (this patch) or when the reserves are
> completely exhausted?  That's debatable, but if we are to allow it for
> global oom conditions as well then my opinion was to make it as safe as
> possible; today, we can't disable the global oom killer from userspace and
> I don't think we should ever allow it to be disabled.  I think we should
> allow userspace a reasonable amount of time to respond and then kill if it
> is exceeded.
>
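The two triggers weighed above (a fixed time period versus exhaustion of the reserves) are not mutually exclusive. One possible policy, sketched here with hypothetical names, waits for the userspace handler until `delay_ms` (playing the role of memory.oom_delay_millisecs) has elapsed, but lets the kernel kill earlier if there is no handler or its reserve is already exhausted. Timestamps are passed in as plain millisecond counters; kernel code would more likely use jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

enum oom_action {
    OOM_WAIT,          /* keep waiting for the userspace handler */
    OOM_KERNEL_KILL,   /* fall back to the in-kernel oom killer */
};

/*
 * Decide what to do for a memcg that has been oom since oom_start_ms.
 * delay_ms plays the role of memory.oom_delay_millisecs.
 */
static enum oom_action oom_decide(uint64_t oom_start_ms, uint64_t now_ms,
                                  uint64_t delay_ms, bool reserve_exhausted,
                                  bool has_handler)
{
    if (!has_handler)
        return OOM_KERNEL_KILL;   /* nobody in userspace will act */
    if (reserve_exhausted)
        return OOM_KERNEL_KILL;   /* handler can no longer make progress */
    if (now_ms - oom_start_ms >= delay_ms)
        return OOM_KERNEL_KILL;   /* handler took too long */
    return OOM_WAIT;
}
```

Under this policy the oom condition can never persist indefinitely, which matches the stated goal of never letting userspace disable the kill outright.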
> For the global oom case, we want to have a priority-based memcg selection.
> Select the lowest priority top-level memcg and kill within it.  If it has
> an oom notifier, send it a signal to kill something.  If it fails to
> react, kill something after memory.oom_delay_millisecs has elapsed.  If
> there isn't a userspace oom notifier, kill something within that lowest
> priority memcg.
>
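The global-oom selection described above might look like the following. It assumes that a lower `priority` value means "kill first" and breaks ties in favor of the earlier entry; `struct top_memcg`, `select_victim_memcg()` and `first_step()` are hypothetical, and the delayed fallback kill after memory.oom_delay_millisecs is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor for a top-level memcg. */
struct top_memcg {
    const char *name;
    int priority;            /* lower value = selected first (assumption) */
    bool has_oom_notifier;   /* a userspace handler waits on its eventfd */
};

enum global_oom_step {
    NOTIFY_HANDLER,          /* let the memcg's handler pick a victim */
    KILL_IN_MEMCG,           /* no handler: kill within the memcg directly */
};

/* Index of the lowest-priority memcg, or -1 if n == 0; ties keep the first. */
static int select_victim_memcg(const struct top_memcg *cgs, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++)
        if (best < 0 || cgs[i].priority < cgs[best].priority)
            best = (int)i;
    return best;
}

/* First action for the selected memcg. */
static enum global_oom_step first_step(const struct top_memcg *cg)
{
    return cg->has_oom_notifier ? NOTIFY_HANDLER : KILL_IN_MEMCG;
}
```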

Someone may object to that kind of control and say "Hey, I have a better idea".
That was another reason oom-scripting was discussed: no one can implement
general-purpose victim selection logic.

> The bottomline with my approach is that I don't believe there is ever a
> reason for an oom memcg to remain oom indefinitely.  That's why I hate
> memory.oom_control == 1 and I think for the global notification it would
> be deemed a nonstarter since you couldn't even log in to the machine.
>
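For reference, the existing cgroup v1 interface these oom handlers wait on works as described in the memory controller documentation: create an eventfd, open `memory.oom_control`, and write `"<event_fd> <fd of memory.oom_control>"` to `cgroup.event_control`; the eventfd is then signaled when the memcg hits oom. Writing 1 to `memory.oom_control` is what disables the memcg oom killer, the setting criticized above. The helper names below are illustrative, and handing the `memory.oom_control` fd back to the caller to keep open is a conservative choice, not a documented requirement.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Register for oom notifications on a cgroup v1 memcg directory.
 * On success returns the eventfd and stores the memory.oom_control fd in
 * *oom_fd_out (the caller keeps it open); on failure returns -1.
 */
static int register_oom_eventfd(const char *memcg_dir, int *oom_fd_out)
{
    char path[4096];
    char cmd[64];
    int ofd, cfd, efd, len;

    snprintf(path, sizeof(path), "%s/memory.oom_control", memcg_dir);
    ofd = open(path, O_RDONLY);
    if (ofd < 0)
        return -1;

    snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg_dir);
    cfd = open(path, O_WRONLY);
    if (cfd < 0) {
        close(ofd);
        return -1;
    }

    efd = eventfd(0, 0);
    if (efd < 0) {
        close(cfd);
        close(ofd);
        return -1;
    }

    /* "<event_fd> <fd of memory.oom_control>" registers the notifier. */
    len = snprintf(cmd, sizeof(cmd), "%d %d", efd, ofd);
    if (write(cfd, cmd, (size_t)len) != (ssize_t)len) {
        close(efd);
        close(cfd);
        close(ofd);
        return -1;
    }
    close(cfd);
    *oom_fd_out = ofd;
    return efd;
}

/* Block until an oom event arrives; returns the event count, 0 on error. */
static uint64_t wait_for_oom(int efd)
{
    uint64_t count;

    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

A handler would call `register_oom_eventfd()` once for its memcg directory and then loop on `wait_for_oom()`, doing its victim selection each time the call returns.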
>> I'm not against what you finally want to do, but I don't like the fix.
>>
>
> I'm thrilled to hear that, and I hope we can work to make userspace oom
> handling more effective.
>
> What do you think about that above?

IMHO, it will be difficult, but allowing users to write a script/filter for
oom-killing is worth trying. Something like:

==
for_each_process(p) :
   if p->comm == mem_manage_daemon :
      continue
   if p->user == root :
      continue
   score = default_calc_score(p)
   if score > high_score :
      high_score = score
      selected = p
==
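The filter above can be written as ordinary C over a snapshot of process information. This is a sketch, not a proposed kernel interface: `struct proc_info` and `select_victim()` are hypothetical, the precomputed `score` field stands in for `default_calc_score()`, and "root" is taken to mean uid 0.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-process snapshot used by the filter. */
struct proc_info {
    pid_t pid;
    const char *comm;
    uid_t uid;
    long score;        /* stands in for default_calc_score() */
};

/* Return the pid of the highest-scoring eligible process, or -1 if none. */
static pid_t select_victim(const struct proc_info *procs, size_t n,
                           const char *protected_comm)
{
    ssize_t best = -1;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(procs[i].comm, protected_comm) == 0)
            continue;  /* never kill the memory-management daemon */
        if (procs[i].uid == 0)
            continue;  /* skip root-owned processes */
        if (best < 0 || procs[i].score > procs[best].score)
            best = (ssize_t)i;
    }
    return best < 0 ? -1 : procs[best].pid;
}
```

Tracking the index of the best candidate, instead of seeding `high_score` with a sentinel value, means any eligible process can be selected whatever its score.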

BTW, if you like the logic in the userland oom daemon, why can't you implement
it in the kernel? Does it do anything clever other than sending SIGKILL?

Thanks,
-Kame




