From: David Rientjes <rientjes@google.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
cgroups@vger.kernel.org
Subject: Re: [patch] mm, memcg: add oom killer delay
Date: Mon, 3 Jun 2013 11:18:09 -0700 (PDT)
Message-ID: <alpine.DEB.2.02.1306031102480.7956@chino.kir.corp.google.com>
In-Reply-To: <20130601102058.GA19474@dhcp22.suse.cz>
On Sat, 1 Jun 2013, Michal Hocko wrote:
> > Users obviously don't have the ability to attach processes to the root
> > memcg. They are constrained to their own subtree of memcgs.
>
> OK, I assume those groups are generally untrusted, right? So you cannot
> let them register their oom handler even via an admin interface. This
> makes it a bit complicated because it places much harder demands on the
> handler itself, as it has to run in a restricted environment.
>
That's the point of the patch. We want to allow users to register their
own oom handler in a subtree (they may attach it to their own subtree root
and wait on memory.oom_control of a child memcg with a limit lower than
that root's) but not insist on an absolutely perfect implementation that
can never fail when you run on many, many servers. Userspace
implementations do fail sometimes; we just accept that.
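
For concreteness, the registration being discussed is the cgroup v1
eventfd notification API: open memory.oom_control, create an eventfd, and
register the pair via cgroup.event_control. A minimal sketch, where the
mount point and memcg path are assumptions and error handling is omitted:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	/* hypothetical path; depends on where the memory controller is mounted */
	const char *memcg = "/sys/fs/cgroup/memory/users/child";
	char path[256], cmd[64];
	uint64_t events;
	int efd, oom_fd, ctl_fd;

	efd = eventfd(0, 0);

	snprintf(path, sizeof(path), "%s/memory.oom_control", memcg);
	oom_fd = open(path, O_RDONLY);

	/* cgroup v1 event registration: "<eventfd> <fd of memory.oom_control>" */
	snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg);
	ctl_fd = open(path, O_WRONLY);
	snprintf(cmd, sizeof(cmd), "%d %d", efd, oom_fd);
	write(ctl_fd, cmd, strlen(cmd));

	/* blocks until the child memcg hits its limit and goes oom */
	read(efd, &events, sizeof(events));

	/* ... inspect the memcg, raise its limit, or kill a task ... */
	return 0;
}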
> I still do not see why you cannot simply read tasks file into a
> preallocated buffer. This would be few pages even for thousands of pids.
> You do not have to track processes as they come and go.
>
What do you suggest when a read of the "tasks" file returns -ENOMEM
because a kmalloc() in the kernel fails, the userspace oom handler's memcg
itself being oom? Obviously it's not a situation we want to get into, but
unless you know the handler's exact memory usage across multiple versions,
know that nothing else is sharing its memcg, and have a perfect
implementation, you can't guarantee it. We need to address real world
problems that occur in practice.
> As I said before. oom_delay_millisecs is actually really easy to be done
> from userspace. If you really need a safety break then you can register
> such a handler as a fallback. I am not familiar with eventfd internals
> much but I guess that multiple handlers are possible. The fallback might
> be enforced by the admin (when a new group is created) or by the
> container itself. Would something like this work for your use case?
>
You're suggesting another userspace process that solely waits for a set
duration and then re-enables the oom killer? It faces all the same
problems as the real userspace oom handler: its own need for a perfect
implementation and its own memcg constraints.
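
For what it's worth, such a fallback would amount to roughly the sketch
below (eventfd registration as in the sketch above; the delay and the
memory.oom_control path are assumptions), and it must survive under
exactly the same constraints:

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

static void oom_watchdog(int efd, const char *oom_control_path,
			 unsigned int delay_secs)
{
	uint64_t events;
	int fd;

	for (;;) {
		read(efd, &events, sizeof(events));	/* blocks until oom fires */
		sleep(delay_secs);			/* the grace period */
		fd = open(oom_control_path, O_WRONLY);
		write(fd, "0", 1);	/* cgroup v1: "0" re-enables the oom killer */
		close(fd);
	}
}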
> > If that user is constrained to his or her own subtree, as previously
> > stated, there's also no way to login and rectify the situation at that
> > point and requires admin intervention or a reboot.
>
> Yes, insisting on the same subtree makes life much harder for oom
> handlers. I totally agree with you on that. I just feel that introducing
> a new knob to work around user "inability" to write a proper handler
> (whatever that means) is not justified.
>
It's not necessarily harder if you assign the userspace oom handlers to
the root of your subtree, with access to more memory than the children.
There is no "inability" to write a proper handler, but when you have
dozens of individual users implementing their own userspace handlers, with
memcg limits that change over time, you'll find it hard to achieve
perfection every time. If we had perfection, we wouldn't have to worry
about oom in the first place. Nor can we let these gazillion memcgs sit
spinning forever when they get stuck. That's why we've used this solution
for years as a failsafe. Disabling the oom killer entirely, even for a
memcg, is ridiculous, and if you don't have a grace period then oom
handlers themselves just don't work.
> > Then why does "cat tasks" stall when my memcg is totally depleted of all
> > memory?
>
> if you run it like this then cat obviously needs some charged
> allocations. If you had a proper handler which mlocks its buffer for the
> read syscall then you shouldn't require any allocation at oom time.
> This shouldn't be that hard to do without too much memory overhead. As I
> said, we are talking about a few dozen pages per handler.
>
I'm talking about the memory the kernel allocates when the "tasks" file
is read, not userspace memory. That kernel allocation can, and will, fail
with -ENOMEM.
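
To make the disagreement concrete: the mlocked-buffer handler you
describe would look roughly like the sketch below (the buffer size is an
assumption). The userspace side charges nothing at oom time, but the
read() still has the kernel generate the file contents, and that
kernel-side allocation is what can fail:

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define TASKS_BUF_SIZE	(32 * 4096)	/* "a few dozen pages" */

static char tasks_buf[TASKS_BUF_SIZE];

/* called once at startup, before any oom condition exists */
static int handler_setup(void)
{
	return mlock(tasks_buf, sizeof(tasks_buf));
}

static ssize_t read_tasks(const char *tasks_path)
{
	ssize_t n;
	int fd;

	fd = open(tasks_path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* no userspace allocation happens here, but the kernel allocates
	 * internally to produce the listing, so this can return -ENOMEM */
	n = read(fd, tasks_buf, sizeof(tasks_buf));
	close(fd);
	return n;
}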