From: Tim Hockin
Date: Sat, 7 Dec 2013 13:04:36 -0800
Subject: Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves
In-Reply-To: <20131207190653.GI21724@cmpxchg.org>
To: Johannes Weiner
Cc: Michal Hocko, Li Zefan, KAMEZAWA Hiroyuki, Tejun Heo, Christoph Lameter,
    David Rientjes, linux-mm@kvack.org, Rik van Riel, Pekka Enberg,
    cgroups@vger.kernel.org, Mel Gorman, Andrew Morton,
    linux-kernel@vger.kernel.org

We have hierarchical "containers".=C2=A0 Jobs exis= t in these containers.=C2=A0 The containers can hold sub-containers.

In case of system OOM we want to kill in strict priority ord= er.=C2=A0 From the root of the hierarchy, choose the lowest priority.=C2=A0= This could be a task or a memcg.=C2=A0 If a memcg, recurse.=C2=A0
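
In rough code, the walk looks something like the sketch below.  This is
purely an illustration: the per-group "memory.priority" file (lower value
dies first) is a made-up stand-in for whatever knob actually carries the
priority, only child groups are compared (local tasks are treated as the
leaf case), and error handling is minimal.

/* Sketch of the strict-priority victim walk described above.  Assumes a
 * cgroup v1 memcg hierarchy under MEMCG_ROOT and a hypothetical per-group
 * "memory.priority" file; that file does not exist in mainline. */
#include <dirent.h>
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MEMCG_ROOT "/sys/fs/cgroup/memory"

static long read_long(const char *dir, const char *file, long dflt)
{
    char path[PATH_MAX];
    long val = dflt;
    FILE *f;

    snprintf(path, sizeof(path), "%s/%s", dir, file);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = dflt;
        fclose(f);
    }
    return val;
}

static void kill_lowest(const char *dir)
{
    char best[PATH_MAX] = "", path[PATH_MAX];
    long best_prio = LONG_MAX;
    struct dirent *de;
    DIR *d = opendir(dir);

    if (!d)
        return;
    /* Among the child memcgs, remember the one with the lowest priority. */
    while ((de = readdir(d)) != NULL) {
        struct stat st;

        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0 && S_ISDIR(st.st_mode)) {
            long prio = read_long(path, "memory.priority", LONG_MAX);

            if (prio < best_prio) {
                best_prio = prio;
                strcpy(best, path);
            }
        }
    }
    closedir(d);

    if (best[0]) {
        kill_lowest(best);      /* a memcg won: recurse into it */
    } else {
        /* Leaf: kill the first task listed in this group. */
        long pid = read_long(dir, "cgroup.procs", 0);

        if (pid > 0)
            kill((pid_t)pid, SIGKILL);
    }
}

int main(void)
{
    kill_lowest(MEMCG_ROOT);
    return 0;
}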

We CAN do it in kernel (in fact we do, and I argued for that, and David
acquiesced).  But doing it in kernel means changes are slow and risky.

What we really have is a bunch of features that we offer to our users that
need certain OOM-time behaviors and guarantees to be implemented.  I don't
expect that most of our changes are useful for anyone outside of Google,
really.  They come with a lot of environmental assumptions.  This is why
David finally convinced me it was easier to release changes, to fix bugs,
and to update kernels if we do this in userspace.

I apologize if I am not giving you what you want.  I am typing on a phone
at the moment.  If this still doesn't help I can try from a computer later.

Tim

On Dec 7, 2013 11:07 AM, "Johannes Weiner" <hannes@cmpxchg.org> wrote:
> On Sat, Dec 07, 2013 at 10:12:19AM -0800, Tim Hockin wrote:
> > You more or less described the fundamental change - a score per memcg,
> > with a recursive OOM killer which evaluates scores between siblings at
> > the same level.
> >
> > It gets a bit complicated because we have need of wider scoring ranges
> > than are provided by default
>
> If so, I'm sure you can make a convincing case to widen the internal
> per-task score ranges.  The per-memcg score ranges have not even been
> defined, so this is even easier.
>
> > and because we score PIDs against memcgs at a given scope.
>
> You are describing bits of a solution, not a problem.  And I can't
> possibly infer a problem from this.
>
> > We also have some tiebreaker heuristic (age).
>
> Either periodically update the per-memcg score from userspace or
> implement this in the kernel.  We have considered CPU usage
> history/runtime etc. in the past when picking an OOM victim task.
>
> But I'm again just speculating what your problem is, so this may or
> may not be a feasible solution.
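
The "periodically update from userspace" half could be as simple as the
loop below.  This is only a sketch: there is no per-memcg score file today,
so it falls back on the existing /proc/<pid>/oom_score_adj and cgroup.procs
interfaces, and the policy (nudge every task in one hypothetical group by
+1 each minute, capped at the kernel's 1000 limit) is invented purely for
illustration.

/* Sketch: periodic userspace score refresh for one memcg.  The group path
 * and the +1-per-minute "age" policy are made up; oom_score_adj (clamped
 * by the kernel to [-1000, 1000]) and cgroup.procs are real interfaces. */
#include <stdio.h>
#include <unistd.h>

#define GROUP "/sys/fs/cgroup/memory/mygroup"   /* hypothetical group */

static void bump_task(long pid, long delta)
{
    char path[64];
    long adj = 0;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/oom_score_adj", pid);
    f = fopen(path, "r");
    if (!f)
        return;                 /* task already exited */
    if (fscanf(f, "%ld", &adj) != 1)
        adj = 0;
    fclose(f);

    adj += delta;
    if (adj > 1000)
        adj = 1000;

    f = fopen(path, "w");
    if (f) {
        fprintf(f, "%ld\n", adj);
        fclose(f);
    }
}

int main(void)
{
    for (;;) {
        FILE *procs = fopen(GROUP "/cgroup.procs", "r");
        long pid;

        if (procs) {
            while (fscanf(procs, "%ld", &pid) == 1)
                bump_task(pid, 1);
            fclose(procs);
        }
        sleep(60);
    }
}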

> > We also have a handful of features that depend on OOM handling like the
> > aforementioned automatically growing and changing the actual OOM score
> > depending on usage in relation to various thresholds (e.g. we sold you X,
> > and we allow you to go over X but if you do, your likelihood of death in
> > case of system OOM goes up).
>
> You can trivially monitor threshold events from userspace with the
> existing infrastructure and accordingly update the per-memcg score.
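
For reference, that registration is tiny with the existing v1 memcg
threshold interface (an eventfd plus one line written to
cgroup.event_control).  A minimal sketch; the group path and the 512M
threshold are made-up examples:

/* Sketch: arm a memory-usage threshold notification on one memcg and wait
 * for it.  Group path and threshold are examples only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    const char *memcg = "/sys/fs/cgroup/memory/mygroup";    /* hypothetical */
    char path[256], line[64];
    uint64_t count;
    int efd, ufd, cfd;

    efd = eventfd(0, 0);
    if (efd < 0)
        return 1;

    snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", memcg);
    ufd = open(path, O_RDONLY);

    snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg);
    cfd = open(path, O_WRONLY);
    if (ufd < 0 || cfd < 0)
        return 1;

    /* "<event_fd> <fd of memory.usage_in_bytes> <threshold>" arms it. */
    snprintf(line, sizeof(line), "%d %d %llu", efd, ufd, 512ULL << 20);
    if (write(cfd, line, strlen(line)) < 0)
        return 1;

    /* Blocks until usage crosses the threshold; then adjust scores. */
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("threshold crossed, update scores here\n");

    return 0;
}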

> > Do you really want us to teach the kernel policies like this?  It would
> > be way easier to do and test in userspace.
>
> Maybe.  Providing fragments of your solution is not an efficient way
> to communicate the problem.  And you have to sell the problem before
> anybody can be expected to even consider your proposal as one of the
> possible solutions.