We have hierarchical "containers".  Jobs exist in these containers.  The
containers can hold sub-containers.

In case of system OOM we want to kill in strict priority order.  Starting
from the root of the hierarchy, choose the lowest-priority child.  This
could be a task or a memcg.  If it is a memcg, recurse into it.
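
A rough sketch of that walk, in Python for brevity.  This is only an
illustration, not the kernel code: the Task and Memcg classes, the
"priority" field, the convention that a lower number dies first, and
skipping over an empty memcg are all assumptions made for the example.
Ties are simply left in list order here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Task:
    pid: int
    priority: int  # assumption: lower value means "kill first"


@dataclass
class Memcg:
    name: str
    priority: int  # compared only against siblings at the same level
    children: List[Union["Memcg", Task]] = field(default_factory=list)


def pick_victim(node: Memcg) -> Optional[Task]:
    """Descend from `node`, always preferring the lowest-priority child."""
    # Siblings (tasks and memcgs alike) are compared on one scale.
    for child in sorted(node.children, key=lambda c: c.priority):
        if isinstance(child, Task):
            return child
        # The child is a memcg: recurse into it.
        victim = pick_victim(child)
        if victim is not None:
            return victim
        # Assumption: an empty memcg has nothing to kill, so try the
        # next-lowest sibling instead.
    return None


root = Memcg("root", 0, [
    Task(1, 50),
    Memcg("batch", 10, [Task(2, 7), Task(3, 3)]),
    Memcg("prod", 90, [Task(4, 1)]),
])
victim = pick_victim(root)
```

Note that pid 4 has the lowest number overall but is not chosen: its
parent "prod" outranks "batch" at the root, so the walk never goes there.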

We CAN do it in kernel (in fact we do, and I argued for that, and David
acquiesced).  But doing it in kernel means changes are slow and risky.

What we really have is a bunch of features that we offer to our users that
need certain OOM-time behaviors and guarantees to be implemented.  I don't
expect that most of our changes are useful for anyone outside of Google,
really. They come with a lot of environmental assumptions.  This is why
David finally convinced me it was easier to release changes, to fix bugs,
and to update kernels if we do this in userspace.

I apologize if I am not giving you what you want.  I am typing on a phone
at the moment.  If this still doesn't help I can try from a computer later.

Tim
On Dec 7, 2013 11:07 AM, "Johannes Weiner" <hannes@cmpxchg.org> wrote:

> On Sat, Dec 07, 2013 at 10:12:19AM -0800, Tim Hockin wrote:
> > You more or less described the fundamental change - a score per memcg,
> > with a recursive OOM killer which evaluates scores between siblings at
> > the same level.
> >
> > It gets a bit complicated because we have need of wider scoring ranges
> > than are provided by default
>
> If so, I'm sure you can make a convincing case to widen the internal
> per-task score ranges.  The per-memcg score ranges have not even been
> defined, so this is even easier.
>
> > and because we score PIDs against memcgs at a given scope.
>
> You are describing bits of a solution, not a problem.  And I can't
> possibly infer a problem from this.
>
> > We also have some tiebreaker heuristic (age).
>
> Either periodically update the per-memcg score from userspace or
> implement this in the kernel.  We have considered CPU usage
> history/runtime etc. in the past when picking an OOM victim task.
>
> But I'm again just speculating what your problem is, so this may or
> may not be a feasible solution.
>
> > We also have a handful of features that depend on OOM handling like the
> > aforementioned automatically growing and changing the actual OOM score
> > depending on usage in relation to various thresholds (e.g. we sold you
> > X, and we allow you to go over X but if you do, your likelihood of death
> > in case of system OOM goes up).
>
> You can trivially monitor threshold events from userspace with the
> existing infrastructure and accordingly update the per-memcg score.
>
> > Do you really want us to teach the kernel policies like this?  It would
> > be way easier to do and test in userspace.
>
> Maybe.  Providing fragments of your solution is not an efficient way
> to communicate the problem.  And you have to sell the problem before
> anybody can be expected to even consider your proposal as one of the
> possible solutions.
>