We have hierarchical "containers". Jobs exist in these containers. The
containers can hold sub-containers.

In case of system OOM we want to kill in strict priority order. From the
root of the hierarchy, choose the lowest priority. This could be a task or
a memcg. If a memcg, recurse.

We CAN do it in kernel (in fact we do, and I argued for that, and David
acquiesced). But doing it in kernel means changes are slow and risky.

What we really have is a bunch of features that we offer to our users that
need certain OOM-time behaviors and guarantees to be implemented. I don't
expect that most of our changes are useful for anyone outside of Google,
really. They come with a lot of environmental assumptions. This is why
David finally convinced me it was easier to release changes, to fix bugs,
and to update kernels if we do this in userspace.

I apologize if I am not giving you what you want. I am typing on a phone
at the moment. If this still doesn't help I can try from a computer later.

Tim

On Dec 7, 2013 11:07 AM, "Johannes Weiner" wrote:

> On Sat, Dec 07, 2013 at 10:12:19AM -0800, Tim Hockin wrote:
> > You more or less described the fundamental change - a score per memcg,
> > with a recursive OOM killer which evaluates scores between siblings at
> > the same level.
> >
> > It gets a bit complicated because we have need of wider scoring ranges
> > than are provided by default
>
> If so, I'm sure you can make a convincing case to widen the internal
> per-task score ranges. The per-memcg score ranges have not even been
> defined, so this is even easier.
>
> > and because we score PIDs against memcgs at a given scope.
>
> You are describing bits of a solution, not a problem. And I can't
> possibly infer a problem from this.
>
> > We also have some tiebreaker heuristic (age).
>
> Either periodically update the per-memcg score from userspace or
> implement this in the kernel. We have considered CPU usage
> history/runtime etc. in the past when picking an OOM victim task.
>
> But I'm again just speculating what your problem is, so this may or
> may not be a feasible solution.
>
> > We also have a handful of features that depend on OOM handling like the
> > aforementioned automatically growing and changing the actual OOM score
> > depending on usage in relation to various thresholds (e.g. we sold you
> > X, and we allow you to go over X but if you do, your likelihood of
> > death in case of system OOM goes up).
>
> You can trivially monitor threshold events from userspace with the
> existing infrastructure and accordingly update the per-memcg score.
>
> > Do you really want us to teach the kernel policies like this? It would
> > be way easier to do and test in userspace.
>
> Maybe. Providing fragments of your solution is not an efficient way
> to communicate the problem. And you have to sell the problem before
> anybody can be expected to even consider your proposal as one of the
> possible solutions.
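
For reference, the selection rule at the top of this mail (start at the
root, pick the lowest-priority entry among siblings, and recurse if that
entry is a memcg) can be sketched in userspace terms roughly as follows.
This is only an illustration. The Task and Memcg classes, the pick_victim
name, the "lower priority value is killed first" convention, and the
"youngest entry loses a tie" rule are all assumptions made for the sketch;
none of it is the actual implementation or a kernel interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Task:
    pid: int
    priority: int      # assumption: lower value = killed first
    start_time: float  # larger value = younger


@dataclass
class Memcg:
    name: str
    priority: int
    start_time: float
    children: List[Union["Task", "Memcg"]] = field(default_factory=list)


def pick_victim(group: Memcg) -> Optional[Task]:
    """Descend from `group`, always into the lowest-priority sibling."""
    # Tasks and memcgs at the same level are scored against each other.
    # Ties on priority are broken by age; killing the youngest first is
    # an assumption, the mail only says that age is the tiebreaker.
    candidates = sorted(group.children,
                        key=lambda c: (c.priority, -c.start_time))
    for child in candidates:
        if isinstance(child, Task):
            return child
        victim = pick_victim(child)
        if victim is not None:  # skip memcgs that contain no tasks
            return victim
    return None


# Hypothetical hierarchy: one task and one sub-container under the root.
root = Memcg("root", 0, 0.0, [
    Task(pid=100, priority=5, start_time=10.0),
    Memcg("batch", 1, 20.0, [
        Task(pid=200, priority=9, start_time=30.0),
        Task(pid=201, priority=2, start_time=40.0),
        Task(pid=202, priority=2, start_time=50.0),
    ]),
])

victim = pick_victim(root)
print(victim.pid)
```

Note that pid 100 survives even though its own priority (5) is lower than
that of pid 200 (9): the "batch" memcg as a whole ranks below it at the
root level, so the search never compares tasks across levels. That is the
"strict priority order" property, as opposed to a single flat score.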