Date: Wed, 5 Mar 2014 18:52:22 -0800 (PST)
From: David Rientjes
Subject: Re: [patch 00/11] userspace out of memory handling
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki, Christoph Lameter,
    Pekka Enberg, Tejun Heo, Mel Gorman, Oleg Nesterov, Rik van Riel,
    Jianguo Wu, Tim Hockin, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org

On Wed, 5 Mar 2014, Andrew Morton wrote:

> > This patchset introduces a standard interface through memcg that allows
> > both of these conditions to be handled in the same clean way: users set
> > memory.oom_reserve_in_bytes to define the reserve, and this amount is
> > allowed to be overcharged to the memcg of the process handling the oom
> > condition.  For the root memcg, this amount is allowed to be allocated
> > below the per-zone watermarks by root processes that are handling such
> > conditions (only root may write to cgroup.event_control for the root
> > memcg).
> 
> If process A is trying to allocate memory, cannot do so and the
> userspace oom-killer is invoked, there must be means via which process
> A waits for the userspace oom-killer's action.

It does so by relooping in the page allocator, waiting for memory to be
freed, just as it would if the kernel oom killer had been invoked and
process A were waiting for the oom kill victim, process B, to exit.  We
don't have the ability to put it on a waitqueue because we don't touch
the freeing hotpath.  The userspace oom handler may not necessarily kill
anything at all; it may, for example, be able to free its own memory and
start throttling other processes.

> And there must be
> fallbacks which occur if the userspace oom killer fails to clear the
> oom condition, or times out.
> 

I agree completely, and I proposed this before as
memory.oom_delay_millisecs at http://lwn.net/Articles/432226, which we
use internally when memory can't be freed or a memcg's limit cannot be
expanded.  I guess it makes more sense alongside the rest of this
patchset now; I can add it as an additional patch next time around.

> Would be interested to see a description of how all this works.
> 

There's also an LWN article being developed on this topic.
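To make that delay fallback concrete: once such a patch exists, arming it
would just be a write to the proposed per-memcg control file, along the
lines of the sketch below.  This is only a sketch: memory.oom_delay_millisecs
is proposed in this thread, not present in mainline, and the
/sys/fs/cgroup/memory mount point mentioned in the comment is an assumption
about where the memory cgroup is mounted.  The intent, as discussed above,
is that the kernel oom killer only steps in if the userspace handler has
not resolved the condition within that delay.

/*
 * Illustrative only: memory.oom_delay_millisecs is the proposed fallback
 * control discussed above, not an existing mainline file.
 */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[PATH_MAX];
	int fd;

	if (argc != 3) {
		/* e.g. <memcg path> = /sys/fs/cgroup/memory (assumed mount) */
		fprintf(stderr, "usage: %s <memcg path> <millisecs>\n", argv[0]);
		return 1;
	}

	snprintf(path, PATH_MAX, "%s/memory.oom_delay_millisecs", argv[1]);
	fd = open(path, O_WRONLY);
	if (fd == -1) {
		perror("open");
		return 1;
	}
	if (write(fd, argv[2], strlen(argv[2])) < 0) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}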
As mentioned in that article, I think it would be best to generalize a
lot of the common functions, and the eventfd handling entirely, into a
library.  I've attached an example implementation that simply invokes a
function to handle the situation.

For Google's use case specifically, at the root memcg level (system oom)
we want to do priority-based memcg killing: we want to kill from within
the memcg hierarchy that has the lowest priority relative to other
memcgs.  This cannot be implemented with /proc/pid/oom_score_adj today.
Those priorities may also change depending on whether a memcg hierarchy
is "overlimit", i.e. its limit has been increased temporarily because it
has hit a memcg oom and additional memory is readily available on the
system.

So why not just introduce a memcg tunable that specifies a priority?
Well, it's not that simple.  Other users will want to implement different
policies on system oom (think of the existing panic_on_oom or
oom_kill_allocating_task sysctls).  I originally introduced
oom_kill_allocating_task for SGI because they wanted a fast oom kill
rather than an expensive tasklist scan: the allocating task itself is
rather irrelevant, it was just the unlucky task that happened to be
allocating at the moment oom was triggered.  What is guaranteed is that
killing current in that case will always free memory that relieves the
oom condition (it's not a member of some other mempolicy or cpuset that
would be needlessly killed).  Both sysctls could trivially be
reimplemented in userspace with this feature.

I have other customers who don't run in a memcg environment at all; they
simply reattach all processes to the root memcg and delete all other
memcgs.  These customers are only concerned about system oom conditions
and want to do something "interesting" before a process is killed.  Some
want to log the VM statistics as an artifact to examine later, some want
to examine heap profiles, and others can start throttling and freeing
memory rather than kill anything.  All of this is impossible today
because the kernel oom killer will simply kill something immediately,
and any stats we collect afterwards don't represent the oom condition:
the heap profiles are lost, throttling is useless, etc.  Jianguo (cc'd)
may also have use cases not described here.

> It is unfortunate that this feature is memcg-only.  Surely it could
> also be used by non-memcg setups.  Would like to see at least a
> detailed description of how this will all be presented and implemented.
> We should aim to make the memcg and non-memcg userspace interfaces and
> user-visible behaviour as similar as possible.
> 

It's memcg-only because that way it can handle both system and memcg oom
conditions with the same clean interface.  It would be possible to
implement only system oom condition handling through procfs (a little
sloppy, since it would still need to register the eventfd), but then a
userspace oom handler would need to determine which interface to use
based on whether it was running in a memcg or non-memcg environment.

I implemented this feature with userspace in mind: I didn't want it to
need two different implementations to do the same thing depending on
memcg.  The way it is written, a userspace oom handler neither knows nor
needs to care whether it is constrained by the amount of system RAM or
by a memcg limit.  It can simply write the reserve to its own memcg's
memory.oom_reserve_in_bytes, attach to memory.oom_control, and be done.
This does mean that memcg needs to be enabled for the support, though.
That is already the case on most distributions; the cgroup just needs to
be mounted.
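To make that sequence concrete, the reserve is just one more control-file
write before registration.  A minimal sketch follows, assuming the memory
cgroup is already mounted and that memory.oom_reserve_in_bytes from this
patchset is present; the path and value in the comment are illustrative.
The handler then registers on memory.oom_control exactly as
register_oom_notifier() in the attached liboom.c does.

/* Hypothetical helper, not part of the patchset or the attached library. */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the oom reserve for the handler's own memcg, e.g.
 * set_oom_reserve("/sys/fs/cgroup/memory/oom_handler", "33554432").
 */
int set_oom_reserve(const char *memcg, const char *bytes)
{
	char path[PATH_MAX];
	ssize_t len = strlen(bytes);
	int fd;

	snprintf(path, PATH_MAX, "%s/memory.oom_reserve_in_bytes", memcg);
	fd = open(path, O_WRONLY);
	if (fd == -1) {
		perror("open");
		return -1;
	}
	if (write(fd, bytes, len) != len) {
		perror("write");
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}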
Would it be better to duplicate the interface in two different spots
depending on CONFIG_MEMCG?  I didn't think so, and I think the idea
proposed on LWN of a userspace library that takes care of this
registration (and mounting, perhaps) would be the best of both worlds.

> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
> I'll cherrypick those, OK?
> 

Ok!  I'm hoping that the PF_MEMPOLICY bit that is removed in those
patches is at least temporarily reserved for the PF_OOM_HANDLER
introduced here; I removed it purposefully :)

[attachment: liboom.c (text/x-csrc)]

/*
 * liboom.c: example userspace oom handler.  Registers an eventfd on a
 * memcg's memory.oom_control via cgroup.event_control and invokes a
 * handler on every oom notification.  Usage: liboom <memcg path>
 */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/eventfd.h>
#include <sys/mman.h>
#include <sys/types.h>

#define STRING_MAX	(512)

void handle_oom(void)
{
	printf("notification received\n");
}

int wait_oom_notifier(int eventfd_fd, void (*handler)(void))
{
	uint64_t ret;
	int err;

	for (;;) {
		err = read(eventfd_fd, &ret, sizeof(ret));
		if (err != sizeof(ret)) {
			fprintf(stderr, "read()\n");
			return err;
		}
		handler();
	}
}

int register_oom_notifier(const char *memcg)
{
	char path[PATH_MAX];
	char control_string[STRING_MAX];
	int event_control_fd;
	int control_fd;
	int eventfd_fd;
	int err = 0;

	err = snprintf(path, PATH_MAX, "%s/memory.oom_control", memcg);
	if (err < 0) {
		fprintf(stderr, "snprintf()\n");
		goto out;
	}

	control_fd = open(path, O_RDONLY);
	if (control_fd == -1) {
		fprintf(stderr, "open(): %d\n", errno);
		err = -errno;
		goto out;
	}

	eventfd_fd = eventfd(0, 0);
	if (eventfd_fd == -1) {
		fprintf(stderr, "eventfd(): %d\n", errno);
		err = -errno;
		goto out_close_control;
	}

	/* cgroup.event_control expects "<eventfd fd> <control file fd>" */
	err = snprintf(control_string, STRING_MAX, "%d %d", eventfd_fd,
		       control_fd);
	if (err < 0) {
		fprintf(stderr, "snprintf()\n");
		goto out_close_eventfd;
	}

	err = snprintf(path, PATH_MAX, "%s/cgroup.event_control", memcg);
	if (err < 0) {
		fprintf(stderr, "snprintf()\n");
		goto out_close_eventfd;
	}

	event_control_fd = open(path, O_WRONLY);
	if (event_control_fd == -1) {
		fprintf(stderr, "open(): %d\n", errno);
		err = -errno;
		goto out_close_eventfd;
	}

	write(event_control_fd, control_string, strlen(control_string));
	close(event_control_fd);

	return eventfd_fd;

out_close_eventfd:
	close(eventfd_fd);
out_close_control:
	close(control_fd);
out:
	return err;
}

int main(int argc, char **argv)
{
	int eventfd_fd;
	int err = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path>\n", argv[0]);
		return -1;
	}
	err = mlockall(MCL_FUTURE);
	if (err) {
		fprintf(stderr, "%d\n", errno);
		return -1;
	}

	eventfd_fd = register_oom_notifier(argv[1]);
	if (eventfd_fd < 0) {
		fprintf(stderr, "register_oom_notifier(): %d\n", eventfd_fd);
		err = eventfd_fd;
		goto out;
	}

	err = wait_oom_notifier(eventfd_fd, handle_oom);
	if (err) {
		fprintf(stderr, "wait_oom_notifier()\n");
		goto out;
	}

out:
	munlockall();
	return err;
}