From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 11 Nov 2008 00:14:54 -0800 (PST)
From: David Rientjes
Subject: Re: [RFC][PATCH]Per-cgroup OOM handler
In-Reply-To: <20081111162812.492218fc.kamezawa.hiroyu@jp.fujitsu.com>
Message-ID:
References: <604427e00811031340k56634773g6e260d79e6cb51e7@mail.gmail.com>
 <604427e00811031419k2e990061kdb03f4b715b51fb9@mail.gmail.com>
 <20081106143438.5557b87c.kamezawa.hiroyu@jp.fujitsu.com>
 <604427e00811102042x202906ecq2a10eb5e404e2ec9@mail.gmail.com>
 <20081111162812.492218fc.kamezawa.hiroyu@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
Return-Path:
To: KAMEZAWA Hiroyuki
Cc: Ying Han, linux-mm@kvack.org, Rohit Seth, Paul Menage
List-ID:

Sorry, there's been some confusion in this proposal.

On Tue, 11 Nov 2008, KAMEZAWA Hiroyuki wrote:

> > Here is how we do the one-tick-wait in cgroup_should_oom() in oom_kill.c
> >         if (!ret) {
> >                 /* If we're not going to OOM, we should sleep for a
> >                  * bit to give userspace a chance to respond before we
> >                  * go back and try to reclaim again */
> >                 schedule_timeout_uninterruptible(1);
> >         }
> > and it works well in-house so far as i mentioned earlier. what's
> > important here is not "sleeping for one tick", the idea here is to
> > reschedule the ooming thread so the oom handler can make action ( like
> > adding memory node to the cpuset) and the subsequent page allocator in
> > get_page_from_freelist() can use it.
> >
> Can't we avoid this kind of magical one-tick wait ?
>

cgroup_should_oom() determines whether the oom killer should be invoked
or whether userspace should be given the opportunity to act first; it
returns zero only when the kernel has deferred to userspace.  In these
situations, the kernel will return to the page allocator to attempt the
allocation again.
If current were not rescheduled like this (the schedule_timeout is simply
more powerful than a cond_resched), there is a very high likelihood that
this subsequent allocation attempt would fail just as it did before the
oom killer was triggered and then we'd enter reclaim unnecessarily when
userspace could have reclaimed on its own, killed a task by overriding
the kernel's heuristics, added a node to the cpuset, increased its memcg
allocation, etc.  So this reschedule simply prevents needlessly entering
reclaim, just as its comment indicates.

> > > (Before OOM, the system tend to wait in congestion_wait() or some.)
> >

Yes, we wait on block congestion as part of direct reclaim but at this
point we've yet to notify the userspace oom handler so that it may act to
avoid invoking the oom killer.

> Hmm, from discussion of mem_notify handler in Feb/March of this year,
> oom-hanlder cannot works well if memory is near to OOM, in general.
> Then, mlockall was recomemded to handler.
> (and it must not do file access.)
>

This would be a legitimate point if we were talking about a system-wide
oom notifier like /dev/mem_notify was and we were addressing
unconstrained ooms.  This patch was specific to cgroups and the only
likely usecases are for either cpusets or memcg.

> I wonder creating small cpuset (and isolated node) for oom-handler may be
> another help.
>

This is obviously a pure userspace issue; any sane oom handler that is
itself subjected to the same memory constraints would be written to avoid
memory allocations when woken up.

> > > I'm wondering
> > > - freeeze-all-threads-in-group-at-oom
> > > - free emergency memory to page allocator which was pooled at cgroup
> > > creation
> > > rather than 1-tick wait
> > >
> > > BTW, it seems this patch allows task detach/attach always. it's safe(and
> > > sane) ?
> >
> > yes, we allows task detach/attach. So far we don't see any race condition
> > except the livelock
> > i mentioned above. Any particular scenario can think of now?
> > thanks
> >
> I don't find it ;)
> BTW, shouldn't we disable preempt(or irq) before taking spinlocks ?
>

I don't know which spinlock you're specifically referring to here, but
the oom killer (and thus the handling of the oom handler) is invoked in
process context with irqs enabled.

		David