From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id DB2AF6B01DF for ; Tue, 8 Jun 2010 20:14:48 -0400 (EDT) Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id o590Ekmu024643 for ; Tue, 8 Jun 2010 17:14:47 -0700 Received: from pzk7 (pzk7.prod.google.com [10.243.19.135]) by wpaz5.hot.corp.google.com with ESMTP id o590EiHp012507 for ; Tue, 8 Jun 2010 17:14:45 -0700 Received: by pzk7 with SMTP id 7so2938428pzk.30 for ; Tue, 08 Jun 2010 17:14:44 -0700 (PDT) Date: Tue, 8 Jun 2010 17:14:42 -0700 (PDT) From: David Rientjes Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed In-Reply-To: <20100608130804.8794d029.akpm@linux-foundation.org> Message-ID: References: <20100608130804.8794d029.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > It's possible to livelock the page allocator if a thread has mm->mmap_sem > > What is the state of this thread? Trying to allocate memory, I assume. > Right, which I agree is a bad scenario to be in but indeed does happen (and we have a workaround at Google that identifies these particular cases and kills the holder of the writelock on mm->mmap_sem). We have one thread holding a readlock on mm->mmap_sem while trying to allocate memory so the oom killer becomes a no-op to prevent needless task killing while waiting for the killed task to exit, but that killed task can't exit because it requires a writelock on the same semaphore. > > and fails to make forward progress because the oom killer selects another > > thread sharing the same ->mm to kill that cannot exit until the semaphore > > is dropped. > > > > The oom killer will not kill multiple tasks at the same time; each oom > > killed task must exit before another task may be killed. > > This sounds like a quite risky design. The possibility that we'll > cause other dead/livelocks similar to this one seems pretty high. It > applies to all sleeping locks in the entire kernel, doesn't it? > It applies to any writelock that is taken during the exitpath of an oom killed task if a thread holding a readlock is trying to allocate memory itself. This is how it's always been done at least within the past few years and we haven't had a problem other than with mm->mmap_sem. At one point we used an oom killer timeout to kill other tasks after a period of time had elapsed, but that hasn't been required since we've been killing the thread holding the writelock on mm->mmap_sem. > > Thus, if one > > thread is holding mm->mmap_sem and cannot allocate memory, all threads > > sharing the same ->mm are blocked from exiting as well. In the oom kill > > case, that means the thread holding mm->mmap_sem will never free > > additional memory since it cannot get access to memory reserves and the > > thread that depends on it with access to memory reserves cannot exit > > because it cannot acquire the semaphore. Thus, the page allocators > > livelocks. > > > > When the oom killer is called and current happens to have a pending > > SIGKILL, this patch automatically gives it access to memory reserves and > > returns. Upon returning to the page allocator, its allocation will > > hopefully succeed so it can quickly exit and free its memory. If not, the > > page allocator will fail the allocation if it is not __GFP_NOFAIL. > > You said "hopefully". > "hopefully" in this case means that the allocation better succeed or we've depleted all memory reserves and we're deadlocked, it doesn't mean that this is a speculative change that may or may not work. > Does it actually work? Any real-world testing results? If so, they'd > be a useful addition to the changelog. > It certain does, and prevents needlessly killing another task when we know current is exiting. The nice thing about that is that we don't need to do anything like checking if a child should be sacrified or if current is OOM_DISABLE: we already know it's dying so it should simply get access to memory reserves either to return and handle its pending SIGKILL or continue down the exitpath. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org