From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from post.mail.demon.net (post-20.mail.demon.net [194.217.242.27])
	by kvack.org (8.8.7/8.8.7) with SMTP id RAA07734
	for ; Thu, 4 Dec 1997 17:27:12 -0500
Date: Thu, 4 Dec 1997 10:02:44 +0000 (GMT)
From: Mark Hemment
Reply-To: Mark Hemment
Subject: Re: 2.0.30: Lockups with huge proceses mallocing all VM
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Karl Kleinpaste
Cc: linux-mm
List-ID:

On 3 Dec 1997, Karl Kleinpaste wrote:

> We can reproduce this fairly reliably. Before anyone had looked at
> the code at all closely, some folks were surmising that perhaps Linux
> was not guaranteeing the availability of backing store for freshly-
> allocated pages, and that perhaps eventually Linux was getting stuck
> looking for a free page when none were to be found.

Yep, Linux uses lazy swap-page allocation (some OSes use eager
allocation). This means it can run out of pages.
If a page allocation fails during a page fault, then the faulting
task is killed. If an allocation fails for a "management structure"
(eg. vm_area_struct), then the system-call should fail with EAGAIN.
Unfortunately, not all allocation failures are handled cleanly - some
work may already have been done before the failure, and it is not
completely undone afterwards. An example of this is munmap(), which
may partially unmap the given address range before an allocation
failure (yep, munmap() can cause allocations). Other examples are
mlock() and mprotect() - either may partially succeed. This can
confuse an application, which may end up seg-faulting.

> I'm wondering whether this sort of lockup is analogous to the
> fragmentation lockups recently mentioned by Bill Hawes and others. If
> so, could someone direct me toward Mark Hemment or others doing work
> of this sort?

I was (am) working on reducing the free-page pool fragmentation when
my page-colouring is being used.
It places a lower bound on the fragmentation.

BTW, are you using NFS? This requires largish orders of contiguous
pages from the page allocator, and I believe this can cause NFS to
stall the machine until it gets the required allocations - but I might
be wrong here, and/or this 'feature' may only be in 2.1.xx....

> I'm perfectly willing to wade into the kernel mem.mgmt code to figure
> out what I can about this, though it sounds like others may be way out
> in front on the issue. In the meantime, we're working around the
> problem as best we can by imposing datasize limits (via ulimit) since
> the problem only presents itself when the machine is out of aggregate
> VM anyway -- it doesn't matter if we make this lone process die as
> long as the machine as a whole survives.

I think the bug is not in the VM sub-system, but the lack of
available pages is causing some other sub-system to lock your box.
If you have a free test target, try using the latest 2.1.x to see if
the problem is still there.

Regards,

  markhe
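
A toy model of the lazy versus eager allocation behaviour described
above may make the difference concrete. This is only an illustrative
sketch and is not taken from the kernel: ToyVM, TaskKilled, allocate()
and touch() are all hypothetical names, and the model just counts
pages. It assumes "eager" means backing store is reserved when the
region is created, and "lazy" means a page is only claimed on first
touch.

```python
class TaskKilled(Exception):
    """Toy stand-in for the kernel killing a task whose page fault fails."""


class ToyVM:
    """Hypothetical model: counts pages only, no real memory is involved."""

    def __init__(self, total_pages, eager=False):
        self.total_pages = total_pages  # RAM plus swap, in pages
        self.eager = eager              # True: reserve backing store up front
        self.reserved = 0               # pages promised to regions (eager only)
        self.used = 0                   # pages actually handed out on faults

    def allocate(self, npages):
        """Model malloc()/mmap(): hand back a region, touching nothing yet."""
        if self.eager:
            if self.reserved + npages > self.total_pages:
                raise MemoryError("eager: no backing store left")
            self.reserved += npages
        # Lazy mode never refuses here; the cost is paid in touch().
        return {"size": npages, "touched": 0}

    def touch(self, region):
        """Model the first write to the next untouched page (a page fault)."""
        if region["touched"] == region["size"]:
            return  # whole region already backed, nothing to fault in
        if self.used == self.total_pages:
            raise TaskKilled("lazy: page fault with no free page")
        self.used += 1
        region["touched"] += 1
```

With total_pages=4, a lazy ToyVM accepts two 3-page allocations and
only fails on the fifth touch(), while an eager ToyVM refuses the
second allocate() with MemoryError before any page is touched.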