From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com [209.85.212.170]) by kanga.kvack.org (Postfix) with ESMTP id 8AF676B0088 for ; Mon, 22 Dec 2014 11:57:39 -0500 (EST) Received: by mail-wi0-f170.google.com with SMTP id bs8so11045682wib.1 for ; Mon, 22 Dec 2014 08:57:39 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id bk1si33667079wjb.171.2014.12.22.08.57.38 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 22 Dec 2014 08:57:38 -0800 (PST) Date: Mon, 22 Dec 2014 17:57:36 +0100 From: Michal Hocko Subject: Re: How to handle TIF_MEMDIE stalls? Message-ID: <20141222165736.GB2900@dhcp22.suse.cz> References: <20141218153341.GB832@dhcp22.suse.cz> <201412192122.DJI13055.OOVSQLOtFHFFMJ@I-love.SAKURA.ne.jp> <20141220020331.GM1942@devil.localdomain> <201412202141.ADF87596.tOSLJHFFOOFMVQ@I-love.SAKURA.ne.jp> <20141220223504.GI15665@dastard> <201412211745.ECD69212.LQOFHtFOJMSOFV@I-love.SAKURA.ne.jp> <20141221204249.GL15665@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141221204249.GL15665@dastard> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Tetsuo Handa , dchinner@redhat.com, linux-mm@kvack.org, rientjes@google.com, oleg@redhat.com On Mon 22-12-14 07:42:49, Dave Chinner wrote: [...] > "memory reclaim gave up"? So why the hell isn't it returning a > failure to the caller? > > i.e. We have a perfectly good page cache allocation failure error > path here all the way back to userspace, but we're invoking the > OOM-killer to kill random processes rather than returning ENOMEM to > the processes that are generating the memory demand? > > Further: when did the oom-killer become the primary method > of handling situations when memory allocation needs to fail? > __GFP_WAIT does *not* mean memory allocation can't fail - that's what > __GFP_NOFAIL means. And none of the page cache allocations use > __GFP_NOFAIL, so why aren't we getting an allocation failure before > the oom-killer is kicked? Well, it has been an unwritten rule that GFP_KERNEL allocations for low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail. This is a long ago decision which would be tricky to fix now without silently breaking a lot of code. Sad... Nevertheless the caller can prevent from an endless loop by using __GFP_NORETRY so this could be used as a workaround. The default should be opposite IMO and only those who really require some guarantee should use a special flag for that purpose. > > I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0 > > so that __alloc_pages_may_oom() will not be called easily. As long as > > try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might > > return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called > > for many times and is likely to return non-zero. And when > > __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting > > for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write() > > and I see no further progress. > > Of course - TIF_MEMDIE doesn't do anything to the task that is > blocked, and the SIGKILL signal can't be delivered until the syscall > completes or the kernel code checks for pending signals and handles > EINTR directly. Mutexes are uninterruptible by design so there's no > EINTR processing, hence the oom killer cannot make progress when > everything is blocked on mutexes waiting for memory allocation to > succeed or fail. > > i.e. until the lock holder exists from direct memory reclaim and > releases the locks it holds, the oom killer will not be able to save > the system. IOWs, the problem is that memory allocation is not > failing when it should.... > > Focussing on the OOM killer here is the wrong way to solve this > problem - the problem that needs to be solved is sane handling of > OOM conditions to avoid needing to invoke the OOM-killer... Completely agreed! [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org