Date: Sun, 25 Nov 2012 13:05:24 +0100
From: Michal Hocko
Subject: Re: memory-cgroup bug
Message-ID: <20121125120524.GB10623@dhcp22.suse.cz>
In-Reply-To: <20121125011047.7477BB5E@pobox.sk>
To: azurIt
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups mailinglist, KAMEZAWA Hiroyuki

[Adding Kamezawa into CC]

On Sun 25-11-12 01:10:47, azurIt wrote:
> > Could you take few snapshots over time?
>
> Here it is, now from different server, snapshot was taken every second
> for 10 minutes (hope it's enough):
> www.watchdog.sk/lkml/memcg-bug-2.tar.gz

Hmm, interesting:

$ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff
] mem_cgroup_handle_oom+0x241/0x3b0
546 [] do_truncate+0x58/0xa0
533 [] 0xffffffffffffffff

Tells us that the stacks are pretty much stable.

$ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
    546 24495

So 24495 is stuck in do_truncate:

[] do_truncate+0x58/0xa0
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

I suspect it is waiting for i_mutex.
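(As an aside: the awk one-liner above is truncated in the archive, but its intent is just to take the delta of memory.failcnt between successive snapshots and report min/max/average. A sketch of the same computation in Python, with made-up failcnt values standing in for the real snapshots from memcg-bug-2.tar.gz:)

```python
# Hypothetical memory.failcnt readings, one per one-second snapshot.
# The real values come from the */memory.failcnt files in the tarball.
failcnts = [100000, 118281, 142329, 166377, 190425]

# Every failed charge attempt increments failcnt, so a large delta per
# second means tasks are retrying the charge in a tight loop rather
# than sitting idle on the OOM wait queue.
diffs = [b - a for a, b in zip(failcnts, failcnts[1:])]
print("min:", min(diffs), "max:", max(diffs), "avg:", sum(diffs) / len(diffs))
```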
Who is holding that lock? The other tasks blocked in mem_cgroup_handle_oom
either come from the page fault path, so i_mutex can be excluded for them,
or from vfs_write (24796), and that one is interesting:

[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0	# takes &inode->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0xffffffffffffffff

This smells like a deadlock, but a strange kind of one. The rapidly
increasing failcnt suggests that somebody is still trying to charge
memory - but who, when all of the tasks hang in mem_cgroup_handle_oom?

This can be explained, though. The memcg OOM killer lets only one process
(the one which manages to lock the hierarchy via mem_cgroup_oom_lock) call
mem_cgroup_out_of_memory and kill a process, while the others wait on the
wait queue. Once the killer is done it calls memcg_wakeup_oom, which wakes
up the other tasks waiting on the queue. Those retry the charge in the
hope that some memory has been freed in the meantime. That hasn't happened
here, so they run into OOM again (and again and again).

This all usually works out, except in this particular case: I would bet
my hat that the OOM-selected task is pid 24495, which is blocked on the
mutex held by one of the OOM killer tasks, so it cannot finish - and thus
cannot free any memory.

It seems that the current Linus' tree is affected as well, and it is not
just ext3 that is affected. I will have to think about a solution, but it
sounds really tricky. I guess we need to tell mem_cgroup_cache_charge
that it should never reach OOM from add_to_page_cache_locked. This sounds
quite intrusive to me.
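(The suspected cycle is easier to see outside the kernel. Below is a minimal userspace model in Python threads - all names and timeouts are invented for illustration, this is not kernel code. The "writer" plays pid 24796, which takes i_mutex and then waits in the OOM path for memory to be freed; the "victim" plays pid 24495, which was selected by the OOM killer but is stuck on that same i_mutex in do_truncate and so can never exit:)

```python
import threading

i_mutex = threading.Lock()
writer_in_oom = threading.Event()  # writer holds i_mutex and is waiting
memory_freed = threading.Event()   # would be set if the victim could exit

def writer():
    # Models pid 24796: generic_file_aio_write takes i_mutex, then the
    # charge fails and mem_cgroup_handle_oom waits for memory to show up.
    with i_mutex:
        writer_in_oom.set()
        memory_freed.wait(timeout=2.0)  # never signalled -> times out

def victim():
    # Models pid 24495: picked by the memcg OOM killer, but do_truncate
    # blocks on the i_mutex the writer already holds, so it cannot die
    # and free the memory the writer is waiting for.
    writer_in_oom.wait()
    if i_mutex.acquire(timeout=0.5):
        memory_freed.set()
        i_mutex.release()

threads = [threading.Thread(target=writer), threading.Thread(target=victim)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("deadlock reproduced:", not memory_freed.is_set())
```

In the kernel there are no timeouts, of course - the writer would wait in mem_cgroup_handle_oom forever, which is exactly what the stuck stacks in the snapshots show.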
On the other hand it is really weird that an excessive writer might
trigger a memcg OOM killer.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ .
Don't email: email@kvack.org