From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f48.google.com (mail-pa0-f48.google.com [209.85.220.48]) by kanga.kvack.org (Postfix) with ESMTP id F2F936B0035 for ; Mon, 14 Jul 2014 07:01:43 -0400 (EDT) Received: by mail-pa0-f48.google.com with SMTP id bj1so1762469pad.21 for ; Mon, 14 Jul 2014 04:01:43 -0700 (PDT) Received: from fgwmail6.fujitsu.co.jp (fgwmail6.fujitsu.co.jp. [192.51.44.36]) by mx.google.com with ESMTPS id ko1si8902890pbd.115.2014.07.14.04.01.42 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 14 Jul 2014 04:01:42 -0700 (PDT) Received: from kw-mxauth.gw.nic.fujitsu.com (unknown [10.0.237.134]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 6DE8F3EE0BD for ; Mon, 14 Jul 2014 20:01:41 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.nic.fujitsu.com [10.0.50.94]) by kw-mxauth.gw.nic.fujitsu.com (Postfix) with ESMTP id 811BDAC042A for ; Mon, 14 Jul 2014 20:01:40 +0900 (JST) Received: from g01jpfmpwyt01.exch.g01.fujitsu.local (g01jpfmpwyt01.exch.g01.fujitsu.local [10.128.193.38]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 29AE41DB8032 for ; Mon, 14 Jul 2014 20:01:40 +0900 (JST) Message-ID: <53C3B867.2020005@jp.fujitsu.com> Date: Mon, 14 Jul 2014 20:00:55 +0900 From: Masayoshi Mizuma MIME-Version: 1.0 Subject: Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages. References: <53A0013A.1010100@jp.fujitsu.com> <20140617132609.GI9508@dastard> <53A15DC7.50001@jp.fujitsu.com> <53A7D6CC.1040605@jp.fujitsu.com> <20140624220530.GD1976@devil.localdomain> In-Reply-To: <20140624220530.GD1976@devil.localdomain> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: david@fromorbit.com Cc: linux-mm@kvack.org, xfs@oss.sgi.com Hi Dave, Thank you for your comment! and I apologize for my delayed response. As your comment, I have investigated again the RHEL7 crash dump why the processes which doing direct memory reclaim are stuck at shrink_inactive_list(). Then, I found the reason that the processes and kswapd are trying to free page caches from a zone despite the number of inactive file pages is very very small (40 pages). kswapd moved inactive file pages to isolate file pages to free the pages at shrink_inactive_list(). As the result, NR_INACTIVE_FILE was 0 and NR_ISOLATED_FILE was 40. Therefore, no one can increase NR_INACTIVE_FILE or decrease NR_ISOLATED_FILE, so the system hangs up. In such situation, we should not try to free inactive file pages because kswapd and direct memory reclaimer can move inactive file pages to isolate file pages up to 32 pages. And, I found why the problems did not happen on the upstream kernel. The problems did not happen because of the following commit. --- commit 623762517e2370be3b3f95f4fe08d6c063a49b06 Author: Johannes Weiner Date: Tue May 6 12:50:07 2014 -0700 revert "mm: vmscan: do not swap anon pages just because free+file is low" --- Thank you so much! Masayoshi Mizuma On Wed, 25 Jun 2014 08:05:30 +1000 Dave Chinner wrote: > On Mon, Jun 23, 2014 at 04:27:08PM +0900, Masayoshi Mizuma wrote: >> Hi Dave, >> >> (I removed CCing xfs and linux-mm. And I changed your email address >> to @redhat.com because this email includes RHEL7 kernel stack traces.) > > Please don't do that. There's nothing wrong with posting RHEL7 stack > traces to public lists (though I'd prefer you to reproduce this > problem on a 3.15 or 3.16-rc kernel), and breaking the thread of > discussion makes it impossible to involve the people necessary to > solve this problem. > > I've re-added xfs and linux-mm to the cc list, and taken my redhat > address off it... > > > > [looks at sysrq-w output] > > kswapd0 is blocked in shrink_inactive_list/congestion_wait(). > > kswapd1 is blocked waiting for log space from > shrink_inactive_list(). > > kthreadd is blocked in shrink_inactive_list/congestion_wait trying > to fork another process. > > xfsaild is in uninterruptible sleep, indicating that there is still > metadata to be written to push the log tail to it's required target, > and it will retry again in less than 20ms. > > xfslogd is not blocked, indicating the log has not deadlocked > due to lack of space. > > there are lots of timestamp updates waiting for log space. > > There is one kworker stuck in data IO completion on an inode lock. > > There are several threads blocked on an AGF lock trying to free > extents. > > The bdi writeback thread is blocked waiting for allocation. > > A single xfs_alloc_wq kworker is blocked in > shrink_inactive_list/congestion_wait while trying to read in btree > blocks for transactional modification. Indicative of memory pressure > trashing the working set of cached metadata. waiting for memory > reclaim > - holds agf lock, blocks unlinks > > There are 113 (!) blocked sadc processes - why are there so many > stats gathering processes running? If you stop gathering stats, does > the problem go away? > > There are 54 mktemp processes blocked - what is generating them? > what filesystem are they actually running on? i.e. which XFS > filesystem in the system is having log space shortages? And what is > the xfs_info output of that filesystem i.e. have you simply > oversubscribed a tiny log and so it crawls along at a very slow > pace? > > All of the blocked processes are on CPUs 0-3 i.e. on node 0, which > is handled by kswapd0, which is not blocked waiting for log > space. Hmmm - what is the value of /proc/sys/vm/zone_reclaim_mode? > If it is not zero, does setting it to zero make the problem go away? > > Interestingly enough, for a system under extreme memory pressure, > don't see any processes blocked waiting for swap space or swap IO. > Do you have any swap space configured on this machine? If you > don't, does the problem go away when you add a swap device? > > Overall, I can't see anything that indicates that the filesystem has > actually hung. I can see it having trouble allocating the memory it > needs to make forwards progress, but the system itself is not > deadlocked. Is there any IO being issued when the system is in this > state? If there is Io being issued, then progress is being made and > the system is merely slow because of the extreme memory pressure > generated by the stress test. > > If there is not IO being issued, does the system start making > progress again if you kill one of the memory hogs? i.e. does the > equivalent of triggering an OOM-kill make the system responsive > again? If it does, then the filesystem is not hung and the problem > is that there isn't enough free memory to allow the filesystem to do > IO and hence allow memory reclaim to make progress. In which case, > does increasing /proc/sys/vm/min_free_kbytes make the problem go > away? > > Cheers, > > Dave. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org