Subject: Re: Deadlock possibly caused by too_many_isolated.
From: Minchan Kim
Date: Thu, 21 Oct 2010 08:31:29 +0900
To: Torsten Kaiser
Cc: Wu Fengguang, Neil Brown, Rik van Riel, Andrew Morton, KOSAKI Motohiro,
    KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    "Li, Shaohua", Jens Axboe

On Thu, Oct 21, 2010 at 12:35 AM, Torsten Kaiser wrote:
> On Wed, Oct 20, 2010 at 4:23 PM, Minchan Kim wrote:

<SNIP>

>> What is your problem? (Sorry if you have explained it several times.)
>
> The original problem was that using too many gcc's caused a swapstorm
> that completely hung my system.
> I first blamed it on the workqueue changes in 2.6.36-rc1 and/or their
> interaction with XFS (because a workqueue-related problem in XFS got
> fixed in -rc5), but Tejun Heo found out that a) it was exhausted
> mempools, not the workqueues, and b) the problem itself already existed
> at least in 2.6.35. In
> http://marc.info/?l=linux-raid&m=128699402805191&w=2 I have described a
> simpler testcase that I found after looking more closely into the
> mempools.
>
> Short story: swapping over RAID1 (drivers/md/raid1.c) can hang the
> system, because it uses too much of the fs_bio_set mempool from
> fs/bio.c.
>
>> I read the thread.
>> It seems Wu's patch solved the deadlock caused by holding an FS lock
>> while blocking in too_many_isolated.
>> What is the problem that remains in your case? An unusable system
>> during the swapstorm?
>> If so, I think that is expected behavior. Please see the comment below.
>> (If I didn't catch your point, please explain your problem.)
>
> I do not have a problem with the system becoming unusable *during* a
> swapstorm, but it should recover. That is not the case on my system.
> With Wu's too_many_isolated patch and Neil's patch against raid1.c the
> system no longer seems to be completely stuck (a swapout rate of
> ~80kB every 20 seconds still happens), but I would still expect a
> better recovery time. (At that rate the recovery would probably take a
> few days...)

Now I understand your problem. BTW, Wu's too_many_isolated patch should
be merged regardless of this problem; that's another story.
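Just to make sure I am reading the bio code the same way you are: as far
as I can see, every bio_alloc()/bio_clone() in the kernel shares one
global fs_bio_set, whose emergency reserve only holds BIO_POOL_SIZE (= 2)
bios. A rough sketch of the relevant pieces, from my reading of
include/linux/bio.h and fs/bio.c (simplified, so details may differ):

```c
/* include/linux/bio.h: the shared fallback pool is intentionally tiny;
 * it is only meant to guarantee forward progress for one bio at a time. */
#define BIO_POOL_SIZE 2

/* fs/bio.c (simplified): a single global bio_set backs bio_alloc()
 * and bio_clone() for every caller in the system. */
struct bio_set *fs_bio_set;

static int __init init_bio(void)
{
	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
	if (!fs_bio_set)
		panic("bio: can't allocate bios\n");
	return 0;
}

struct bio *bio_alloc(gfp_t gfp_mask, int nr_iovecs)
{
	/* Under memory pressure the slab allocation fails and this dips
	 * into the 2-entry reserve, so only two such bios can be
	 * outstanding system-wide before the next caller sleeps in
	 * mempool_alloc(). */
	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
}
```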
>
>>> [  437.481365] SysRq : Show Memory
>>> [  437.490003] Mem-Info:
>>> [  437.491357] Node 0 DMA per-cpu:
>>> [  437.500032] CPU    0: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    1: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    2: hi:    0, btch:   1 usd:   0
>>> [  437.500032] CPU    3: hi:    0, btch:   1 usd:   0
>>> [  437.500032] Node 0 DMA32 per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd: 138
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:  30
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>>> [  437.500032] Node 1 DMA32 per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:   0
>>> [  437.500032] Node 1 Normal per-cpu:
>>> [  437.500032] CPU    0: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    1: hi:  186, btch:  31 usd:   0
>>> [  437.500032] CPU    2: hi:  186, btch:  31 usd:  25
>>> [  437.500032] CPU    3: hi:  186, btch:  31 usd:  30
>>> [  437.500032] active_anon:2039 inactive_anon:985233 isolated_anon:682
>>> [  437.500032]  active_file:1667 inactive_file:1723 isolated_file:0
>>> [  437.500032]  unevictable:0 dirty:0 writeback:25387 unstable:0
>>> [  437.500032]  free:3471 slab_reclaimable:2840 slab_unreclaimable:6337
>>> [  437.500032]  mapped:1284 shmem:960501 pagetables:523 bounce:0
>>> [  437.500032] Node 0 DMA free:8008kB min:28kB low:32kB high:40kB
>>> active_anon:0kB inactive_anon:7596kB active_file:12kB inactive_file:0kB
>>> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15768kB
>>> mlocked:0kB dirty:0kB writeback:404kB mapped:0kB shmem:7192kB
>>> slab_reclaimable:32kB slab_unreclaimable:304kB kernel_stack:0kB
>>> pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
>>> pages_scanned:118 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 2004 2004 2004
>>
>> Node 0 DMA: free 8008kB, but lowmem_reserve is 2004 pages (8016kB),
>> so the page allocator can't allocate from this zone unless the
>> preferred zone is DMA.
>>
>>> [  437.500032] Node 0 DMA32 free:2980kB min:4036kB low:5044kB
>>> high:6052kB active_anon:2844kB inactive_anon:1918424kB
>>> active_file:3428kB inactive_file:3780kB unevictable:0kB
>>> isolated(anon):1232kB isolated(file):0kB present:2052320kB mlocked:0kB
>>> dirty:0kB writeback:72016kB mapped:2232kB shmem:1847640kB
>>> slab_reclaimable:5444kB slab_unreclaimable:13508kB kernel_stack:744kB
>>> pagetables:864kB unstable:0kB bounce:0kB writeback_tmp:0kB
>>> pages_scanned:0 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>>
>> Node 0 DMA32: free 2980kB, but min is 4036kB.
>> Few file LRU pages compared to anon LRU.
>
> In the testcase I fill a tmpfs as fast as I can with data from
> /dev/zero. So nearly everything gets swapped out and only the most
> recently written data from the tmpfs fills all RAM. (I have 4GB RAM,
> the tmpfs is limited to 6GB, and 16 dd's are writing into it.)
>
>> Normally, the allocation could fail here.
>> 'Normally' means the caller doesn't request alloc_pages with __GFP_HIGH
>> or !__GFP_WAIT.
>> Generally, most call sites don't pass __GFP_HIGH or !__GFP_WAIT.
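To make the watermark check I keep referring to concrete, here is a
simplified model of the test the page allocator applies (loosely based on
__zone_watermark_ok() in mm/page_alloc.c; the real code also adjusts the
watermark for ALLOC_HIGH/ALLOC_HARDER and for higher-order allocations,
so treat this only as a sketch):

```c
#include <stdbool.h>

/*
 * Simplified model of the per-zone allocation check.  All values are in
 * pages; lowmem_reserve is the entry that corresponds to the caller's
 * preferred ("classzone") zone.
 */
static bool zone_has_enough_free(unsigned long free_pages,
				 unsigned long min_watermark,
				 unsigned long lowmem_reserve)
{
	/* A normal (__GFP_WAIT, !__GFP_HIGH) allocation falls back to
	 * reclaim once free pages drop to the watermark plus the reserve
	 * kept back for lower-zone fallback. */
	return free_pages > min_watermark + lowmem_reserve;
}

/*
 * Plugging in the numbers from the dump above (1 page = 4kB):
 *   Node 0 DMA:   free 2002, min 7, reserve 2004 (for DMA32/Normal callers)
 *                 2002 <= 7 + 2004 -> refused unless the preferred zone is DMA
 *   Node 0 DMA32: free  745, min 1009, reserve 0
 *                  745 <= 1009     -> normal allocations must reclaim first
 */
```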
>>
>>> [  437.500032] Node 1 DMA32 free:2188kB min:3036kB low:3792kB
>>> high:4552kB active_anon:0kB inactive_anon:1555368kB active_file:0kB
>>> inactive_file:28kB unevictable:0kB isolated(anon):768kB
>>> isolated(file):0kB present:1544000kB mlocked:0kB dirty:0kB
>>> writeback:21160kB mapped:0kB shmem:1534960kB slab_reclaimable:3728kB
>>> slab_unreclaimable:7076kB kernel_stack:8kB pagetables:0kB unstable:0kB
>>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 505 505
>>
>> Node 1 DMA32: free 2188kB, min 3036kB.
>> It's the same situation as Node 0 DMA32.
>> Normally, the allocation could fail here.
>> Few file LRU pages compared to anon LRU.
>>
>>> [  437.500032] Node 1 Normal free:708kB min:1016kB low:1268kB
>>> high:1524kB active_anon:5312kB inactive_anon:459544kB
>>> active_file:3228kB inactive_file:3084kB unevictable:0kB
>>> isolated(anon):728kB isolated(file):0kB present:517120kB mlocked:0kB
>>> dirty:0kB writeback:7968kB mapped:2904kB shmem:452212kB
>>> slab_reclaimable:2156kB slab_unreclaimable:4460kB kernel_stack:200kB
>>> pagetables:1228kB unstable:0kB bounce:0kB writeback_tmp:0kB
>>> pages_scanned:9678 all_unreclaimable? no
>>> [  437.500032] lowmem_reserve[]: 0 0 0 0
>>
>> Node 1 Normal: free 708kB, min 1016kB.
>> Normally, the allocation could fail here.
>> Few file LRU pages compared to anon LRU.
>>
>>> [  437.500032] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 3*64kB 4*128kB
>>> 4*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8008kB
>>> [  437.500032] Node 0 DMA32: 27*4kB 15*8kB 8*16kB 8*32kB 7*64kB
>>> 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2980kB
>>> [  437.500032] Node 1 DMA32: 1*4kB 6*8kB 3*16kB 1*32kB 0*64kB 1*128kB
>>> 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2308kB
>>> [  437.500032] Node 1 Normal: 39*4kB 13*8kB 10*16kB 3*32kB 1*64kB
>>> 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 708kB
>>> [  437.500032] 989289 total pagecache pages
>>> [  437.500032] 25398 pages in swap cache
>>> [  437.500032] Swap cache stats: add 859204, delete 833806, find 28/39
>>> [  437.500032] Free swap  = 9865628kB
>>> [  437.500032] Total swap = 10000316kB
>>> [  437.500032] 1048575 pages RAM
>>> [  437.500032] 33809 pages reserved
>>> [  437.500032] 7996 pages shared
>>> [  437.500032] 1008521 pages non-shared
>>>
>> None of the zones has enough free pages, and all of them have few file
>> LRU pages.
>> So swapout is expected behavior, I think.
>> It means your workload exceeds your system's available DRAM size.
>
> Yes, as intended. I wanted to generate many writes to a RAID1 device
> under memory pressure to show/verify that the current use of mempools
> in raid1.c is buggered.
>
> It is not really a sane workload; it is literally just there to create
> a swapstorm and then see if the system survives it.
>
> The problem is that the system is not surviving it: bio allocations
> fail in raid1.c and it falls back to the fs_bio_set mempool. But that
> mempool is only 2 entries big, because you should only ever use one of
> its entries at a time. The current mainline code in raid1.c, however,
> allocates one bio per drive before submitting any of them -> that bug
> is fixed by Neil's patch, and I would have expected it to fix my hang.
> But it seems there is an additional problem, so the mempool still gets
> emptied. And that means no writeback happens any longer, so the kernel
> can't swap out and gets stuck.

That's what I missed: why there are so many writeback pages in the log.
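If I understand your description of the write path correctly, the
problematic pattern is roughly the following (my own sketch of what you
describe, not the actual raid1.c code; MAX_DISKS and the function name
are made up):

```c
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/gfp.h>

#define MAX_DISKS 4	/* made-up bound, just for the sketch */

/*
 * Sketch of the pattern described above: one clone per mirror is
 * allocated *before* any of them is submitted.  Each clone that misses
 * the slab comes out of the shared 2-entry fs_bio_set reserve, so two
 * in-flight writes are already enough to drain it, and the next
 * bio_clone() sleeps in mempool_alloc() waiting for a bio that can
 * never complete, because nothing has been submitted yet.
 */
static void raid1_style_write(struct bio *master_bio, int nr_disks)
{
	struct bio *clones[MAX_DISKS];
	int i;

	for (i = 0; i < nr_disks; i++)
		clones[i] = bio_clone(master_bio, GFP_NOIO);	/* may dip into fs_bio_set */

	for (i = 0; i < nr_disks; i++)
		generic_make_request(clones[i]);	/* submission happens only here */
}
```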
Thanks for the kind explanation.

> I think the last mail from Jens Axboe is the correct answer, rather
> than increasing the fs_bio_set mempool size via BIO_POOL_SIZE.
>
> But should that go even further: forbid any use of bio_alloc() and
> bio_clone() in device drivers? Or at the very least in all device
> drivers that could be used for swapspace?

But as Jens pointed out, "So md and friends should really have a pool
per device, so that stacking will always work properly."

Shouldn't raid1.c create its own bio_set and clone from that, like
setup_clone() in dm.c does, instead of using the global fs_bio_set?
Maybe I am wrong, since I don't have much knowledge about RAID.
(A rough, untested sketch of what I mean is below my signature.)

>
> Torsten
>

--
Kind regards,
Minchan Kim
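Below is the kind of thing I have in mind; it is only a sketch to make
the idea concrete. The struct name r1conf, its bio_set field, the pool
size, and the helper names are all made up, and I haven't checked how
this would fit into the rest of raid1.c:

```c
#include <linux/bio.h>

#define R1_BIO_POOL_SIZE 16	/* arbitrary; enough for a few in-flight requests */

/* Hypothetical per-array private data; the real raid1 conf struct
 * would just grow a bio_set pointer. */
struct r1conf {
	struct bio_set *bio_set;
};

/* Called once at array setup time, like dm does for its per-device pools. */
static int r1_create_bio_set(struct r1conf *conf)
{
	conf->bio_set = bioset_create(R1_BIO_POOL_SIZE, 0);
	return conf->bio_set ? 0 : -ENOMEM;
}

/* Clone from the per-array pool, analogous to setup_clone()/clone_bio()
 * in dm.c, so raid1 never competes for the global fs_bio_set reserve. */
static struct bio *r1_clone_bio(struct r1conf *conf, struct bio *bio)
{
	struct bio *clone;

	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, conf->bio_set);
	if (clone)
		__bio_clone(clone, bio);	/* copy sector, size and bvecs */
	return clone;
}
```

Of course bioset_free() would be needed on array teardown, and the pool
size would need more thought than the arbitrary number above.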