From mboxrd@z Thu Jan  1 00:00:00 1970
Message-Id: <201801250204.w0P24NKZ033992@www262.sakura.ne.jp>
Subject: Re: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
From: Tetsuo Handa
Date: Thu, 25 Jan 2018 11:04:23 +0900
References: <20171115140020.GA6771@cmpxchg.org>
 <20171115141113.2nw4c4nejermhckb@dhcp22.suse.cz>
In-Reply-To: <20171115141113.2nw4c4nejermhckb@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Michal Hocko, Johannes Weiner, linux-mm@lists.ewheeler.net
Cc: Tetsuo Handa, Minchan Kim, Huang Ying, Mel Gorman,
 Vladimir Davydov, Andrew Morton, Shakeel Butt, Greg Thelen,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org

Michal Hocko wrote:
> On Wed 15-11-17 09:00:20, Johannes Weiner wrote:
> > In any case, Minchan's lock breaking seems way preferable over that
> > level of headscratching complexity for an unusual case like Shakeel's.
>
> agreed! I would go the more complex way only if it turns out that early
> break out causes some real problems.

Eric Wheeler wrote (at
http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net ):
> Hello all,
>
> We are getting processes stuck with /proc/pid/stack listing the following:

Yes, I think that this is a silent OOM lockup.
> [] io_schedule+0x12/0x40
> [] __lock_page+0x105/0x150
> [] pagecache_get_page+0x161/0x210
> [] shmem_unused_huge_shrink+0x334/0x3f0
> [] super_cache_scan+0x176/0x180
> [] shrink_slab+0x275/0x460
> [] shrink_node+0x10e/0x320
> [] node_reclaim+0x19d/0x250
> [] get_page_from_freelist+0x16a/0xac0
> [] __alloc_pages_nodemask+0x107/0x290
> [] pte_alloc_one+0x13/0x40
> [] __pte_alloc+0x19/0x100
> [] alloc_set_pte+0x468/0x4c0
> [] finish_fault+0x3a/0x70
> [] __handle_mm_fault+0x94a/0x1190
> [] handle_mm_fault+0xc4/0x1d0
> [] __do_page_fault+0x253/0x4d0
> [] do_page_fault+0x33/0x120
> [] page_fault+0x4c/0x60
>
> For some reason io_schedule is not coming back, so shrinker_rwsem never
> gets an up_read. When this happens, other processes like libvirt get stuck
> trying to start VMs with the /proc/pid/stack of libvirtd looking like so,
> while register_shrinker waits for shrinker_rwsem to be released:
>
> [] call_rwsem_down_write_failed+0x13/0x20
> [] register_shrinker+0x45/0xa0
> [] sget_userns+0x468/0x4a0
> [] mount_nodev+0x2a/0xa0
> [] mount_fs+0x34/0x150
> [] vfs_kern_mount+0x62/0x120
> [] do_mount+0x1ee/0xc50
> [] SyS_mount+0x7e/0xd0
> [] do_syscall_64+0x61/0x1a0
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0xffffffffffffffff

If io_schedule() depends on somebody else's memory allocation request, that
somebody else will call shrink_slab(), and down_read_trylock(&shrinker_rwsem)
will fail without making progress. That somebody else will then retry forever,
as long as should_continue_reclaim() returns true. I don't know what is
causing should_continue_reclaim() to return true, but nobody will be able to
reclaim memory because down_read_trylock(&shrinker_rwsem) keeps failing
without making progress.
I think that this problem is not yet fixed in linux-next.git.

> Note that we are using zram as our only swap device, but at the time that
> shrink_slab() failed to return, there was plenty of memory available
> and no swap was in use.
>
> The machine is generally responsive, but `sync` will hang forever and our
> only way out is `echo b > /proc/sysrq-trigger`.
>
> Please suggest any additional information you might need for testing, and
> I am happy to try patches.
>
> Thank you for your help!

Pretending we will be able to make progress

	if (!down_read_trylock(&shrinker_rwsem)) {
		/*
		 * If we would return 0, our callers would understand that we
		 * have nothing else to shrink and give up trying. By returning
		 * 1 we keep it going and assume we'll be able to shrink next
		 * time.
		 */
		freed = 1;
		goto out;
	}

can work only if do_shrink_slab() does not depend on somebody else's memory
allocation. I think we should kill the shrinker_rwsem assumption.