From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 12 Dec 2018 10:48:32 +0100
From: Michal Hocko
Subject: Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
Message-ID: <20181212094832.GN1286@dhcp22.suse.cz>
References: <20181211132645.31053-1-mhocko@kernel.org> <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20181212094249.cw4xjrdchqsp2tkt@kshutemo-mobl1>
Sender: owner-linux-mm@kvack.org
To: "Kirill A. Shutemov"
Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o, Johannes Weiner, Vladimir Davydov, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, LKML, Hugh Dickins

On Wed 12-12-18 12:42:49, Kirill A.
Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko
> >
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> >
> > task1:
> > [] wait_on_page_bit+0x82/0xa0
> > [] shrink_page_list+0x907/0x960
> > [] shrink_inactive_list+0x2c7/0x680
> > [] shrink_node_memcg+0x404/0x830
> > [] shrink_node+0xd8/0x300
> > [] do_try_to_free_pages+0x10d/0x330
> > [] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [] try_charge+0x14d/0x720
> > [] memcg_kmem_charge_memcg+0x3c/0xa0
> > [] memcg_kmem_charge+0x7e/0xd0
> > [] __alloc_pages_nodemask+0x178/0x260
> > [] alloc_pages_current+0x95/0x140
> > [] pte_alloc_one+0x17/0x40
> > [] __pte_alloc+0x1e/0x110
> > [] alloc_set_pte+0x5fe/0xc20
> > [] do_fault+0x103/0x970
> > [] handle_mm_fault+0x61e/0xd10
> > [] __do_page_fault+0x252/0x4d0
> > [] do_page_fault+0x30/0x80
> > [] page_fault+0x28/0x30
> > [] 0xffffffffffffffff
> >
> > task2:
> > [] __lock_page+0x86/0xa0
> > [] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [] ext4_writepages+0x479/0xd60
> > [] do_writepages+0x1e/0x30
> > [] __writeback_single_inode+0x45/0x320
> > [] writeback_sb_inodes+0x272/0x600
> > [] __writeback_inodes_wb+0x92/0xc0
> > [] wb_writeback+0x268/0x300
> > [] wb_workfn+0xb4/0x390
> > [] process_one_work+0x189/0x420
> > [] worker_thread+0x4e/0x4b0
> > [] kthread+0xe6/0x100
> > [] ret_from_fork+0x41/0x50
> > [] 0xffffffffffffffff
> >
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> > : bit of the page which task1 has locked.
> >
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg.
> > That in turn hits a memory limit reclaim, and the memcg reclaim for the
> > legacy controller is waiting on the writeback, but that is never going to
> > finish because the writeback itself is waiting for the page locked in the
> > #PF path. So this is essentially an ABBA deadlock.
>
> Side note:
>
> Do we have PG_writeback vs. PG_locked ordering documented somewhere?

I am not aware of any.

> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.
>
> But that is not what I see for many wait_on_page_writeback() users: it is
> usually called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
>
> Maybe the problem is within task2 codepath after all?

Jack and David have explained that this is due to an optimization
multiple filesystems do. They lock and set writeback on multiple pages
and then send a larger IO at once. So in this case we have the
following pattern:

task1 (page fault)                      task2 (writeback)
                                        lock_page(B)
                                        SetPageWriteback(B)
                                        unlock_page(B)
lock_page(A)
                                        lock_page(A)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(B)
                                        SetPageWriteback(A)
                                        unlock_page(A)

                                        # flush A, B to clear the writeback
-- 
Michal Hocko
SUSE Labs
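
For illustration only, the interleaving above can be modelled as a wait-for graph: task1 holds the lock on page A and waits for writeback on page B to clear, while task2 is the only one that can clear that writeback and is itself blocked on the lock of page A. The sketch below is a toy userspace model in Python, not kernel code. The helper name find_wait_cycle and the resource labels are hypothetical and exist only for this example.

```python
# Toy model of the interleaving described in the mail; not kernel code.
# All names here (find_wait_cycle, "lock(A)", "writeback(B)") are
# hypothetical labels chosen for this illustration.

def find_wait_cycle(waits_for, owner):
    """Return the tasks forming a wait-for cycle, or [] if there is none.

    waits_for: task -> resource the task is currently blocked on
    owner:     resource -> task that has to make progress to release it
    """
    for start in waits_for:
        path = []
        task = start
        # Follow "blocked on resource -> task that must release it" edges.
        while task in waits_for and task not in path:
            path.append(task)
            task = owner.get(waits_for[task])
        if task in path:
            return path[path.index(task):]
    return []

# task1: page fault path. It holds the lock on page A, and memcg reclaim
#        under it waits for writeback on page B.
# task2: writeback path. It has set writeback on page B but has not
#        submitted the IO yet, and it is blocked on the lock of page A.
owner = {
    "lock(A)": "task1",
    "writeback(B)": "task2",  # only task2 submitting the IO clears it
}
waits_for = {
    "task1": "writeback(B)",
    "task2": "lock(A)",
}

cycle = find_wait_cycle(waits_for, owner)
print("wait-for cycle:", cycle)

# Hypothetical variant: if nobody held lock(A) while waiting for the
# writeback, task2 could take the lock and the cycle would not form.
no_cycle = find_wait_cycle(waits_for, {"writeback(B)": "task2"})
print("without the held page lock:", no_cycle)
```

The cycle exists only because each task waits on something the other has to release, which is the ABBA shape described in the quoted changelog.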