From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.8 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A4391C43141 for ; Sat, 16 Nov 2019 15:44:16 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 4CE5C20815 for ; Sat, 16 Nov 2019 15:44:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=kernel.org header.i=@kernel.org header.b="q9rhmkTj" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4CE5C20815 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id CF2F26B0005; Sat, 16 Nov 2019 10:44:15 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id CA4F16B0008; Sat, 16 Nov 2019 10:44:15 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BB90A6B000A; Sat, 16 Nov 2019 10:44:15 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0083.hostedemail.com [216.40.44.83]) by kanga.kvack.org (Postfix) with ESMTP id A67956B0005 for ; Sat, 16 Nov 2019 10:44:15 -0500 (EST) Received: from smtpin06.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 596994DCA for ; Sat, 16 Nov 2019 15:44:15 +0000 (UTC) X-FDA: 76162562070.06.low52_7fd0120bd439 X-HE-Tag: low52_7fd0120bd439 X-Filterd-Recvd-Size: 10288 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by imf07.hostedemail.com (Postfix) with ESMTP for ; Sat, 16 Nov 2019 15:44:14 +0000 (UTC) Received: from sasha-vm.mshome.net (unknown [50.234.116.4]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 9E1332073B; Sat, 16 Nov 2019 15:44:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1573919054; bh=0tPbmlMoHUmKUKg4UP3KeQMU9iKDfEP/BR6bOzr6vMA=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=q9rhmkTjaTKSnRykUWYF07tdz/dju/iLYzHM7HQWrCufnztGFEHFVCiW7PO2kU1ki KNwY5/1qlf3eLJfgYu5ZuKVixOdJKWrchhPlY4W6/Hp5bwzOroq66qvTzjJZkrQ4h0 7q1Vow3qxbH576OxqxpeN+lv3liJ5zHessR1O5/c= From: Sasha Levin To: linux-kernel@vger.kernel.org, stable@vger.kernel.org Cc: Dave Chinner , Jan Kara , Nicholas Piggin , Andrew Morton , Linus Torvalds , Sasha Levin , linux-mm@kvack.org Subject: [PATCH AUTOSEL 4.19 130/237] mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock Date: Sat, 16 Nov 2019 10:39:25 -0500 Message-Id: <20191116154113.7417-130-sashal@kernel.org> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20191116154113.7417-1-sashal@kernel.org> References: <20191116154113.7417-1-sashal@kernel.org> MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner [ Upstream commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 ] We've recently seen a workload on XFS filesystems with a repeatable deadlock between background writeback and a multi-process application doing concurrent writes and fsyncs to a small range of a file. range_cyclic writeback Process 1 Process 2 xfs_vm_writepages write_cache_pages writeback_index =3D 2 cycled =3D 0 .... find page 2 dirty lock Page 2 ->writepage page 2 writeback page 2 clean page 2 added to bio no more pages write() locks page 1 dirties page 1 locks page 2 dirties page 1 fsync() .... xfs_vm_writepages write_cache_pages start index 0 find page 1 towrite lock Page 1 ->writepage page 1 writeback page 1 clean page 1 added to bio find page 2 towrite lock Page 2 page 2 is writeback write() locks page 1 dirties page 1 fsync() .... xfs_vm_writepages write_cache_pages start index 0 !done && !cycled sets index to 0, restarts lookup find page 1 dirty find page 1 towrite lock Page 1 page 1 is writeback lock Page 1 DEADLOCK because: - process 1 needs page 2 writeback to complete to make enough progress to issue IO pending for page 1 - writeback needs page 1 writeback to complete so process 2 can progress and unlock the page it is blocked on, then it can issue the IO pending for page 2 - process 2 can't make progress until process 1 issues IO for page 1 The underlying cause of the problem here is that range_cyclic writeback i= s processing pages in descending index order as we hold higher index pages in a structure controlled from above write_cache_pages(). The write_cache_pages() caller needs to be able to submit these pages for IO before write_cache_pages restarts writeback at mapping index 0 to avoid wcp inverting the page lock/writeback wait order. generic_writepages() is not susceptible to this bug as it has no private context held across write_cache_pages() - filesystems using this infrastructure always submit pages in ->writepage immediately and so ther= e is no problem with range_cyclic going back to mapping index 0. However: mpage_writepages() has a private bio context, exofs_writepages() has page_collect fuse_writepages() has fuse_fill_wb_data nfs_writepages() has nfs_pageio_descriptor xfs_vm_writepages() has xfs_writepage_ctx All of these ->writepages implementations can hold pages under writeback in their private structures until write_cache_pages() returns, and hence they are all susceptible to this deadlock. Also worth noting is that ext4 has it's own bastardised version of write_cache_pages() and so it /may/ have an equivalent deadlock. I looke= d at the code long enough to understand that it has a similar retry loop fo= r range_cyclic writeback reaching the end of the file and then promptly ran away before my eyes bled too much. I'll leave it for the ext4 developers to determine if their code is actually has this deadlock and how to fix i= t if it has. There's a few ways I can see avoid this deadlock. There's probably more, but these are the first I've though of: 1. get rid of range_cyclic altogether 2. range_cyclic always stops at EOF, and we start again from writeback index 0 on the next call into write_cache_pages() 2a. wcp also returns EAGAIN to ->writepages implementations to indicate range cyclic has hit EOF. writepages implementations can then flush the current context and call wpc again to continue. i.e. lift the retry into the ->writepages implementation 3. range_cyclic uses trylock_page() rather than lock_page(), and it skips pages it can't lock without blocking. It will already do this for pages under writeback, so this seems like a no-brainer 3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking as per pages under writeback. I don't think #1 is an option - range_cyclic prevents frequently dirtied lower file offset from starving background writeback of rarely touched higher file offsets. #2 is simple, and I don't think it will have any impact on performance as going back to the start of the file implies an immediate seek. We'll have exactly the same number of seeks if we switch writeback to another inode, and then come back to this one later and restart from index 0. #2a is pretty much "status quo without the deadlock". Moving the retry loop up into the wcp caller means we can issue IO on the pending pages before calling wcp again, and so avoid locking or waiting on pages in the wrong order. I'm not convinced we need to do this given that we get the same thing from #2 on the next writeback call from the writeback infrastructure. #3 is really just a band-aid - it doesn't fix the access/wait inversion problem, just prevents it from becoming a deadlock situation. I'd prefer we fix the inversion, not sweep it under the carpet like this. #3a is really an optimisation that just so happens to include the band-aid fix of #3. So it seems that the simplest way to fix this issue is to implement solution #2 Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com Signed-off-by: Dave Chinner Reviewed-by: Jan Kara Cc: Nicholas Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin --- mm/page-writeback.c | 33 +++++++++++++++------------------ 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ea4fd3af3b4bd..43df0c52e1ccb 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2149,6 +2149,13 @@ EXPORT_SYMBOL(tag_pages_for_writeback); * not miss some pages (e.g., because some other process has cleared TOW= RITE * tag we set). The rule we follow is that TOWRITE tag can be cleared on= ly * by the process clearing the DIRTY tag (and submitting the page for IO= ). + * + * To avoid deadlocks between range_cyclic writeback and callers that ho= ld + * pages in PageWriteback to aggregate IO until write_cache_pages() retu= rns, + * we do not loop back to the start of the file. Doing so causes a page + * lock/page writeback access order inversion - we should only ever lock + * multiple pages in ascending page->index order, and looping back to th= e start + * of the file violates that rule and causes deadlocks. */ int write_cache_pages(struct address_space *mapping, struct writeback_control *wbc, writepage_t writepage, @@ -2163,7 +2170,6 @@ int write_cache_pages(struct address_space *mapping= , pgoff_t index; pgoff_t end; /* Inclusive */ pgoff_t done_index; - int cycled; int range_whole =3D 0; int tag; =20 @@ -2171,23 +2177,17 @@ int write_cache_pages(struct address_space *mappi= ng, if (wbc->range_cyclic) { writeback_index =3D mapping->writeback_index; /* prev offset */ index =3D writeback_index; - if (index =3D=3D 0) - cycled =3D 1; - else - cycled =3D 0; end =3D -1; } else { index =3D wbc->range_start >> PAGE_SHIFT; end =3D wbc->range_end >> PAGE_SHIFT; if (wbc->range_start =3D=3D 0 && wbc->range_end =3D=3D LLONG_MAX) range_whole =3D 1; - cycled =3D 1; /* ignore range_cyclic tests */ } if (wbc->sync_mode =3D=3D WB_SYNC_ALL || wbc->tagged_writepages) tag =3D PAGECACHE_TAG_TOWRITE; else tag =3D PAGECACHE_TAG_DIRTY; -retry: if (wbc->sync_mode =3D=3D WB_SYNC_ALL || wbc->tagged_writepages) tag_pages_for_writeback(mapping, index, end); done_index =3D index; @@ -2279,17 +2279,14 @@ int write_cache_pages(struct address_space *mappi= ng, pagevec_release(&pvec); cond_resched(); } - if (!cycled && !done) { - /* - * range_cyclic: - * We hit the last page and there is more work to be done: wrap - * back to the start of the file - */ - cycled =3D 1; - index =3D 0; - end =3D writeback_index - 1; - goto retry; - } + + /* + * If we hit the last page and there is more work to be done: wrap + * back the index back to the start of the file for the next + * time we are called. + */ + if (wbc->range_cyclic && !done) + done_index =3D 0; if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) mapping->writeback_index =3D done_index; =20 --=20 2.20.1