From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 13 Sep 2024 16:51:40 +0100
From: Matthew Wilcox <willy@infradead.org>
To: Chris Mason
Cc: Linus Torvalds, Jens Axboe, Christian Theune, linux-mm@kvack.org,
	"linux-xfs@vger.kernel.org", linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Daniel Dao, Dave Chinner,
	regressions@lists.linux.dev, regressions@leemhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
References: <0fc8c3e7-e5d2-40db-8661-8c7199f84e43@kernel.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Fri, Sep 13, 2024 at 11:30:41AM -0400, Chris Mason wrote:
> I've mentioned this in the past to both Willy and Dave Chinner, but so
> far all of my attempts to reproduce it on purpose have failed.  It's
> awkward because I don't like to send bug reports that I haven't
> reproduced on a non-facebook kernel, but I'm pretty confident this bug
> isn't specific to us.

I don't think the bug is specific to you either.  It's been hit by
several people ... but it's really hard to hit ;-(

> I'll double down on repros again during plumbers and hopefully come up
> with a recipe for explosions.  One other important datapoint is that we

I appreciate the effort!

> The issue looked similar to Christian Theune's rcu stalls, but since it
> was just one CPU spinning away, I was able to perf probe and drgn my way
> to some details.  The xarray for the file had a series of large folios:
>
> [ index 0: large folio from the correct file ]
> [ index 1: large folio from the correct file ]
> ...
> [ index N: large folio from a completely different file ]
> [ index N+1: large folio from the correct file ]
>
> I'm being sloppy with index numbers, but the important part is that
> we've got a large folio from the wrong file in the middle of the bunch.

If you could get the precise index numbers, that would be an important
clue.  It would be interesting to know the index number in the xarray
where the folio was found, rather than folio->index (as I suspect that
folio->index is completely bogus because folio->mapping is wrong).

But gathering that info is going to be hard.  Maybe something like this?

+++ b/mm/filemap.c
@@ -2317,6 +2317,12 @@ static void filemap_get_read_batch(struct address_space *mapping,
 		if (unlikely(folio != xas_reload(&xas)))
 			goto put_folio;
 
+{
+	struct address_space *fmapping = READ_ONCE(folio->mapping);
+	if (fmapping != NULL && fmapping != mapping)
+		printk("bad folio at %lx\n", xas.xa_index);
+}
+
 		if (!folio_batch_add(fbatch, folio))
 			break;
 		if (!folio_test_uptodate(folio))

(Could use VM_BUG_ON_FOLIO() too, but I'm not sure that the identity of
the bad folio we've found is as interesting as where we found it.)
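[Editorial aside: the diff above boils down to "walk the cached entries and
report the *slot index* where an entry's mapping disagrees with the mapping
being iterated".  As a standalone illustration of that check, here is a
minimal userspace C sketch.  The struct and function names (fake_folio,
find_bad_index) are invented for illustration; the real kernel code walks
the page-cache xarray with an xa_state and reports xas.xa_index.]

```c
#include <stddef.h>

/* Stand-in for struct folio: the only field this sketch cares about is
 * the back-pointer to the address_space (file) the entry claims to
 * belong to -- the analogue of folio->mapping. */
struct fake_folio {
	const void *mapping;
};

/*
 * Scan a run of cache slots and return the slot index of the first
 * entry whose mapping disagrees with the mapping we are iterating,
 * or -1 if every entry is consistent.  This mirrors what the printk
 * in the proposed patch reports: the index where the stray folio was
 * found, not the (possibly bogus) folio->index stored in the entry.
 */
static long find_bad_index(const struct fake_folio *slots, size_t n,
			   const void *expected_mapping)
{
	for (size_t i = 0; i < n; i++) {
		/* A NULL mapping (folio being torn down) is skipped,
		 * matching the fmapping != NULL test in the patch. */
		if (slots[i].mapping != NULL &&
		    slots[i].mapping != expected_mapping)
			return (long)i;
	}
	return -1;
}
```

With five slots where slot 3 points at a different file, find_bad_index()
returns 3 -- the position in the array, regardless of what the stray
entry itself believes its index is.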