Date: Thu, 12 Sep 2024 22:55:00 +0100
From: Matthew Wilcox
To: Christian Theune
Cc: linux-mm@kvack.org, linux-xfs@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 torvalds@linux-foundation.org, axboe@kernel.dk, Daniel Dao, Dave Chinner,
 clm@meta.com, regressions@lists.linux.dev, regressions@leemhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1
 upwards)

On Thu, Sep 12, 2024 at 11:18:34PM +0200, Christian Theune wrote:
> This bug is very hard to reproduce but has been known to exist as a
> “fluke” for a while already.
> I have invested a number of days trying
> to come up with workloads to trigger it quicker than that stochastic
> “once every few weeks in a fleet of 1.5k machines”, but it eludes
> me so far. I know that this also affects Facebook/Meta as well as
> Cloudflare who are both running newer kernels (at least 6.1, 6.6,
> and 6.9) with the above mentioned patch reverted. I’m from a much
> smaller company and seeing that those guys are running with this patch
> reverted (that now makes their kernel basically an untested/unsupported
> deviation from the mainline) smells like desperation. I’m with a
> much smaller team and company and I’m wondering why this isn’t
> tackled more urgently from more hands to make it shallow (hopefully).

This passive-aggressive nonsense is deeply aggravating. I've known about
this bug for much longer, but like you I am utterly unable to reproduce
it. I've spent months looking for the bug, and I cannot.