Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
From: Christian Theune
Date: Thu, 19 Sep 2024 12:19:19 +0200
To: Linus Torvalds
Cc: Dave Chinner, Matthew Wilcox, Chris Mason, Jens Axboe, linux-mm@kvack.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Daniel Dao, regressions@lists.linux.dev, regressions@leemhuis.info
References: <74cceb67-2e71-455f-a4d4-6c5185ef775b@meta.com> <52d45d22-e108-400e-a63f-f50ef1a0ae1a@meta.com> <5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk> <459beb1c-defd-4836-952c-589203b7005c@meta.com>

> On 19. Sep 2024, at 08:57, Linus Torvalds wrote:
>
> Yeah, right now Jens is still going to run some more testing, but I
> think the plan is to just backport
>
>   a4864671ca0b ("lib/xarray: introduce a new helper xas_get_order")
>   6758c1128ceb ("mm/filemap: optimize filemap folio adding")
>
> and I think we're at the point where you might as well start testing
> that if you have the cycles for it. Jens is mostly trying to confirm
> the root cause, but even without that, I think you running your load
> with those two changes back-ported is worth it.
>
> (Or even just try running it on plain 6.10 or 6.11, both of which
> already have those commits)

I've discussed this with my team and we're preparing to switch all our
non-prod machines as well as those production machines that have shown
the error before.

This will require a bit of user communication and reboot scheduling.
Our release prep will be able to roll this out starting early next week,
with the production machines in question following around Sept 30.

We would run 6.11, as our understanding so far is that running the most
current kernel would generate the most insight and be easier for you all
to work with? (Generally we run a mostly vanilla LTS kernel once it has
passed x.y.50, so we might later downgrade to 6.6 when this is fixed.)

> So considering how well the reproducer works for Jens and Chris, my
> main worry is whether your load might have some _additional_ issue.
>
> Unlikely, but still .. The two commits fix the reproducer, so I think
> the important thing to make sure is that it really fixes the original
> issue too.
>
> And yeah, I'd be surprised if it doesn't, but at the same time I would
> _not_ suggest you try to make your load look more like the case we
> already know gets fixed.
>
> So yes, it will be "weeks of not seeing crashes" until we'd be
> _really_ confident it's all the same thing, but I'd rather still have
> you test that, than test something else than what caused issues
> originally, if you see what I mean.

Agreed, I'm all on board with that.

Kind regards,
Christian Theune

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick