Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
From: Christian Theune
Date: Thu, 10 Oct 2024 08:29:14 +0200
To: Chris Mason
Cc: Linus Torvalds, Dave Chinner, Matthew Wilcox, Jens Axboe, linux-mm@kvack.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Daniel Dao, regressions@lists.linux.dev, regressions@leemhuis.info

> On 1. Oct 2024, at 02:56, Chris Mason wrote:
>
> Not disagreeing with Linus at all, but given that you've got IO
> throttling too, we might really just be waiting.
> It's hard to tell because the hung task timeouts only give you
> information about one process.
>
> I've attached a minimal version of a script we use here to show all the
> D state processes, it might help explain things. The only problem is
> you have to actually ssh to the box and run it when you're stuck.
>
> The idea is to print the stack trace of every D state process, and then
> also print out how often each unique stack trace shows up. When we're
> deadlocked on something, there are normally a bunch of the same stack
> (say waiting on writeback) and then one jerk sitting around in a
> different stack who is causing all the trouble.

I think I should be able to trigger this. I’ve seen around 100 of those issues over the last week, and the chance of it happening correlates with a certain workload that should be easy to trigger. Also, the condition persists for around 5 minutes, so I should be able to trace it when I see the alert in an interactive session.

I’ve verified I can run your script and I’ll get back to you in the next few days.

Christian

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
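
The attached script is not reproduced in this message. As a rough illustration of the idea described in the quoted text (print the stack of every D state process and count how often each unique stack appears), the following is a minimal sketch. It is not Chris Mason's script. It assumes the Linux procfs layout, where the state is the field after the parenthesised command name in /proc/<pid>/stat and the kernel stack is readable (normally only as root) from /proc/<pid>/stack. The function names and the `proc_root` parameter are hypothetical, chosen for illustration, and the sketch looks only at processes, not at individual threads.

```python
import os
from collections import Counter


def task_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The format is "pid (comm) state ...". comm may itself contain spaces
    # or parentheses, so split after the last ')'.
    return stat_text.rsplit(")", 1)[1].split()[0]


def d_state_stacks(proc_root="/proc"):
    """Yield (pid, stack_text) for every process currently in D state."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                if task_state(f.read()) != "D":
                    continue
            with open(os.path.join(base, "stack")) as f:
                yield int(entry), f.read().strip()
        except (OSError, IndexError):
            # The process exited, the file is unreadable, or stat was malformed.
            continue


def summarize(proc_root="/proc"):
    """Return a Counter mapping each unique stack to the number of tasks in it."""
    return Counter(stack for _pid, stack in d_state_stacks(proc_root))
```

Run as root while the machine is stuck, `summarize().most_common()` lists the stacks from most to least frequent. Following the reasoning in the quoted text, the interesting entry is often one of the rare stacks at the end of that list rather than the common one.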