From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
From: Christian Theune <ct@flyingcircus.io>
Date: Mon, 30 Sep 2024 19:34:39 +0200
To: Linus Torvalds
Cc: Dave Chinner, Matthew Wilcox, Chris Mason, Jens Axboe,
 linux-mm@kvack.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, Daniel Dao, regressions@lists.linux.dev,
 regressions@leemhuis.info
Message-Id: <02121707-E630-4E7E-837B-8F53B4C28721@flyingcircus.io>
References: <74cceb67-2e71-455f-a4d4-6c5185ef775b@meta.com>
 <52d45d22-e108-400e-a63f-f50ef1a0ae1a@meta.com>
 <5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk>
 <459beb1c-defd-4836-952c-589203b7005c@meta.com>

Hi,

we've been running a number of VMs on 6.11 since last week. We have
encountered one hung-task situation multiple times now that appears to
resolve itself after a bit of time, though. I do not see any spinning
CPU during that time.

So far the situation appears to be related to cgroup-based IO
throttling / weighting: below are three examples of similar tracebacks
from jobs that perform a certain amount of IO and are either given an
IO weight or an explicit limit like this:

	IOWeight=10
	IOReadIOPSMax=/dev/vda 188
	IOWriteIOPSMax=/dev/vda 188

Telemetry for the affected VM does not show that it actually reaches
188 IOPS (the load is mostly writing) but rather shows a kind of
Gaussian curve …

The underlying storage and network were completely inconspicuous during
the whole time.
Sep 27 00:51:20 13 kernel: INFO: task nix-build:5300 blocked for more than 122 seconds.
Sep 27 00:51:20 13 kernel:       Not tainted 6.11.0 #1-NixOS
Sep 27 00:51:20 13 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 27 00:51:20 13 kernel: task:nix-build state:D stack:0 pid:5300 tgid:5298 ppid:5297 flags:0x00000002
Sep 27 00:51:20 13 kernel: Call Trace:
Sep 27 00:51:20 13 kernel:  <TASK>
Sep 27 00:51:20 13 kernel:  __schedule+0x3a3/0x1300
Sep 27 00:51:20 13 kernel:  ? xfs_vm_writepages+0x67/0x90 [xfs]
Sep 27 00:51:20 13 kernel:  schedule+0x27/0xf0
Sep 27 00:51:20 13 kernel:  io_schedule+0x46/0x70
Sep 27 00:51:20 13 kernel:  folio_wait_bit_common+0x13f/0x340
Sep 27 00:51:20 13 kernel:  ? __pfx_wake_page_function+0x10/0x10
Sep 27 00:51:20 13 kernel:  folio_wait_writeback+0x2b/0x80
Sep 27 00:51:20 13 kernel:  __filemap_fdatawait_range+0x80/0xe0
Sep 27 00:51:20 13 kernel:  filemap_write_and_wait_range+0x85/0xb0
Sep 27 00:51:20 13 kernel:  xfs_setattr_size+0xd9/0x3c0 [xfs]
Sep 27 00:51:20 13 kernel:  xfs_vn_setattr+0x81/0x150 [xfs]
Sep 27 00:51:20 13 kernel:  notify_change+0x2ed/0x4f0
Sep 27 00:51:20 13 kernel:  ? do_truncate+0x98/0xf0
Sep 27 00:51:20 13 kernel:  do_truncate+0x98/0xf0
Sep 27 00:51:20 13 kernel:  do_ftruncate+0xfe/0x160
Sep 27 00:51:20 13 kernel:  __x64_sys_ftruncate+0x3e/0x70
Sep 27 00:51:20 13 kernel:  do_syscall_64+0xb7/0x200
Sep 27 00:51:20 13 kernel:  entry_SYSCALL_64_after_hwframe+0x77/0x7f
Sep 27 00:51:20 13 kernel: RIP: 0033:0x7f1ed1912c2b
Sep 27 00:51:20 13 kernel: RSP: 002b:00007f1eb73fd3f8 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
Sep 27 00:51:20 13 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ed1912c2b
Sep 27 00:51:20 13 kernel: RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000012
Sep 27 00:51:20 13 kernel: RBP: 0000000000000012 R08: 0000000000000000 R09: 00007f1eb73fd3a0
Sep 27 00:51:20 13 kernel: R10: 0000000000132000 R11: 0000000000000246 R12: 00005601d0150290
Sep 27 00:51:20 13 kernel: R13: 00005601d58ae0b8 R14: 0000000000000001 R15: 00005601d58bec58
Sep 27 00:51:20 13 kernel:  </TASK>

Sep 28 10:13:04 release2405dev00 kernel: INFO: task nix-channel:507080 blocked for more than 122 seconds.
Sep 28 10:13:04 release2405dev00 kernel:       Not tainted 6.11.0 #1-NixOS
Sep 28 10:13:04 release2405dev00 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 28 10:13:04 release2405dev00 kernel: task:nix-channel state:D stack:0 pid:507080 tgid:507080 ppid:507061 flags:0x00000002
Sep 28 10:13:04 release2405dev00 kernel: Call Trace:
Sep 28 10:13:04 release2405dev00 kernel:  <TASK>
Sep 28 10:13:04 release2405dev00 kernel:  __schedule+0x3a3/0x1300
Sep 28 10:13:04 release2405dev00 kernel:  ? xfs_vm_writepages+0x67/0x90 [xfs]
Sep 28 10:13:04 release2405dev00 kernel:  schedule+0x27/0xf0
Sep 28 10:13:04 release2405dev00 kernel:  io_schedule+0x46/0x70
Sep 28 10:13:04 release2405dev00 kernel:  folio_wait_bit_common+0x13f/0x340
Sep 28 10:13:04 release2405dev00 kernel:  ? __pfx_wake_page_function+0x10/0x10
Sep 28 10:13:04 release2405dev00 kernel:  folio_wait_writeback+0x2b/0x80
Sep 28 10:13:04 release2405dev00 kernel:  __filemap_fdatawait_range+0x80/0xe0
Sep 28 10:13:04 release2405dev00 kernel:  file_write_and_wait_range+0x88/0xb0
Sep 28 10:13:04 release2405dev00 kernel:  xfs_file_fsync+0x5e/0x2a0 [xfs]
Sep 28 10:13:04 release2405dev00 kernel:  __x64_sys_fdatasync+0x52/0x90
Sep 28 10:13:04 release2405dev00 kernel:  do_syscall_64+0xb7/0x200
Sep 28 10:13:04 release2405dev00 kernel:  entry_SYSCALL_64_after_hwframe+0x77/0x7f
Sep 28 10:13:04 release2405dev00 kernel: RIP: 0033:0x7f5b9371270a
Sep 28 10:13:04 release2405dev00 kernel: RSP: 002b:00007ffd678149f0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Sep 28 10:13:04 release2405dev00 kernel: RAX: ffffffffffffffda RBX: 0000559a4d023a18 RCX: 00007f5b9371270a
Sep 28 10:13:04 release2405dev00 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000007
Sep 28 10:13:04 release2405dev00 kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000559a4d027878
Sep 28 10:13:04 release2405dev00 kernel: R10: 0000000000000016 R11: 0000000000000293 R12: 0000000000000001
Sep 28 10:13:04 release2405dev00 kernel: R13: 000000000000002e R14: 0000559a4d0278fc R15: 00007ffd67814bf0
Sep 28 10:13:04 release2405dev00 kernel:  </TASK>

Sep 28 03:39:19 10 kernel: INFO: task nix-build:94696 blocked for more than 122 seconds.
Sep 28 03:39:19 10 kernel:       Not tainted 6.11.0 #1-NixOS
Sep 28 03:39:19 10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 28 03:39:19 10 kernel: task:nix-build state:D stack:0 pid:94696 tgid:94696 ppid:94695 flags:0x00000002
Sep 28 03:39:19 10 kernel: Call Trace:
Sep 28 03:39:19 10 kernel:  <TASK>
Sep 28 03:39:19 10 kernel:  __schedule+0x3a3/0x1300
Sep 28 03:39:19 10 kernel:  schedule+0x27/0xf0
Sep 28 03:39:19 10 kernel:  io_schedule+0x46/0x70
Sep 28 03:39:19 10 kernel:  folio_wait_bit_common+0x13f/0x340
Sep 28 03:39:19 10 kernel:  ? __pfx_wake_page_function+0x10/0x10
Sep 28 03:39:19 10 kernel:  folio_wait_writeback+0x2b/0x80
Sep 28 03:39:19 10 kernel:  truncate_inode_partial_folio+0x5e/0x1b0
Sep 28 03:39:19 10 kernel:  truncate_inode_pages_range+0x1de/0x400
Sep 28 03:39:19 10 kernel:  evict+0x29f/0x2c0
Sep 28 03:39:19 10 kernel:  ? iput+0x6e/0x230
Sep 28 03:39:19 10 kernel:  ? _atomic_dec_and_lock+0x39/0x50
Sep 28 03:39:19 10 kernel:  do_unlinkat+0x2de/0x330
Sep 28 03:39:19 10 kernel:  __x64_sys_unlink+0x3f/0x70
Sep 28 03:39:19 10 kernel:  do_syscall_64+0xb7/0x200
Sep 28 03:39:19 10 kernel:  entry_SYSCALL_64_after_hwframe+0x77/0x7f
Sep 28 03:39:19 10 kernel: RIP: 0033:0x7f37c062d56b
Sep 28 03:39:19 10 kernel: RSP: 002b:00007fff71638018 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
Sep 28 03:39:19 10 kernel: RAX: ffffffffffffffda RBX: 0000562038c30500 RCX: 00007f37c062d56b
Sep 28 03:39:19 10 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000562038c31c80
Sep 28 03:39:19 10 kernel: RBP: 0000562038c30690 R08: 0000000000016020 R09: 0000000000000000
Sep 28 03:39:19 10 kernel: R10: 0000000000000050 R11: 0000000000000206 R12: 00007fff71638058
Sep 28 03:39:19 10 kernel: R13: 00007fff7163803c R14: 00007fff71638960 R15: 0000562040b8a500
Sep 28 03:39:19 10 kernel:  </TASK>
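If more detail would help the next time this happens, we can make the
hung-task reporting more aggressive and dump the blocked tasks on
demand. A sketch using the standard sysctl/SysRq interfaces (nothing
site-specific; assumes CONFIG_DETECT_HUNG_TASK and sysrq enabled):

	# warn after 60s instead of the default 120s
	sysctl -w kernel.hung_task_timeout_secs=60
	# do not stop reporting after the first few warnings
	sysctl -w kernel.hung_task_warnings=-1
	# dump all uninterruptible (D-state) tasks right now
	echo w > /proc/sysrq-trigger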
Hope this helps,
Christian

> On 19. Sep 2024, at 12:19, Christian Theune wrote:
> 
>> On 19. Sep 2024, at 08:57, Linus Torvalds wrote:
>> 
>> Yeah, right now Jens is still going to run some more testing, but I
>> think the plan is to just backport
>> 
>>   a4864671ca0b ("lib/xarray: introduce a new helper xas_get_order")
>>   6758c1128ceb ("mm/filemap: optimize filemap folio adding")
>> 
>> and I think we're at the point where you might as well start testing
>> that if you have the cycles for it. Jens is mostly trying to confirm
>> the root cause, but even without that, I think you running your load
>> with those two changes back-ported is worth it.
>> 
>> (Or even just try running it on plain 6.10 or 6.11, both of which
>> already have those commits)
> 
> I've discussed this with my team and we're preparing to switch all our
> non-prod machines as well as those production machines that have shown
> the error before.
> 
> This will require a bit of user communication and reboot scheduling.
> Our release prep will be able to roll this out starting early next week
> and the production machines in question around Sept 30.
> 
> We would run with 6.11, as our understanding so far is that running the
> most current kernel would generate the most insight and would be easiest
> for you all to work with?
> 
> (Generally we run the mostly vanilla LTS that has surpassed x.y.50+, so
> we might later downgrade to 6.6 when this is fixed.)
> 
>> So considering how well the reproducer works for Jens and Chris, my
>> main worry is whether your load might have some _additional_ issue.
>> 
>> Unlikely, but still .. The two commits fix the reproducer, so I think
>> the important thing to make sure is that it really fixes the original
>> issue too.
>> 
>> And yeah, I'd be surprised if it doesn't, but at the same time I would
>> _not_ suggest you try to make your load look more like the case we
>> already know gets fixed.
>> 
>> So yes, it will be "weeks of not seeing crashes" until we'd be
>> _really_ confident it's all the same thing, but I'd rather still have
>> you test that, than test something else than what caused the issues
>> originally, if you see what I mean.
> 
> Agreed, I'm all on board with that.
> 
> Kind regards,
> Christian Theune
> 
> -- 
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

Kind regards,
Christian Theune

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick