Message-ID: <0fc8c3e7-e5d2-40db-8661-8c7199f84e43@kernel.dk>
Date: Thu, 12 Sep 2024 16:12:36 -0600
From: Jens Axboe
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
To: Matthew Wilcox, Christian Theune
Cc: linux-mm@kvack.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, Daniel Dao, Dave Chinner, clm@meta.com, regressions@lists.linux.dev, regressions@leemhuis.info
On 9/12/24 3:55 PM, Matthew Wilcox wrote:
> On Thu, Sep 12, 2024 at 11:18:34PM +0200, Christian Theune wrote:
>> This bug is very hard to reproduce but has been known to exist as a
>> "fluke" for a while already. I have invested a number of days trying
>> to come up with workloads to trigger it quicker than that stochastic
>> "once every few weeks in a fleet of 1.5k machines", but it eludes
>> me so far. I know that this also affects Facebook/Meta as well as
>> Cloudflare who are both running newer kernels (at least 6.1, 6.6,
>> and 6.9) with the above mentioned patch reverted. I'm from a much
>> smaller company and seeing that those guys are running with this patch
>> reverted (that now makes their kernel basically an untested/unsupported
>> deviation from the mainline) smells like desperation. I'm with a
>> much smaller team and company and I'm wondering why this isn't
>> tackled more urgently from more hands to make it shallow (hopefully).
>
> This passive-aggressive nonsense is deeply aggravating. I've known
> about this bug for much longer, but like you I am utterly unable to
> reproduce it. I've spent months looking for the bug, and I cannot.

What passive aggressiveness?! There's a data corruption bug where we
know what causes it, yet we continue to ship it. That's aggravating.
People are aware of the bug, and since there's no good reproducer, it's
hard to fix. That part is fine and understandable. What seems amiss
here is that large folio support for XFS hasn't simply been reverted
until the issue is understood and resolved.

When I saw Christian's report, I seemed to recall that we ran into this
at Meta too. And we did, so we have been reverting it since our 5.19
release (and therefore also in 6.4, 6.9, and the upcoming 6.11). We
should not be shipping things that are known broken.

--
Jens Axboe