Date: Mon, 9 Sep 2024 13:29:09 -0600
From: Keith Busch
To: Robert Beckett
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg, Andrew Morton, linux-mm
Subject: Re: possible regression fs corruption on 64GB nvme
References: <191d810a4e3.fcc6066c765804.973611676137075390@collabora.com>
In-Reply-To: <191d810a4e3.fcc6066c765804.973611676137075390@collabora.com>

On Mon, Sep 09, 2024 at 07:34:15PM +0100, Robert Beckett wrote:
> After a lot of testing, we managed to get a repro case that would trigger
> within 2-3 tests using the desync tool [2], reducing the repro time from a
> day or more to minutes. For repro steps see [3].
>
> We bisected the issue to
>
> da9619a30e73b dmapool: link blocks across pages
> https://lore.kernel.org/all/20230126215125.4069751-12-kbusch@meta.com/T/#u

That's not the patch that was ultimately committed. Still, that's the one I
tested extensively with nvme, so the updated one shouldn't make a difference
for protocol.

> Some other thoughts about the issue:
>
> - we have received reports of occasional filesystem corruptions on btrfs and
>   ext4 filesystems on the same disk, so this doesn't appear to be fs related
> - it only seems to affect these 64GB single queue simple disks. Other
>   devices with more capable disks have not shown this issue.
> - using simple dd or md5sum testing does not show the issue. desync seems
>   to be very parallel in its attack patterns.
> - I was investigating a previous potential regression that was deemed not
>   an issue https://lkml.org/lkml/2023/2/21/762 . I assume nvme doesn't need
>   its addresses to be ordered. I'm not familiar with the spec.

nvme should not care about address ordering. The dma buffers are all pulled
from the same pool for all threads, and could be dispatched in different
orders than what was allocated, so any order should be fine.

> I'd appreciate any advice you may have on why this dmapool patch could
> potentially cause or expose an issue with these nvme devices.
> If any more info would be useful to help diagnose, I'll happily provide it.

Did you try with CONFIG_SLUB_DEBUG_ON enabled?