From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 24 Sep 2024 16:58:36 +0100
From: Matthew Wilcox <willy@infradead.org>
To: Chris Mason
Cc: Jens Axboe, Linus Torvalds, Dave Chinner, Christian Theune,
	linux-mm@kvack.org, "linux-xfs@vger.kernel.org",
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Daniel Dao, regressions@lists.linux.dev, regressions@leemhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)
References: <5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk>
 <459beb1c-defd-4836-952c-589203b7005c@meta.com>
 <8697e349-d22f-43a0-8469-beb857eb44a1@kernel.dk>
 <0a3b09db-23e8-4a06-85f8-a0d7bbc3228b@meta.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <0a3b09db-23e8-4a06-85f8-a0d7bbc3228b@meta.com>
On Fri, Sep 20, 2024 at 03:54:55PM +0200, Chris Mason wrote:
> On 9/19/24 12:36 AM, Matthew Wilcox wrote:
> > My brain is still mushy, but I think there is still a problem (both with
> > the simple fix for 6.9 and indeed with 6.10).
> > 
> > For splitting a folio, we have the folio locked, so we know it's not
> > going anywhere.  The tree may get rearranged around it while we don't
> > have the xa_lock, but we're somewhat protected.
> > 
> > In this case we're splitting something that was, at one point, a shadow
> > entry.  There's no struct there to lock.  So I think we can have a
> > situation where we replicate 'old' (in 6.10) or xa_load() (in 6.9)
> > into the nodes we allocate in xas_split_alloc().  In 6.10, that's at
> > least guaranteed to be a shadow entry, but in 6.9, it might already be a
> > folio by this point because we've raced with something else also doing a
> > split.
> > 
> > Probably xas_split_alloc() needs to just do the alloc, like the name
> > says, and drop the 'entry' argument.  ICBW, but I think it explains
> > what you're seeing?  Maybe it doesn't?
> 
> Jens and I went through a lot of iterations making the repro more
> reliable, and we were able to pretty consistently show a UAF with the
> debug code that Willy suggested:
> 
> 	XA_NODE_BUG_ON(xas->xa_alloc, memchr_inv(&xas->xa_alloc->slots, 0,
> 		       sizeof(void *) * XA_CHUNK_SIZE));
> 
> But I didn't really catch what Willy was saying about xas_split_alloc()
> until this morning.
> 
> xas_split_alloc() does the allocation and also shoves an entry into
> some of the slots.  When the tree changes, the entry we've stored is
> wildly wrong, but xas_reset() doesn't undo any of that.  So when we
> actually use the xas->xa_alloc nodes we've set up, they are pointing
> to the wrong things.
> 
> Which is probably why the commits in 6.10 added this:
> 
> 	/* entry may have changed before we re-acquire the lock */
> 	if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
> 		xas_destroy(&xas);
> 		alloced_order = 0;
> 	}
> 
> The only way to undo the work done by xas_split_alloc() is to call
> xas_destroy().  I hadn't fully understood this until today.

Here's what the code in 6.9 did (grossly simplified):

	do {
		unsigned int order = xa_get_order(xas.xa, xas.xa_index);

		if (order > folio_order(folio))
			xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
					order, gfp);
		xas_lock_irq(&xas);
		if (old) {
			order = xa_get_order(xas.xa, xas.xa_index);
			if (order > folio_order(folio))
				xas_split(&xas, old, order);
		}
		xas_store(&xas, folio);
		xas_unlock_irq(&xas);
	} while (xas_nomem(&xas, gfp));

The intent was that xas_store() would use the node allocated by
xas_nomem() and xas_split() would use the nodes allocated by
xas_split_alloc().  That doesn't end up happening if the split already
happened before we got the lock.  So if we were looking for a minimal
fix for pre-6.10, calling xas_destroy() when we don't call xas_split()
would fix the problem.  But I think we're better off backporting the
6.10 patches.

For 6.12, I'm going to put this in -next:

http://git.infradead.org/?p=users/willy/xarray.git;a=commitdiff;h=6684aba0780da9f505c202f27e68ee6d18c0aa66

and then send it to Linus in a couple of weeks as an "obviously correct"
bit of hardening.  We really should have called xas_reset() before
retaking the lock.

Beyond that, I really want to revisit how, when and what we split.
A few months ago we came to the realisation that splitting an order-9
folio into 512 order-0 folios was just legacy thinking.  What each user
really wants is to specify a precise page and say "I want this page to
end up in a folio of order N" (where N is smaller than the order of the
folio it's currently in).
That is, if we truncate a file which is currently a multiple of 2MB in
size to one which has a tail of, say, 0x13377ea bytes, we'd want to
create a 1MB folio which we leave at the end of the file, then a 512kB
folio which we free, then a 256kB folio which we keep, a 128kB folio
which we discard, a 64kB folio which we discard, ...

So we need to do that first; then all this code becomes way easier and
xas_split_alloc() no longer needs to fill in the node at the wrong time.
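Coming back to the minimal pre-6.10 fix mentioned earlier: in the
simplified loop, it amounts to calling xas_destroy() on the path where
we allocated for a split but found, under the lock, that the split had
already happened.  A rough sketch only (not compiled; backporting the
6.10 patches is still the better option):

	xas_lock_irq(&xas);
	if (old) {
		order = xa_get_order(xas.xa, xas.xa_index);
		if (order > folio_order(folio))
			xas_split(&xas, old, order);
		else
			/* raced: drop the stale preallocated nodes */
			xas_destroy(&xas);
	}
	xas_store(&xas, folio);
	xas_unlock_irq(&xas);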