Date: Thu, 10 Apr 2025 14:05:38 +0200
From: Jan Kara
To: Luis Chamberlain
Cc: brauner@kernel.org, jack@suse.cz, tytso@mit.edu,
	adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org, riel@surriel.com,
	dave@stgolabs.net, willy@infradead.org, hannes@cmpxchg.org,
	oliver.sang@intel.com, david@redhat.com, axboe@kernel.dk, hare@suse.de,
	david@fromorbit.com, djwong@kernel.org, ritesh.list@gmail.com,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-mm@kvack.org, gost.dev@samsung.com, p.raghav@samsung.com,
	da.gomez@samsung.com,
	syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com
Subject: Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
References: <20250410014945.2140781-1-mcgrof@kernel.org> <20250410014945.2140781-2-mcgrof@kernel.org>
In-Reply-To: <20250410014945.2140781-2-mcgrof@kernel.org>

On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> Filesystems which use buffer-heads where it cannot guarantee that there
> are no other references to the folio, for example with a folio
> lock, must use buffer_migrate_folio_norefs() for the address space
> mapping migrate_folio() callback.
> There are only 3 filesystems which use
> this callback:
>
> 1) the block device cache

Well, but through this it also covers all simple filesystems that use
buffer_heads for metadata handling...

> 2) ext4 for its ext4_journalled_aops, ie, jbd2
> 3) nilfs2
>
> jbd2's use of this callback however is very race prone, consider
> folio migration while reviewing jbd2_journal_write_metadata_buffer()
> and the fact that jbd2:
>
> - does not hold the folio lock
> - does not have page writeback bit set
> - does not lock the buffer
>
> And so, it can race with folio_set_bh() on folio migration. The commit
> ebdf4de5642fb6 ("mm: migrate: fix reference check race between
> __find_get_block() and migration") added a spin lock to prevent races
> with page migration which ext4 users were reporting through the SUSE
> bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
> original filesystem corruption we can reproduce fs corruption on
> ext4 by just removing the spinlock and stress testing the filesystem
> with generic/750, we eventually end up after 3 hours of testing with
> kdevops using libvirt on the ext4 profiles ext4-4k and ext4-2k.

Correct, jbd2 holds a bh reference (its private jh structure attached to
bh->b_private holds it) and that is expected to protect jbd2 from anybody
else mucking with the bh.

> It turns out that the spin lock doesn't in the end protect against
> corruption, it *helps* reduce the possibility, but ext4 filesystem
> corruption can still happen even with the spin lock held. A test was
> done using vanilla Linux and adding a udelay(2000) right before we
> spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
> we can reproduce the same exact filesystem corruption issues as observed
> without the spinlock with generic/750 [1].

This is unexpected.
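
To spell out the expectation discussed above, here is a minimal userspace
sketch of the protocol. It is only a model: the model_* names, the pthread
mutex standing in for i_private_lock, and the simplified struct are all
assumptions for illustration, not kernel code. The lookup side takes its
reference under the lock, and the migration side refuses to proceed while
any buffer in the folio's ring is still referenced.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace stand-in for struct buffer_head; not the kernel definition. */
struct model_bh {
	atomic_int b_count;           /* reference count, like bh->b_count */
	struct model_bh *b_this_page; /* circular list of buffers in one folio */
};

/* Stand-in for mapping->i_private_lock (an assumption of this model). */
static pthread_mutex_t model_private_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lookup side: the reference is taken while the lock is held. */
static void model_get_bh_locked(struct model_bh *bh)
{
	pthread_mutex_lock(&model_private_lock);
	atomic_fetch_add(&bh->b_count, 1);
	pthread_mutex_unlock(&model_private_lock);
}

/* Drop a reference previously taken with model_get_bh_locked(). */
static void model_put_bh(struct model_bh *bh)
{
	assert(atomic_load(&bh->b_count) > 0);
	atomic_fetch_sub(&bh->b_count, 1);
}

/* Migration side: back off with -EAGAIN while any buffer is referenced. */
static int model_migrate_norefs(struct model_bh *head)
{
	struct model_bh *bh = head;
	int ret = 0;

	pthread_mutex_lock(&model_private_lock);
	do {
		if (atomic_load(&bh->b_count) != 0) {
			ret = -EAGAIN;
			break;
		}
		bh = bh->b_this_page;
	} while (bh != head);
	/* A real migration would move the buffers here, still under the lock. */
	pthread_mutex_unlock(&model_private_lock);
	return ret;
}
```

In this model a holder that pinned a buffer before the check always makes
the migration back off, which is why corruption with the lock in place
points at some path that touches the bh without going through this protocol.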
> ** Reproduced on vanilla Linux with udelay(2000) **
>
> Call trace (ENOSPC journal failure):
> do_writepages()
> → ext4_do_writepages()
> → ext4_map_blocks()
> → ext4_ext_map_blocks()
> → ext4_ext_insert_extent()
> → __ext4_handle_dirty_metadata()
> → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)

Curious. Did you try running e2fsck after the filesystem complained like
this? This complains about the journal handle not having enough credits for
the needed metadata update. Either we've lost some update to the
journal_head structure (b_modified got accidentally cleared) or some update
to the extent tree.

> And so jbd2 still needs more work to avoid races with folio migration.
> So replace the current spin lock solution by just skipping jbd buffers
> on folio migration. We identify jbd buffers as it's the only user of
> set_buffer_meta() on __ext4_handle_dirty_metadata(). By checking for
> buffer_meta() and bailing on migration we fix the existing racy ext4
> corruption while also removing the spin lock to be held while sleeping
> complaints originally reported by 0-day [5], and paves the way for
> buffer-heads for more users of large folios other than the block
> device cache.

I think we need to understand why private_lock protection does not protect
bh users holding a reference like jbd2 from folio migration before papering
over this problem with the hack. Because there are chances other simple
filesystems suffer from the same problem...
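
To illustrate the credit explanation for the ENOSPC above, here is a small
userspace sketch of how a lost b_modified update could surface as -ENOSPC.
The model_* types and the single credits counter are hypothetical
simplifications, not the jbd2 structures; the only behaviour modelled is
that a buffer is charged to the handle the first time it is dirtied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle; not the jbd2 handle_t. */
struct model_handle {
	int credits;    /* buffers this handle may still dirty */
};

/* Hypothetical stand-in for struct journal_head. */
struct model_jh {
	int b_modified; /* set once the buffer was charged to the handle */
};

/* Charge one credit the first time a buffer is dirtied under the handle. */
static int model_dirty_metadata(struct model_handle *h, struct model_jh *jh)
{
	assert(h != NULL && jh != NULL);

	if (jh->b_modified)
		return 0;       /* already accounted for in this handle */
	if (h->credits <= 0)
		return -ENOSPC; /* handle has no credits left */
	jh->b_modified = 1;
	h->credits--;
	return 0;
}
```

With one credit, dirtying the same buffer twice succeeds both times. If
b_modified is cleared behind the handle's back between the two calls, the
second call tries to charge the buffer again and fails with -ENOSPC.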
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f3ee6d8d5e2e..32fa72ba10b4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
>  
> +	if (buffer_meta(head))
> +		return -EAGAIN;
> +
>  	if (!buffer_migrate_lock_buffers(head, mode))
>  		return -EAGAIN;
>  
> @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  			}
>  			bh = bh->b_this_page;
>  		} while (bh != head);
> +		spin_unlock(&mapping->i_private_lock);

No, you've just broken all simple filesystems (like ext2) with this patch.
You can reduce the spinlock critical section only after providing an
alternative way to protect them from migration. So this should probably
happen at the end of the series.

								Honza
-- 
Jan Kara
SUSE Labs, CR