linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Luis Chamberlain <mcgrof@kernel.org>
To: brauner@kernel.org, jack@suse.cz, tytso@mit.edu,
	adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org,
	riel@surriel.com
Cc: dave@stgolabs.net, willy@infradead.org, hannes@cmpxchg.org,
	oliver.sang@intel.com, david@redhat.com, axboe@kernel.dk,
	hare@suse.de, david@fromorbit.com, djwong@kernel.org,
	ritesh.list@gmail.com, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	gost.dev@samsung.com, p.raghav@samsung.com, da.gomez@samsung.com,
	mcgrof@kernel.org,
	syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com
Subject: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
Date: Wed,  9 Apr 2025 18:49:38 -0700	[thread overview]
Message-ID: <20250410014945.2140781-2-mcgrof@kernel.org> (raw)
In-Reply-To: <20250410014945.2140781-1-mcgrof@kernel.org>

Filesystems which use buffer-heads where it cannot guarantees that there
are no other references to the folio, for example with a folio
lock, must use buffer_migrate_folio_norefs() for the address space
mapping migrate_folio() callback. There are only 3 filesystems which use
this callback:

  1) the block device cache
  2) ext4 for its ext4_journalled_aops, ie, jbd2
  3) nilfs2

jbd2's use of this however callback however is very race prone, consider
folio migration while reviewing jbd2_journal_write_metadata_buffer()
and the fact that jbd2:

  - does not hold the folio lock
  - does not have have page writeback bit set
  - does not lock the buffer

And so, it can race with folio_set_bh() on folio migration. The commit
ebdf4de5642fb6 ("mm: migrate: fix reference  check race between
__find_get_block() and migration") added a spin lock to prevent races
with page migration which ext4 users were reporting through the SUSE
bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
original filesystem corruption we can can reproduce fs corruption on
ext4 by just removing the spinlock and stress testing the filesystem
with generic/750, we eventually end up after 3 hours of testing with
kdevops using libvirt on the ext4 profiles ext4-4k and ext4-2k. It
turns out that the spin lock doesn't in the end protect against
corruption, it *helps* reduce the possibility, but ext4 filesystem
corruption can still happen even with the spin lock held. A test was
done using vanilla Linux and adding a udelay(2000) right before we
spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
we can reproduce the same exact filesystem corruption issues as observed
without the spinlock with generic/750 [1].

We now have a slew of traces collected for the ext4 corruptions possible
without the current spin lock held [2] [3] [4] but the general pattern
is as follows, as summarized by ChatGPT from all traces:

do_writepages() # write back -->
   ext4_map_block() # performs logical to physical block mapping -->
     ext4_ext_insert_extent() # updates extent tree -->
       jbd2_journal_dirty_metadata()  # marks metadata as dirty for
                                      # journaling. This can lead
                                      # to any of the following hints
                                      # as to what happened from
                                      # ext4 / jbd2

         - Directory and extent metadata corruption splats or

         - Filure to handle out-of-space conditions gracefully, with
           cascading metadata errors and eventual filesystem shutdown
           to prevent further damage.

         - Failure to journal new extent metadata during extent tree
           growth, triggered under memory pressure or heavy writeback.
           Commonly results in ENOSPC, journal abort, and read-only
           fallback. **

         - Journal metadata failure during extent tree growth causes
           read-only fallback. Seen repeatedly on small-block (2k)
           filesystems under stress (e.g. fsstress). Triggers errors in
           bitmap and inode updates, and persists in journal replay logs.
           "Error count since last fsck" shows long-term corruption
           footprint.

** Reproduced on vanilla Linux with udelay(2000) **

Call trace (ENOSPC journal failure):
  do_writepages()
    → ext4_do_writepages()
      → ext4_map_blocks()
        → ext4_ext_map_blocks()
          → ext4_ext_insert_extent()
            → __ext4_handle_dirty_metadata()
              → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)

And so jbd2 still needs more work to avoid races with folio migration.
So replace the current spin lock solution by just skipping jbd buffers
on folio migration. We identify jbd buffers as its the only user of
set_buffer_meta() on __ext4_handle_dirty_metadata(). By checking for
buffer_meta() and bailing on migration we fix the existing racy ext4
corruption while also removing the spin lock to be held while sleeping
complaints originally reported by 0-day [5], and paves the way for
buffer-heads for more users of large folios other than the block
device cache.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1137609 # [0]
Link: https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20250408-ext4-jbd-force-corruption # [1]
Link: https://lkml.kernel.org/r/Z-ZwToVfJbdTVRtG@bombadil.infradead.org # [2]
Link: https://lore.kernel.org/all/Z-rzyrS0Jr7t984Y@bombadil.infradead.org/ # [3]
Link: https://lore.kernel.org/all/Z_AA9SHZdRGq6tb4@bombadil.infradead.org/ # [4]
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/oe-lkp/202503101536.27099c77-lkp@intel.com # [5]
Fixes: ebdf4de5642fb6 ("mm: migrate: fix reference  check race between __find_get_block() and migration")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/migrate.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f3ee6d8d5e2e..32fa72ba10b4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
 	if (folio_ref_count(src) != expected_count)
 		return -EAGAIN;
 
+	if (buffer_meta(head))
+		return -EAGAIN;
+
 	if (!buffer_migrate_lock_buffers(head, mode))
 		return -EAGAIN;
 
@@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
 			}
 			bh = bh->b_this_page;
 		} while (bh != head);
+		spin_unlock(&mapping->i_private_lock);
 		if (busy) {
 			if (invalidated) {
 				rc = -EAGAIN;
 				goto unlock_buffers;
 			}
-			spin_unlock(&mapping->i_private_lock);
 			invalidate_bh_lrus();
 			invalidated = true;
 			goto recheck_buffers;
@@ -880,10 +883,7 @@ static int __buffer_migrate_folio(struct address_space *mapping,
 		folio_set_bh(bh, dst, bh_offset(bh));
 		bh = bh->b_this_page;
 	} while (bh != head);
-
 unlock_buffers:
-	if (check_refs)
-		spin_unlock(&mapping->i_private_lock);
 	bh = head;
 	do {
 		unlock_buffer(bh);
-- 
2.47.2



  reply	other threads:[~2025-04-10  1:50 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-10  1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
2025-04-10  1:49 ` Luis Chamberlain [this message]
2025-04-10  3:18   ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Matthew Wilcox
2025-04-10 12:05   ` Jan Kara
2025-04-14 21:09     ` Luis Chamberlain
2025-04-14 22:19       ` Luis Chamberlain
2025-04-15  9:05         ` Christian Brauner
2025-04-15 15:47           ` Luis Chamberlain
2025-04-15 16:23             ` Jan Kara
2025-04-15 21:06               ` Luis Chamberlain
2025-04-16  2:02                 ` Davidlohr Bueso
2025-04-15 11:17         ` Jan Kara
2025-04-15 11:23       ` Jan Kara
2025-04-15 16:18     ` Luis Chamberlain
2025-04-15 16:28       ` Jan Kara
2025-04-16 16:58         ` Luis Chamberlain
2025-04-23 17:09           ` Jan Kara
2025-04-23 20:30             ` Luis Chamberlain
2025-04-25 22:51               ` Luis Chamberlain
2025-04-28 23:08                 ` Luis Chamberlain
2025-04-29  9:32                   ` Jan Kara
2025-04-15  1:36   ` Davidlohr Bueso
2025-04-15 11:25     ` Jan Kara
2025-04-15 18:14       ` Davidlohr Bueso
2025-04-10  1:49 ` [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups Luis Chamberlain
2025-04-10 14:38   ` Jan Kara
2025-04-10 17:38     ` Davidlohr Bueso
2025-04-10  1:49 ` [PATCH v2 3/8] fs/buffer: introduce __find_get_block_nonatomic() Luis Chamberlain
2025-04-10  1:49 ` [PATCH v2 4/8] fs/ocfs2: use sleeping version of __find_get_block() Luis Chamberlain
2025-04-10  1:49 ` [PATCH v2 5/8] fs/jbd2: " Luis Chamberlain
2025-04-10  1:49 ` [PATCH v2 6/8] fs/ext4: " Luis Chamberlain
2025-04-10 13:36   ` Jan Kara
2025-04-10 17:32     ` Davidlohr Bueso
2025-04-10  1:49 ` [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2 Luis Chamberlain
2025-04-10 13:40   ` Jan Kara
2025-04-10 17:30     ` Davidlohr Bueso
2025-04-14 12:12       ` Jan Kara
2025-04-10  1:49 ` [PATCH v2 8/8] mm: add migration buffer-head debugfs interface Luis Chamberlain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250410014945.2140781-2-mcgrof@kernel.org \
    --to=mcgrof@kernel.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=axboe@kernel.dk \
    --cc=brauner@kernel.org \
    --cc=da.gomez@samsung.com \
    --cc=dave@stgolabs.net \
    --cc=david@fromorbit.com \
    --cc=david@redhat.com \
    --cc=djwong@kernel.org \
    --cc=gost.dev@samsung.com \
    --cc=hannes@cmpxchg.org \
    --cc=hare@suse.de \
    --cc=jack@suse.cz \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=oliver.sang@intel.com \
    --cc=p.raghav@samsung.com \
    --cc=riel@surriel.com \
    --cc=ritesh.list@gmail.com \
    --cc=syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox