linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: brauner@kernel.org, jack@suse.cz, tytso@mit.edu,
	 adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org,
	riel@surriel.com, dave@stgolabs.net,  willy@infradead.org,
	hannes@cmpxchg.org, oliver.sang@intel.com, david@redhat.com,
	 axboe@kernel.dk, hare@suse.de, david@fromorbit.com,
	djwong@kernel.org,  ritesh.list@gmail.com,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	 linux-mm@kvack.org, gost.dev@samsung.com, p.raghav@samsung.com,
	da.gomez@samsung.com,
	 syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com
Subject: Re: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration
Date: Thu, 10 Apr 2025 14:05:38 +0200	[thread overview]
Message-ID: <dpn6pb7hwpmajoh5k5zla6x7fsmh4rlttstj3hkuvunp6tok3j@ikz2fxpikfv4> (raw)
In-Reply-To: <20250410014945.2140781-2-mcgrof@kernel.org>

On Wed 09-04-25 18:49:38, Luis Chamberlain wrote:
> Filesystems which use buffer-heads where it cannot guarantees that there
> are no other references to the folio, for example with a folio
> lock, must use buffer_migrate_folio_norefs() for the address space
> mapping migrate_folio() callback. There are only 3 filesystems which use
> this callback:
> 
>   1) the block device cache

Well, but through this also all simple filesystems that use buffer_heads
for metadata handling...

>   2) ext4 for its ext4_journalled_aops, ie, jbd2
>   3) nilfs2
> 
> jbd2's use of this however callback however is very race prone, consider
> folio migration while reviewing jbd2_journal_write_metadata_buffer()
> and the fact that jbd2:
> 
>   - does not hold the folio lock
>   - does not have have page writeback bit set
>   - does not lock the buffer
>
> And so, it can race with folio_set_bh() on folio migration. The commit
> ebdf4de5642fb6 ("mm: migrate: fix reference  check race between
> __find_get_block() and migration") added a spin lock to prevent races
> with page migration which ext4 users were reporting through the SUSE
> bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the
> original filesystem corruption we can can reproduce fs corruption on
> ext4 by just removing the spinlock and stress testing the filesystem
> with generic/750, we eventually end up after 3 hours of testing with
> kdevops using libvirt on the ext4 profiles ext4-4k and ext4-2k.

Correct, jbd2 holds bh reference (its private jh structure attached to
bh->b_private holds it) and that is expected to protect jbd2 from anybody
else mucking with the bh.

> It turns out that the spin lock doesn't in the end protect against
> corruption, it *helps* reduce the possibility, but ext4 filesystem
> corruption can still happen even with the spin lock held. A test was
> done using vanilla Linux and adding a udelay(2000) right before we
> spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and
> we can reproduce the same exact filesystem corruption issues as observed
> without the spinlock with generic/750 [1].

This is unexpected.

> ** Reproduced on vanilla Linux with udelay(2000) **
> 
> Call trace (ENOSPC journal failure):
>   do_writepages()
>     → ext4_do_writepages()
>       → ext4_map_blocks()
>         → ext4_ext_map_blocks()
>           → ext4_ext_insert_extent()
>             → __ext4_handle_dirty_metadata()
>               → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC)

Curious. Did you try running e2fsck after the filesystem complained like
this? This complains about journal handle not having enough credits for
needed metadata update. Either we've lost some update to the journal_head
structure (b_modified got accidentally cleared) or some update to extent
tree.

> And so jbd2 still needs more work to avoid races with folio migration.
> So replace the current spin lock solution by just skipping jbd buffers
> on folio migration. We identify jbd buffers as its the only user of
> set_buffer_meta() on __ext4_handle_dirty_metadata(). By checking for
> buffer_meta() and bailing on migration we fix the existing racy ext4
> corruption while also removing the spin lock to be held while sleeping
> complaints originally reported by 0-day [5], and paves the way for
> buffer-heads for more users of large folios other than the block
> device cache.

I think we need to understand why private_lock protection does not protect
bh users holding reference like jbd2 from folio migration before papering
over this problem with the hack. Because there are chances other simple
filesystems suffer from the same problem...

> diff --git a/mm/migrate.c b/mm/migrate.c
> index f3ee6d8d5e2e..32fa72ba10b4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
>  
> +	if (buffer_meta(head))
> +		return -EAGAIN;
> +
>  	if (!buffer_migrate_lock_buffers(head, mode))
>  		return -EAGAIN;
>  
> @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping,
>  			}
>  			bh = bh->b_this_page;
>  		} while (bh != head);
> +		spin_unlock(&mapping->i_private_lock);

No, you've just broken all simple filesystems (like ext2) with this patch.
You can reduce the spinlock critical section only after providing
alternative way to protect them from migration. So this should probably
happen at the end of the series.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


  parent reply	other threads:[~2025-04-10 12:05 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-10  1:49 [PATCH v2 0/8] mm: enhance migration work around on noref buffer-heads Luis Chamberlain
2025-04-10  1:49 ` [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Luis Chamberlain
2025-04-10  3:18   ` Matthew Wilcox
2025-04-10 12:05   ` Jan Kara [this message]
2025-04-14 21:09     ` Luis Chamberlain
2025-04-14 22:19       ` Luis Chamberlain
2025-04-15  9:05         ` Christian Brauner
2025-04-15 15:47           ` Luis Chamberlain
2025-04-15 16:23             ` Jan Kara
2025-04-15 21:06               ` Luis Chamberlain
2025-04-16  2:02                 ` Davidlohr Bueso
2025-04-15 11:17         ` Jan Kara
2025-04-15 11:23       ` Jan Kara
2025-04-15 16:18     ` Luis Chamberlain
2025-04-15 16:28       ` Jan Kara
2025-04-16 16:58         ` Luis Chamberlain
2025-04-23 17:09           ` Jan Kara
2025-04-23 20:30             ` Luis Chamberlain
2025-04-25 22:51               ` Luis Chamberlain
2025-04-28 23:08                 ` Luis Chamberlain
2025-04-29  9:32                   ` Jan Kara
2025-04-15  1:36   ` Davidlohr Bueso
2025-04-15 11:25     ` Jan Kara
2025-04-15 18:14       ` Davidlohr Bueso
2025-04-10  1:49 ` [PATCH v2 2/8] fs/buffer: try to use folio lock for pagecache lookups Luis Chamberlain
2025-04-10 14:38   ` Jan Kara
2025-04-10 17:38     ` Davidlohr Bueso
2025-04-10  1:49 ` [PATCH v2 3/8] fs/buffer: introduce __find_get_block_nonatomic() Luis Chamberlain
2025-04-10  1:49 ` [PATCH v2 4/8] fs/ocfs2: use sleeping version of __find_get_block() Luis Chamberlain
2025-04-10  1:49 ` [PATCH v2 5/8] fs/jbd2: " Luis Chamberlain
2025-04-10  1:49 ` [PATCH v2 6/8] fs/ext4: " Luis Chamberlain
2025-04-10 13:36   ` Jan Kara
2025-04-10 17:32     ` Davidlohr Bueso
2025-04-10  1:49 ` [PATCH v2 7/8] mm/migrate: enable noref migration for jbd2 Luis Chamberlain
2025-04-10 13:40   ` Jan Kara
2025-04-10 17:30     ` Davidlohr Bueso
2025-04-14 12:12       ` Jan Kara
2025-04-10  1:49 ` [PATCH v2 8/8] mm: add migration buffer-head debugfs interface Luis Chamberlain

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dpn6pb7hwpmajoh5k5zla6x7fsmh4rlttstj3hkuvunp6tok3j@ikz2fxpikfv4 \
    --to=jack@suse.cz \
    --cc=adilger.kernel@dilger.ca \
    --cc=axboe@kernel.dk \
    --cc=brauner@kernel.org \
    --cc=da.gomez@samsung.com \
    --cc=dave@stgolabs.net \
    --cc=david@fromorbit.com \
    --cc=david@redhat.com \
    --cc=djwong@kernel.org \
    --cc=gost.dev@samsung.com \
    --cc=hannes@cmpxchg.org \
    --cc=hare@suse.de \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcgrof@kernel.org \
    --cc=oliver.sang@intel.com \
    --cc=p.raghav@samsung.com \
    --cc=riel@surriel.com \
    --cc=ritesh.list@gmail.com \
    --cc=syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox