linux-mm.kvack.org archive mirror
* bcachefs dropped writes with lockless buffered io path, COMPACTION/MIGRATION=y
@ 2024-08-27  3:29 Kent Overstreet
  2024-08-27  3:36 ` Matthew Wilcox
  2024-08-27  7:57 ` clonejo
  0 siblings, 2 replies; 5+ messages in thread
From: Kent Overstreet @ 2024-08-27  3:29 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-bcachefs

We had a report of corruption on NixOS, in tests that build a system
image; it bisected to the patch that enabled buffered writes without
taking the inode lock:

https://evilpiepirate.org/git/bcachefs.git/commit/?id=7e64c86cdc6c

It appears that dirty folios are being dropped somehow; corrupt files,
when checked against good copies, have ranges of 0s that are 4k aligned
(modulo 2k, likely a misaligned partition).
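
For anyone who wants to reproduce the comparison, something like the
following sketch (file paths and names are placeholders, not from the
actual report) finds the zero-filled runs and their offsets when diffing
a corrupt file against a good copy:

```python
"""Sketch: locate zero-filled runs in a corrupt file by diffing it
against a known-good copy. Paths here are placeholders."""

def zero_ranges(corrupt_path, good_path, block=4096):
    """Yield (start, length) for each run of blocks that is all zeroes
    in the corrupt file but differs from the good copy."""
    with open(corrupt_path, "rb") as c, open(good_path, "rb") as g:
        offset = 0
        start = None
        while True:
            cb = c.read(block)
            gb = g.read(block)
            if not cb:
                break
            # A block counts as corrupted if it is all zeroes while the
            # good copy has different contents at the same offset.
            bad = cb != gb and cb.count(0) == len(cb)
            if bad and start is None:
                start = offset
            elif not bad and start is not None:
                yield start, offset - start
                start = None
            offset += len(cb)
        if start is not None:
            yield start, offset - start
```

Printing `start % 4096` for each reported run is enough to see the 4k
alignment and any constant 2k shift.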

Interestingly, it only triggers with QEMU: the test fails pretty
consistently, and we have a lot of NixOS users, so we'd notice (via nix
store verification) if the corruption were more widespread. We believe it
only triggers with QEMU's snapshot mode (but don't quote me on that).

Further digging implicates CONFIG_COMPACTION or CONFIG_MIGRATION.

Testing with COMPACTION=n, MIGRATION=n, and TRANSPARENT_HUGEPAGE=y passes
reliably.
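
Spelled out as a .config fragment (symbol names assumed from mainline
Kconfig; dependency resolution may force some of these back on):

```
# CONFIG_COMPACTION is not set
# CONFIG_MIGRATION is not set
CONFIG_TRANSPARENT_HUGEPAGE=y
```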

On the bcachefs side, I've been testing with that patch reduced to just
"don't take inode lock if not extending"; i.e. killing the fancy stuff
to preserve write atomicity. It really does appear to be "don't take
inode lock -> dirty folios get dropped".

It's not a race with truncate, or anything silly like that; bcachefs has
the pagecache add lock, which serves here for locking vs. truncate.

So this is a real head-scratcher. The inode lock really doesn't do
much in the IO paths; it's there for synchronization with truncate and for
write vs. write atomicity, and the mm paths know nothing about it. The page
fault/mkwrite paths don't take it at all, so a buffered non-extending write
should be able to work the same way: the folio lock should be entirely
sufficient here.

Anyone got any bright ideas?


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: bcachefs dropped writes with lockless buffered io path, COMPACTION/MIGRATION=y
  2024-08-27  3:29 bcachefs dropped writes with lockless buffered io path, COMPACTION/MIGRATION=y Kent Overstreet
@ 2024-08-27  3:36 ` Matthew Wilcox
  2024-08-27  3:40   ` Kent Overstreet
  2024-08-27  7:57 ` clonejo
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2024-08-27  3:36 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-fsdevel, linux-mm, linux-bcachefs

On Mon, Aug 26, 2024 at 11:29:52PM -0400, Kent Overstreet wrote:
> We had a report of corruption on NixOS, in tests that build a system
> image; it bisected to the patch that enabled buffered writes without
> taking the inode lock:
> 
> https://evilpiepirate.org/git/bcachefs.git/commit/?id=7e64c86cdc6c
> 
> It appears that dirty folios are being dropped somehow; corrupt files,
> when checked against good copies, have ranges of 0s that are 4k aligned
> (modulo 2k, likely a misaligned partition).
> 
> Interestingly, it only triggers with QEMU: the test fails pretty
> consistently, and we have a lot of NixOS users, so we'd notice (via nix
> store verification) if the corruption were more widespread. We believe it
> only triggers with QEMU's snapshot mode (but don't quote me on that).

Just to be crystal clear here, the corruption happens while running
bcachefs in the qemu guest, and it doesn't matter what the host
filesystem is?

Or did I misunderstand, and it occurs while running anything inside qemu
on top of a bcachefs host?

> Further digging implicates CONFIG_COMPACTION or CONFIG_MIGRATION.
> 
> Testing with COMPACTION=n, MIGRATION=n, and TRANSPARENT_HUGEPAGE=y passes
> reliably.
> 
> On the bcachefs side, I've been testing with that patch reduced to just
> "don't take inode lock if not extending"; i.e. killing the fancy stuff
> to preserve write atomicity. It really does appear to be "don't take
> inode lock -> dirty folios get dropped".
> 
> It's not a race with truncate, or anything silly like that; bcachefs has
> the pagecache add lock, which serves here for locking vs. truncate.
> 
> So this is a real head-scratcher. The inode lock really doesn't do
> much in the IO paths; it's there for synchronization with truncate and for
> write vs. write atomicity, and the mm paths know nothing about it. The page
> fault/mkwrite paths don't take it at all, so a buffered non-extending write
> should be able to work the same way: the folio lock should be entirely
> sufficient here.
> 
> Anyone got any bright ideas?

No, but I'm going to sleep on it.



* Re: bcachefs dropped writes with lockless buffered io path, COMPACTION/MIGRATION=y
  2024-08-27  3:36 ` Matthew Wilcox
@ 2024-08-27  3:40   ` Kent Overstreet
  2024-08-27  3:46     ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2024-08-27  3:40 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-bcachefs

On Tue, Aug 27, 2024 at 04:36:33AM GMT, Matthew Wilcox wrote:
> On Mon, Aug 26, 2024 at 11:29:52PM -0400, Kent Overstreet wrote:
> > We had a report of corruption on NixOS, in tests that build a system
> > image; it bisected to the patch that enabled buffered writes without
> > taking the inode lock:
> > 
> > https://evilpiepirate.org/git/bcachefs.git/commit/?id=7e64c86cdc6c
> > 
> > It appears that dirty folios are being dropped somehow; corrupt files,
> > when checked against good copies, have ranges of 0s that are 4k aligned
> > (modulo 2k, likely a misaligned partition).
> > 
> > Interestingly, it only triggers with QEMU: the test fails pretty
> > consistently, and we have a lot of NixOS users, so we'd notice (via nix
> > store verification) if the corruption were more widespread. We believe it
> > only triggers with QEMU's snapshot mode (but don't quote me on that).
> 
> Just to be crystal clear here, the corruption happens while running
> bcachefs in the qemu guest, and it doesn't matter what the host
> filesystem is?
> 
> Or did I misunderstand, and it occurs while running anything inside qemu
> on top of a bcachefs host?

The host is running bcachefs, backing qemu's disk image.

(And I'm using nested virtualization for bisecting; it's been a lot to
keep straight.)



* Re: bcachefs dropped writes with lockless buffered io path, COMPACTION/MIGRATION=y
  2024-08-27  3:40   ` Kent Overstreet
@ 2024-08-27  3:46     ` Kent Overstreet
  0 siblings, 0 replies; 5+ messages in thread
From: Kent Overstreet @ 2024-08-27  3:46 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-bcachefs

On Mon, Aug 26, 2024 at 11:40:57PM GMT, Kent Overstreet wrote:
> On Tue, Aug 27, 2024 at 04:36:33AM GMT, Matthew Wilcox wrote:
> > On Mon, Aug 26, 2024 at 11:29:52PM -0400, Kent Overstreet wrote:
> > > We had a report of corruption on NixOS, in tests that build a system
> > > image; it bisected to the patch that enabled buffered writes without
> > > taking the inode lock:
> > > 
> > > https://evilpiepirate.org/git/bcachefs.git/commit/?id=7e64c86cdc6c
> > > 
> > > It appears that dirty folios are being dropped somehow; corrupt files,
> > > when checked against good copies, have ranges of 0s that are 4k aligned
> > > (modulo 2k, likely a misaligned partition).
> > > 
> > > Interestingly, it only triggers with QEMU: the test fails pretty
> > > consistently, and we have a lot of NixOS users, so we'd notice (via nix
> > > store verification) if the corruption were more widespread. We believe it
> > > only triggers with QEMU's snapshot mode (but don't quote me on that).
> > 
> > Just to be crystal clear here, the corruption happens while running
> > bcachefs in the qemu guest, and it doesn't matter what the host
> > filesystem is?
> > 
> > Or did I misunderstand, and it occurs while running anything inside qemu
> > on top of a bcachefs host?
> 
> The host is running bcachefs, backing qemu's disk image.
> 
> (And I'm using nested virtualization for bisecting; it's been a lot to
> keep straight.)

Also, the size of the missing data is not a power of two, so it's not a
single folio.



* Re: bcachefs dropped writes with lockless buffered io path, COMPACTION/MIGRATION=y
  2024-08-27  3:29 bcachefs dropped writes with lockless buffered io path, COMPACTION/MIGRATION=y Kent Overstreet
  2024-08-27  3:36 ` Matthew Wilcox
@ 2024-08-27  7:57 ` clonejo
  1 sibling, 0 replies; 5+ messages in thread
From: clonejo @ 2024-08-27  7:57 UTC (permalink / raw)
  To: Kent Overstreet, linux-fsdevel, linux-mm, linux-bcachefs

On 27/08/2024 05:29, Kent Overstreet wrote:
> Interestingly, it only triggers with QEMU: the test fails pretty
> consistently, and we have a lot of NixOS users, so we'd notice (via nix
> store verification) if the corruption were more widespread. We believe it
> only triggers with QEMU's snapshot mode (but don't quote me on that).

I have had this with 6.10.6 #1-NixOS (with the nocow_lock patch added)
_on bare hardware_, no QEMU involved.

So far, I have seen it only on files in /var/lib/nixos, similar to
https://github.com/NixOS/nixpkgs/issues/126971.

nix-store --verify --check-contents is still running, but no errors yet.

Also, copygc and rebalance block suspend for me, though the fs gets
unmounted fine on shutdown.

Mount options:
> /dev/nvme0n1p2:/dev/sdc3:/dev/sdd2:/dev/sde3:/dev/sdb3:/dev/sda3 on / type bcachefs (rw,relatime,metadata_replicas=2,data_replicas=2,background_compression=zstd,metadata_target=ssd,foreground_target=ssd.tlc,background_target=hdd,promote_target=ssd)




end of thread, other threads:[~2024-08-27  7:58 UTC | newest]

