* possible regression fs corruption on 64GB nvme
From: Robert Beckett @ 2024-09-09 18:34 UTC (permalink / raw)
To: linux-nvme, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, Andrew Morton, linux-mm
Hi all,
We have been chasing an occasional fs corruption seen on 64GB NVMe devices and would like to ask for advice and ideas as to what could be going on.
The devices in question are small, cheap NVMe devices which are eMMC behind an NVMe bridge. They appear to be quite basic compared to other devices [1].
After a lot of testing, we managed to get a repro case that would trigger within 2-3 tests using the desync tool [2], reducing the repro time from a day or more to minutes. For repro steps see [3].
We bisected the issue to
da9619a30e73b dmapool: link blocks across pages
https://lore.kernel.org/all/20230126215125.4069751-12-kbusch@meta.com/T/#u
With this patch we fail to verify the image within 2-3 attempts.
When we revert this patch it verifies every time.
This appears to be a concurrent read issue. The desync tool we use for testing fires off many threads.
If I first cat the file to /dev/null to prime the page cache with the file, it verifies fine.
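(i.e. a plain "$ cat test_file > /dev/null" before the verify is enough to make it pass)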
I am currently attempting root cause analysis. As yet I don't know whether it is directly related to that dmapool patch or whether it is just exposing an underlying issue with the nvme driver.
For now, we are shipping with that patch reverted as a temporary fix while we work towards a root cause.
It was originally observed in the field after updating from 6.1-based to 6.5-based kernels. Upon further testing, this holds true with the easily reproducible test scenario as well. This seems like a regression.
Testing on torvalds' latest tree shows the issue is still there as of 88fac17500f4ea49c7bac136cf1b27e7b9980075
I thought I'd let you all know in case you want to issue a revert out of an abundance of caution.
Some other thoughts about the issue:
- we have received reports of occasional filesystem corruption on both btrfs and ext4 filesystems on the same disk, so this doesn't appear to be fs related
- it only seems to affect these 64GB single queue simple disks. Other devices with more capable disks have not shown this issue.
- using simple dd or md5sum testing does not show the issue. desync seems to be very parallel in its attack patterns.
- I was investigating a previous potential regression that was deemed not an issue https://lkml.org/lkml/2023/2/21/762 . I assume nvme doesn't need its addresses to be ordered. I'm not familiar with the spec.
I'd appreciate any advice you may have on why this dmapool patch could potentially cause or expose an issue with these nvme devices.
If any more info would be useful to help diagnose, I'll happily provide it.
Thanks
Bob
[1]
$ sudo nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1 /dev/ng0n1 NCE777D00B21D E2M2 64GB 0x1 61.87 GB / 61.87 GB 512 B + 0 B 10100080
$ sudo nvme get-feature /dev/nvme0n1
get-feature:0x01 (Arbitration), Current value:00000000
get-feature:0x02 (Power Management), Current value:00000000
get-feature:0x04 (Temperature Threshold), Current value:00000000
get-feature:0x05 (Error Recovery), Current value:00000000
get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
get-feature:0x07 (Number of Queues), Current value:00000000
get-feature:0x08 (Interrupt Coalescing), Current value:00000000
get-feature:0x09 (Interrupt Vector Configuration), Current value:00000000
get-feature:0x0a (Write Atomicity Normal), Current value:00000000
get-feature:0x0b (Async Event Configuration), Current value:00000000
get-feature:0x0c (Autonomous Power State Transition), Current value:00000000
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
get-feature:0x11 (Non-Operational Power State Config), Current value:00000000
[2]
https://github.com/folbricht/desync
[3]
$ dd if=/dev/urandom of=test_file bs=1M count=10240
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ desync verify-index test_file.caibx test_file
* Re: possible regression fs corruption on 64GB nvme
From: Keith Busch @ 2024-09-09 19:29 UTC (permalink / raw)
To: Robert Beckett
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
On Mon, Sep 09, 2024 at 07:34:15PM +0100, Robert Beckett wrote:
> After a lot of testing, we managed to get a repro case that would trigger within 2-3 tests using the desync tool [2], reducing the repro time from a day or more to minutes. For repro steps see [3].
> We bisected the issue to
>
> da9619a30e73b dmapool: link blocks across pages
> https://lore.kernel.org/all/20230126215125.4069751-12-kbusch@meta.com/T/#u
That's not the patch that was ultimately committed. Still, that's the
one I tested extensively with nvme, so the updated one shouldn't make a
difference for protocol.
> Some other thoughts about the issue:
>
> - we have received reports of occasional filesystem corruption on both btrfs and ext4 filesystems on the same disk, so this doesn't appear to be fs related
> - it only seems to affect these 64GB single queue simple disks. Other devices with more capable disks have not shown this issue.
> - using simple dd or md5sum testing does not show the issue. desync seems to be very parallel in its attack patterns.
> - I was investigating a previous potential regression that was deemed not an issue https://lkml.org/lkml/2023/2/21/762 . I assume nvme doesn't need its addresses to be ordered. I'm not familiar with the spec.
nvme should not care about address ordering. The dma buffers are all
pulled from the same pool for all threads, and could be dispatched in
different orders than what was allocated, so any order should be fine.
> I'd appreciate any advice you may have on why this dmapool patch could potentially cause or expose an issue with these nvme devices.
> If any more info would be useful to help diagnose, I'll happily provide it.
Did you try with CONFIG_SLUB_DEBUG_ON enabled?
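(CONFIG_SLUB_DEBUG_ON is the build-time switch; on a kernel that already has
CONFIG_SLUB_DEBUG built in, booting with the slub_debug parameter should be
roughly equivalent, e.g.:

  slub_debug=FZPU   (sanity checks, red zoning, poisoning, alloc/free tracking)
)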
* Re: possible regression fs corruption on 64GB nvme
From: Keith Busch @ 2024-09-09 20:29 UTC (permalink / raw)
To: Robert Beckett
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
On Mon, Sep 09, 2024 at 07:34:15PM +0100, Robert Beckett wrote:
> - it only seems to affect these 64GB single queue simple disks. Other devices with more capable disks have not shown this issue.
As a test, could you try kernel parameter "nvme.io_queue_depth_set=2"?
* Re: possible regression fs corruption on 64GB nvme
From: Keith Busch @ 2024-09-09 20:31 UTC (permalink / raw)
To: Robert Beckett
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
On Mon, Sep 09, 2024 at 02:29:14PM -0600, Keith Busch wrote:
> As a test, could you try kernel parameter "nvme.io_queue_depth_set=2"?
Err, I mean "nvme.io_queue_depth=2".
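For a quick test it can go on the kernel command line exactly like that
(or "options nvme io_queue_depth=2" in a modprobe.d file if nvme is built
as a module), and the value in effect can be read back after boot:

$ cat /sys/module/nvme/parameters/io_queue_depth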
* Re: possible regression fs corruption on 64GB nvme
From: Christoph Hellwig @ 2024-09-10 4:24 UTC (permalink / raw)
To: Robert Beckett
Cc: linux-nvme, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, Andrew Morton, linux-mm
Hi Robert,
what platform is this on? Does it have DMA that is not cache coherent?
* Re: possible regression fs corruption on 64GB nvme
From: Robert Beckett @ 2024-09-10 9:30 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
---- On Mon, 09 Sep 2024 21:31:41 +0100 Keith Busch wrote ---
> On Mon, Sep 09, 2024 at 02:29:14PM -0600, Keith Busch wrote:
> > As a test, could you try kernel parameter "nvme.io_queue_depth_set=2"?
>
> Err, I mean "nvme.io_queue_depth=2".
>
Thanks, I'll give it a try along with your other questions and report back.
For clarity, the repro steps dropped a step. They should have included the make command:
$ dd if=/dev/urandom of=test_file bs=1M count=10240
$ desync make test_file.caibx test_file
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ desync verify-index test_file.caibx test_file
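Since it usually fails within 2-3 attempts, looping the last two steps is
enough to hit it, e.g.:

$ for i in $(seq 5); do sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"; desync verify-index test_file.caibx test_file || break; done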
* Re: possible regression fs corruption on 64GB nvme
From: Robert Beckett @ 2024-09-10 9:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, Keith Busch, Jens Axboe, Sagi Grimberg,
Andrew Morton, linux-mm
---- On Tue, 10 Sep 2024 05:24:11 +0100 Christoph Hellwig wrote ---
> Hi Robert,
>
> what platform is this on? Does it have DMA that is not cache coherent?
>
>
It's an AMD Zen2 based device, so it should be coherent.
It also didn't occur on the 6.1 kernel, making this feel more like a regression to me (though never discount statistics).
* Re: possible regression fs corruption on 64GB nvme
From: Robert Beckett @ 2024-09-10 17:27 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
---- On Tue, 10 Sep 2024 10:30:18 +0100 Robert Beckett wrote ---
>
>
>
>
>
> ---- On Mon, 09 Sep 2024 21:31:41 +0100 Keith Busch wrote ---
> > On Mon, Sep 09, 2024 at 02:29:14PM -0600, Keith Busch wrote:
> > > As a test, could you try kernel parameter "nvme.io_queue_depth_set=2"?
> >
> > Err, I mean "nvme.io_queue_depth=2".
> >
>
> Thanks, I'll give it a try along with your other questions and report back.
>
> For clarity, the repro steps dropped a step. They should have included the make command:
>
>
> $ dd if=/dev/urandom of=test_file bs=1M count=10240
> $ desync make test_file.caibx test_file
> $ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
> $ desync verify-index test_file.caibx test_file
>
CONFIG_SLUB_DEBUG_ON showed no debug output.
nvme.io_queue_depth=2 appears to fix it. Could you explain the implications of this?
I assume it is limiting to 2 outstanding requests concurrently.
Does it suggest an issue with the specific device's FW?
I assume this would suggest that it is not actually anything wrong with the dmapool, it was just exposing the issue of the device/fw?
Any advice for handling this and/or investigating further?
My initial speculation was that maybe the disk fw is signalling completion of an access before it has actually finished making its way to RAM. I checked the code and saw that the dmapool appears to be used for storing the buffer page addresses, so I imagine that is not updated by the disk at all, which would rule out my assumption.
I'd appreciate any insight you could give on the usage of the dmapools in the driver and whether you would expect them to be significant in this issue, or if they are just making a device/fw bug more observable.
Thanks
Bob
P.S. Here is a transcript of the issue seen in testing. To my knowledge, if everything is working as it should, nothing should be able to produce this output, where dropping the caches and then re-priming the page cache via a linear read fixes the verification.
$ dd if=/dev/urandom of=test_file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 111.609 s, 96.2 MB/s
$ desync make test_file.caibx test_file
Chunking [=======================================================================================================================================] 100.00% 18s
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ desync verify-index test_file.caibx test_file
[=============>-----------------------------------------------------------------------------------------------------------------------------------] 9.00% 4s
Error: seed index for test_file doesn't match its data
$ md5sum test_file
ce4f1cca0b3dfd63ea2adfd745e4bfc1 test_file
$ sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
$ md5sum test_file
1edb3eaf5ae57b6187cc0be843ed2e5c test_file
$ desync verify-index test_file.caibx test_file
[=================================================================================================================================================] 100.00% 5s
* Re: possible regression fs corruption on 64GB nvme
From: Keith Busch @ 2024-09-10 17:53 UTC (permalink / raw)
To: Robert Beckett
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
On Tue, Sep 10, 2024 at 06:27:55PM +0100, Robert Beckett wrote:
> nvme.io_queue_depth=2 appears to fix it. Could you explain the implications of this?
> I assume it is limiting to 2 outstanding requests concurrently.
You'd think so, but not quite. NVMe queues need to leave one entry
empty, so a submission queue with depth "2" means you can have at most 1
command outstanding.
> Does it suggest an issue with the specific device's FW?
I think that sounds probable. Especially considering the dmapool code
has had considerable run time in real life, and no other such issue has
been reported.
> I assume this would suggest that it is not actually anything wrong with the dmapool, it was just exposing the issue of the device/fw?
That's what I'm thinking, though, if you have a single queue with depth
2, we're not stressing the dmapool implementation either. It's always
going to return the same dma block for each command.
> Any advice for handling this and/or investigating further?
If you have the resources for it, get protocol analyzer trace and show
it to your nvme vendor.
> My initial speculation was that maybe the disk fw is signalling completion of an access before it has actually finished making its way to RAM. I checked the code and saw that the dmapool appears to be used for storing the buffer page addresses, so I imagine that is not updated by the disk at all, which would rule out my assumption.
Right, it's used to make the prp/sgl list. Once we get a completion,
that dma block becomes immediately available for the very next command.
If you have a higher queue depth, it's possible that dma block is reused
immediately while the driver is still notifying the block layer of the
completion.
If we're thinking that the device is completing the command before it's
really done with the list (which could explain your observation), that
would be a problem. Going to single queue-depth might introduce a delay
or work around some firmware issue when dealing with concurrent
commands.
Prior to the "new" dmapool allocation, it was much less likely (though I
think still possible) for your next command to reuse the same dma block
of the command currently being completed.
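To make that concrete, the per-command lifecycle is roughly the following
(sketch only, simplified from the driver; the helper names here are made up):

  #include <linux/dmapool.h>
  #include <linux/gfp.h>
  #include <linux/types.h>

  static __le64 *sketch_alloc_prp_list(struct dma_pool *pool, dma_addr_t *dma)
  {
          /* submission: take one block from the shared pool and fill it
           * with the PRP entries describing the data buffer */
          return dma_pool_alloc(pool, GFP_ATOMIC, dma);
  }

  static void sketch_free_prp_list(struct dma_pool *pool, __le64 *list,
                                   dma_addr_t dma)
  {
          /* completion: the block goes straight back to the pool, so the
           * very next submission may be handed this exact block before the
           * block layer has even finished processing this completion */
          dma_pool_free(pool, list, dma);
  }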
* Re: possible regression fs corruption on 64GB nvme
From: Robert Beckett @ 2024-09-11 16:56 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
---- On Tue, 10 Sep 2024 18:53:23 +0100 Keith Busch wrote ---
> On Tue, Sep 10, 2024 at 06:27:55PM +0100, Robert Beckett wrote:
> > nvme.io_queue_depth=2 appears to fix it. Could you explain the implications of this?
> > I assume it is limiting to 2 outstanding requests concurrently.
>
> You'd think so, but not quite. NVMe queues need to leave one entry
> empty, so a submission queue with depth "2" means you can have at most 1
> command outstanding.
>
> > Does it suggest an issue with the specific device's FW?
>
> I think that sounds probable. Especially considering the dmapool code
> has had considerable run time in real life, and no other such issue has
> been reported.
>
> > I assume this would suggest that it is not actually anything wrong with the dmapool, it was just exposing the issue of the device/fw?
>
> That's what I'm thinking, though, if you have a single queue with depth
> 2, we're not stressing the dmapool implementation either. It's always
> going to return the same dma block for each command.
>
> > Any advice for handling this and/or investigating further?
>
> If you have the resources for it, get protocol analyzer trace and show
> it to your nvme vendor.
Unfortunately this is infeasible for us.
>
> > My initial speculation was that maybe the disk fw is signalling completion of an access before it has actually finished making its way to RAM. I checked the code and saw that the dmapool appears to be used for storing the buffer page addresses, so I imagine that is not updated by the disk at all, which would rule out my assumption.
>
> Right, it's used to make the prp/sgl list. Once we get a completion,
> that dma block becomes immediately available for the very next command.
> If you have a higher queue depth, it's possible that dma block is reused
> immediately while the driver is still notifying the block layer of the
> completion.
>
> If we're thinking that the device is completing the command before it's
> really done with the list (which could explain your observation), that
> would be a problem. Going to single queue-depth might introduce a delay
> or work around some firmware issue when dealing with concurrent
> commands.
>
> Prior to the "new" dmapool allocation, it was much less likely (though I
> think still possible) for your next command to reuse the same dma block
> of the command currently being completed.
>
given this ~9 year old temporary fix is still in the kernel for the Apple device, could we just add another device specific override? I could maybe convert it to a quirk that is set for them both (and any future devices)
* Re: possible regression fs corruption on 64GB nvme
From: Robert Beckett @ 2024-09-11 16:57 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
---- On Wed, 11 Sep 2024 17:56:37 +0100 Robert Beckett wrote ---
> ---- On Tue, 10 Sep 2024 18:53:23 +0100 Keith Busch wrote ---
> > On Tue, Sep 10, 2024 at 06:27:55PM +0100, Robert Beckett wrote:
> > > nvme.io_queue_depth=2 appears to fix it. Could you explain the implications of this?
> > > I assume it is limiting to 2 outstanding requests concurrently.
> >
> > You'd think so, but not quite. NVMe queues need to leave one entry
> > empty, so a submission queue with depth "2" means you can have at most 1
> > command outstanding.
> >
> > > Does it suggest an issue with the specific device's FW?
> >
> > I think that sounds probable. Especially considering the dmapool code
> > has had considerable run time in real life, and no other such issue has
> > been reported.
> >
> > > I assume this would suggest that it is not actually anything wrong with the dmapool, it was just exposing the issue of the device/fw?
> >
> > That's what I'm thinking, though, if you have a single queue with depth
> > 2, we're not stressing the dmapool implementation either. It's always
> > going to return the same dma block for each command.
> >
> > > Any advice for handling this and/or investigating further?
> >
> > If you have the resources for it, get protocol analyzer trace and show
> > it to your nvme vendor.
>
> Unfortunately this is infeasible for us.
>
> >
> > > My initial speculation was that maybe the disk fw is signalling completion of an access before it has actually finished making its way to RAM. I checked the code and saw that the dmapool appears to be used for storing the buffer page addresses, so I imagine that is not updated by the disk at all, which would rule out my assumption.
> >
> > Right, it's used to make the prp/sgl list. Once we get a completion,
> > that dma block becomes immediately available for the very next command.
> > If you have a higher queue depth, it's possible that dma block is reused
> > immediately while the driver is still notifying the block layer of the
> > completion.
> >
> > If we're thinking that the device is completing the command before it's
> > really done with the list (which could explain your observation), that
> > would be a problem. Going to single queue-depth might introduce a delay
> > or work around some firmware issue when dealing with concurrent
> > commands.
> >
> > Prior to the "new" dmapool allocation, it was much less likely (though I
> > think still possible) for your next command to reuse the same dma block
> > of the command currently being completed.
> >
>
> given this ~9 year old temporary fix is still in the kernel for the Apple device, could we just add another device specific override? I could maybe convert it to a quirk that is set for them both (and any future devices)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c?h=v6.11-rc7#n2570
>
* Re: possible regression fs corruption on 64GB nvme
From: Keith Busch @ 2024-09-11 17:08 UTC (permalink / raw)
To: Robert Beckett
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
On Wed, Sep 11, 2024 at 05:56:37PM +0100, Robert Beckett wrote:
>
> given this ~9 year old temporary fix is still in the kernel for the
> Apple device, could we just add another device specific override? I
> could maybe convert it to a quirk that is set for them both (and any
> future devices)
Sure, that's an option. Do you want to send a patch? Alternatively you
can reply with the PCI VID:DID of your problematic device and I'll write
one for testing and consideration.
And just fyi, there are potential performance implications for doing
this. It's less noticeable on higher latency devices, which sounds like
what you have anyway, so may not matter much for you. Just getting that
out there to avoid any surprises.
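A quick way to quantify the impact is a direct random read run with fio at
a higher iodepth, with and without the parameter, e.g.:

$ fio --name=qd-test --readonly --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based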
* Re: possible regression fs corruption on 64GB nvme
From: Robert Beckett @ 2024-09-11 17:17 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Andrew Morton, linux-mm
---- On Wed, 11 Sep 2024 18:08:50 +0100 Keith Busch wrote ---
> On Wed, Sep 11, 2024 at 05:56:37PM +0100, Robert Beckett wrote:
> >
> > given this ~9 year old temporary fix is still in the kernel for the
> > Apple device, could we just add another device specific override? I
> > could maybe convert it to a quirk that is set for them both (and any
> > future devices)
>
> Sure, that's an option. Do you want to send a patch? Alternatively you
> can reply with the PCI VID:DID of your problematic device and I'll write
> one for testing and consideration.
looks like it's 1217:8760
01:00.0 Non-Volatile memory controller [0108]: O2 Micro, Inc. Device [1217:8760] (rev 01) (prog-if 02 [NVM Express])
Subsystem: O2 Micro, Inc. Device [1217:0002]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 41
Region 0: Memory at 80600000 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: nvme
Kernel modules: nvme
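Something along these lines is roughly what I had in mind (untested sketch,
modelled on the existing Apple 106b:2001 check in drivers/nvme/host/pci.c;
exact placement and any quirk plumbing still to be worked out):

  /* untested sketch: cap the O2 Micro 1217:8760 bridge to a single
   * outstanding command, alongside the existing Apple override */
  if (pdev->vendor == 0x1217 && pdev->device == 0x8760) {
          dev->q_depth = 2;
          dev_warn(dev->ctrl.device,
                   "detected O2 Micro NVMe bridge, limiting queue depth to %u\n",
                   dev->q_depth);
  }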
>
> And just fyi, there are potential performance implications for doing
> this. It's less noticeable on higher latency devices, which sounds like
> what you have anyway, so may not matter much for you. Just getting that
> out there to avoid any surprises.
Understood. I'll do some benchmarking with and without it to check the difference.
I think for now it is safer to get it in the code, as it can cause corruption. We can always revert it if we figure out a fix with less performance degradation.
Cheers muchly!