I'm hearing a bunch of FUD around NVMe hotplug but precious little in the way of bug reports! Keith Busch has been doing a stellar job of fixing up the bugs that he's found, but I have seen precisely zero hotplug bugs reported to the NVMe mailing list. So put up or shut up.

On Thu, May 8, 2014 at 5:37 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> I'd like to have a discussion about handling device errors.
>
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.
>
> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).

I think there's a more general problem that's worth talking about
here. In addition to IOMMU faults, there are lots of other PCI errors
that can happen, and we have some small number of drivers that have
been "hardened" to try and recover from these errors. However, even
for these "hardened" drivers it seems pretty easy to hit deadlocks
when the driver tries to tear down and reinitialize things.

So I wonder if we can do better without proliferating error handling
tentacles into all sorts of low-level drivers ("did we just read
0xffffffff here? how about here? are we in the middle of error
recovery? how about now?").
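
To make the "did we just read 0xffffffff here?" worry concrete, here is a
minimal user-space sketch of the check that every register read would need.
It is not code from any real driver: struct fake_dev, fake_readl() and
read_status() are hypothetical names invented for illustration, and the
"device" is simulated in memory rather than accessed over PCIe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCIe device: one register plus a presence flag. */
struct fake_dev {
	bool present;
	uint32_t status_reg;
};

/*
 * Simulated MMIO read. A read from a device that has gone away
 * completes with all ones, which is modelled here with UINT32_MAX.
 */
static uint32_t fake_readl(const struct fake_dev *dev)
{
	return dev->present ? dev->status_reg : UINT32_MAX;
}

/*
 * The per-call-site check: treat all ones as "device may be gone".
 * Returns true and stores the value on success; returns false and
 * leaves *val untouched otherwise.
 */
static bool read_status(const struct fake_dev *dev, uint32_t *val)
{
	uint32_t v = fake_readl(dev);

	if (v == UINT32_MAX)
		return false;

	*val = v;
	return true;
}
```

Note that the check is ambiguous: a register that legitimately holds
0xffffffff is indistinguishable from a missing device, so a caller still
needs some other way to confirm what happened. Repeating this at every
read is the kind of tentacle I'd like to avoid.
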
One context where this is becoming a real concern is with NVMe drives.
These are SSDs that (may) look like normal 2.5" drives, but use PCIe
rather than SATA or SAS to connect to the host. Since they look like
normal drives, it's natural to put them into hot-pluggable JBODs, but
it turns out we react much worse to PCIe surprise removal than, say,
SAS hotplug.

- R.
_______________________________________________
Ksummit-discuss mailing list
Ksummit-discuss@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss