From: Matthew Wilcox <willy6545@gmail.com>
To: Roland Dreier <roland@kernel.org>
Cc: linux-nvme@lists.infradead.org,
    "ksummit-discuss@lists.linuxfoundation.org" <ksummit-discuss@lists.linuxfoundation.org>
Subject: Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
Date: Fri, 9 May 2014 13:58:03 -0400
Message-ID: <CAFhKne9pHJzsxt8JemvogCRWywU5Z3e2-_JtC9Z9un+RvVP7XQ@mail.gmail.com>
In-Reply-To: <CAG4TOxNJxWLWSZYW313XATtCpV+SM9WrYFJ-V+0Pf-yryLh67g@mail.gmail.com>
I'm hearing a bunch of FUD around NVMe hotplug but precious little in the
way of bug reports! Keith Busch has been doing a stellar job of fixing up
the bugs that he's found, but I have seen precisely zero hotplug bugs
reported to the NVMe mailing list. So put up or shut up.
On 2014-05-09 1:49 PM, "Roland Dreier" <roland@kernel.org> wrote:
> On Thu, May 8, 2014 at 5:37 AM, David Woodhouse <dwmw2@infradead.org>
> wrote:
> > I'd like to have a discussion about handling device errors.
> >
> > IOMMUs are becoming more common, and we've seen some failure modes where
> > we just end up with an endless stream of fault reports from a given
> > device, and the kernel can do nothing else.
> >
> > We may have various options for shutting it up — a PCI function level
> > reset, power cycling the offending device, or maybe just configuring the
> > IOMMU to *ignore* further errors from it, which would at least let the
> > system get on with doing something useful (and if we do, when do we
> > re-enable reporting?).
>
> I think there's a more general problem that's worth talking about
> here. In addition to IOMMU faults, there are lots of other PCI errors
> that can happen, and we have some small number of drivers that have
> been "hardened" to try and recover from these errors. However even
> for these "hardened" drivers it seems pretty easy to hit deadlocks
> when the driver tries to tear down and reinitialize things.
>
> So I wonder if we can do better without proliferating error handling
> tentacles into all sorts of low-level drivers ("did we just read
> 0xffffffff here? how about here? are we in the middle of error
> recovery? how about now?").
>
> One context where this is becoming a real concern is with NVMe drives.
> These are SSDs that (may) look like normal 2.5" drives, but use PCIe
> rather than SATA or SAS to connect to the host. Since they look like
> normal drives, it's natural to put them into hot-pluggable JBODs, but
> it turns out we react much worse to PCIe surprise removal than, say,
> SAS hotplug.
>
> - R.