[Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation


* [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
@ 2014-05-08 12:37 David Woodhouse
  2014-05-08 18:03 ` Bjorn Helgaas
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: David Woodhouse @ 2014-05-08 12:37 UTC (permalink / raw)
  To: ksummit-discuss

I'd like to have a discussion about handling device errors.

IOMMUs are becoming more common, and we've seen some failure modes where
we just end up with an endless stream of fault reports from a given
device, and the kernel can do nothing else.

We may have various options for shutting it up — a PCI function level
reset, power cycling the offending device, or maybe just configuring the
IOMMU to *ignore* further errors from it, which would at least let the
system get on with doing something useful (and if we do, when do we
re-enable reporting?).

But I absolutely don't want us to be implementing policies like that in
an individual IOMMU driver; this needs to be handled by generic device
code. Once upon a time I might have said PCI code, but this is actually
relevant for non-PCI devices too.

I want the IOMMU to report errors, and let the system do the appropriate
thing. Which requires some discussion about what the "appropriate thing"
can be in various circumstances, and indeed what options are available
to us on various platforms.
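
One way to picture such a policy living outside any individual IOMMU driver is the sketch below. It is only an illustration under stated assumptions: the class name `FaultPolicy`, the thresholds and the "log" / "mute" / "drop" actions are all hypothetical and do not correspond to any existing kernel interface. It counts faults per device in a time window, mutes reporting when the count is exceeded, and re-enables reporting after a fixed cool-down, which is one possible answer to the "when do we re-enable reporting?" question.

```python
from dataclasses import dataclass


@dataclass
class FaultPolicy:
    """Hypothetical per-device fault policy (not a real kernel API)."""
    threshold: int = 100    # faults tolerated per window (assumed value)
    window: float = 1.0     # window length in seconds (assumed value)
    mute_for: float = 10.0  # how long reporting stays muted (assumed value)
    _window_start: float = 0.0
    _count: int = 0
    _muted_until: float = 0.0

    def report(self, now: float) -> str:
        """Return the action for a fault observed at time `now`."""
        if now < self._muted_until:
            return "drop"              # reporting is muted, ignore the fault
        if now - self._window_start >= self.window:
            self._window_start = now   # start a fresh counting window
            self._count = 0
        self._count += 1
        if self._count > self.threshold:
            self._muted_until = now + self.mute_for
            self._count = 0
            return "mute"              # caller would isolate or reset here
        return "log"
```

The "mute" result is where a generic layer could pick between the options listed above (function level reset, power cycle, or simply ignoring the device for a while), depending on what the platform offers.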

Participants would be those working with IOMMUs on various platforms,
including Jörg Rödel, myself, and hopefully someone with a fairly
intimate knowledge of EEH as used on POWER systems.

We probably also want KVM folks to weigh in on how, if at all, they'd
want errors on assigned devices to be reported to guests.

I strongly suspect that once we start looking at it, we'll find other
triggers than "IOMMU faults" for starting to isolate and reset
misbehaving devices. Interrupt storms perhaps being one of them — we've
never been particularly robust to those, either.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 12:37 [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation David Woodhouse
@ 2014-05-08 18:03 ` Bjorn Helgaas
  2014-05-08 20:00   ` Rafael J. Wysocki
  2014-05-08 19:56 ` James Bottomley
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2014-05-08 18:03 UTC (permalink / raw)
  To: David Woodhouse; +Cc: ksummit-discuss

On Thu, May 8, 2014 at 6:37 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> I'd like to have a discussion about handling device errors.
>
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.
>
> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).
>
> But I absolutely don't want us to be implementing policies like that in
> an individual IOMMU driver; this needs to be handled by generic device
> code. Once upon a time I might have said PCI code, but this is actually
> relevant for non-PCI devices too.
>
> I want the IOMMU to report errors, and let the system do the appropriate
> thing. Which requires some discussion about what the "appropriate thing"
> can be in various circumstances, and indeed what options are available
> to us on various platforms.

I'm interested in this discussion, too.

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 12:37 [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation David Woodhouse
  2014-05-08 18:03 ` Bjorn Helgaas
@ 2014-05-08 19:56 ` James Bottomley
  2014-05-09  8:55   ` David Woodhouse
  2014-05-09 17:48 ` Roland Dreier
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: James Bottomley @ 2014-05-08 19:56 UTC (permalink / raw)
  To: David Woodhouse; +Cc: ksummit-discuss

On Thu, 2014-05-08 at 13:37 +0100, David Woodhouse wrote:
> I'd like to have a discussion about handling device errors.
> 
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.

This is when the addresses being sent by the bus don't have IOTLB
entries?

> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).
> 
> But I absolutely don't want us to be implementing policies like that in
> an individual IOMMU driver; this needs to be handled by generic device
> code. Once upon a time I might have said PCI code, but this is actually
> relevant for non-PCI devices too.

Right, with my PARISC hat on, our IOMMUs sit adjacent to the CPUs.  The
PCI busses (if we have any) are a couple of layers down.

> I want the IOMMU to report errors, and let the system do the appropriate
> thing. Which requires some discussion about what the "appropriate thing"
> can be in various circumstances, and indeed what options are available
> to us on various platforms.
> 
> Participants would be those working with IOMMUs on various platforms,
> including Jörg Rödel, myself, and hopefully someone with a fairly
> intimate knowledge of EEH as used on POWER systems.
> 
> We probably also want KVM folks to weigh in on how, if at all, they'd
> want errors on assigned devices to be reported to guests.
> 
> I strongly suspect that once we start looking at it, we'll find other
> triggers than "IOMMU faults" for starting to isolate and reset
> misbehaving devices. Interrupt storms perhaps being one of them — we've
> never been particularly robust to those, either.

I'd be interested ... if just to make sure that whatever's agreed to
isn't just Intel-IOMMU-centric.

James

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 18:03 ` Bjorn Helgaas
@ 2014-05-08 20:00   ` Rafael J. Wysocki
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael J. Wysocki @ 2014-05-08 20:00 UTC (permalink / raw)
  To: ksummit-discuss

On Thursday, May 08, 2014 12:03:39 PM Bjorn Helgaas wrote:
> On Thu, May 8, 2014 at 6:37 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> > I'd like to have a discussion about handling device errors.
> >
> > IOMMUs are becoming more common, and we've seen some failure modes where
> > we just end up with an endless stream of fault reports from a given
> > device, and the kernel can do nothing else.
> >
> > We may have various options for shutting it up — a PCI function level
> > reset, power cycling the offending device, or maybe just configuring the
> > IOMMU to *ignore* further errors from it, which would at least let the
> > system get on with doing something useful (and if we do, when do we
> > re-enable reporting?).
> >
> > But I absolutely don't want us to be implementing policies like that in
> > an individual IOMMU driver; this needs to be handled by generic device
> > code. Once upon a time I might have said PCI code, but this is actually
> > relevant for non-PCI devices too.
> >
> > I want the IOMMU to report errors, and let the system do the appropriate
> > thing. Which requires some discussion about what the "appropriate thing"
> > can be in various circumstances, and indeed what options are available
> > to us on various platforms.
> 
> I'm interested in this discussion, too.

Yes, me too.

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 19:56 ` James Bottomley
@ 2014-05-09  8:55   ` David Woodhouse
  2014-05-09 11:31     ` Laurent Pinchart
  0 siblings, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2014-05-09  8:55 UTC (permalink / raw)
  To: James Bottomley; +Cc: ksummit-discuss

On Thu, 2014-05-08 at 12:56 -0700, James Bottomley wrote:
> On Thu, 2014-05-08 at 13:37 +0100, David Woodhouse wrote:
> > I'd like to have a discussion about handling device errors.
> > 
> > IOMMUs are becoming more common, and we've seen some failure modes where
> > we just end up with an endless stream of fault reports from a given
> > device, and the kernel can do nothing else.
> 
> This is when the addresses being sent by the bus don't have IOTLB
> entries?

You speak as if you have a software-filled IOTLB. I'd have phrased that
as "don't have page table entries". But yes, that.

Or they have read-only IOTLB entries, and they're trying to write.
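
The two fault causes mentioned here (no page table entry, or a write to a read-only mapping) can be illustrated with a toy model. This is a minimal sketch, assuming 4 KiB pages and a flat dictionary in place of a real multi-level I/O page table; `check_access` and the result strings are hypothetical names chosen for illustration.

```python
PAGE_SHIFT = 12  # 4 KiB pages, assumed for illustration


def check_access(page_table, iova, is_write):
    """Classify a DMA access against a toy per-device I/O page table.

    page_table maps an I/O page number to a set of permissions,
    for example {"r"} or {"r", "w"}.
    """
    perms = page_table.get(iova >> PAGE_SHIFT)
    if perms is None:
        return "fault: no page table entry"
    needed = "w" if is_write else "r"
    if needed not in perms:
        return "fault: permission denied"
    return "ok"
```

For example, with `{0x10: {"r"}}` as the table, a read at I/O address 0x10000 succeeds, a write to the same page is a permission fault, and any access outside that page is a missing-entry fault.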

And as I said, once we start looking at it I suspect we'll end up
finding other offences that need to be taken into consideration. Which
is why I think this warrants a wider discussion rather than the IOMMU
owners sitting in a darkened room doing it amongst themselves.

> > But I absolutely don't want us to be implementing policies like that in
> > an individual IOMMU driver; this needs to be handled by generic device
> > code. Once upon a time I might have said PCI code, but this is actually
> > relevant for non-PCI devices too.
> 
> Right, with my PARISC hat on, our IOMMUs sit adjacent to the CPUs.  The
> PCI busses (if we have any) are a couple of layers down.

Even the Intel IOMMU can do mappings (and take faults) for ACPI devices,
these days.

> > I want the IOMMU to report errors, and let the system do the appropriate
> > thing. Which requires some discussion about what the "appropriate thing"
> > can be in various circumstances, and indeed what options are available
> > to us on various platforms.
> > 
> > Participants would be those working with IOMMUs on various platforms,
> > including Jörg Rödel, myself, and hopefully someone with a fairly
> > intimate knowledge of EEH as used on POWER systems.

I note that Jörg isn't actually on the nominations list. I think he
should be...

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09  8:55   ` David Woodhouse
@ 2014-05-09 11:31     ` Laurent Pinchart
  2014-05-14  1:28       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Laurent Pinchart @ 2014-05-09 11:31 UTC (permalink / raw)
  To: ksummit-discuss; +Cc: James Bottomley

On Friday 09 May 2014 09:55:14 David Woodhouse wrote:
> On Thu, 2014-05-08 at 12:56 -0700, James Bottomley wrote:
> > On Thu, 2014-05-08 at 13:37 +0100, David Woodhouse wrote:
> > > I'd like to have a discussion about handling device errors.
> > > 
> > > IOMMUs are becoming more common, and we've seen some failure modes where
> > > we just end up with an endless stream of fault reports from a given
> > > device, and the kernel can do nothing else.
> > 
> > This is when the addresses being sent by the bus don't have IOTLB
> > entries?
> 
> You speak as if you have a software-filled IOTLB. I'd have phrased that
> as "don't have page table entries". But yes, that.
> 
> Or they have read-only IOTLB entries, and they're trying to write.

Or they're trying to perform secure access on non-secure IOTLB entries.

I've recently run into IOMMU issues that resulted in endless messages being 
printed to the kernel log, exactly as you've mentioned, and found the 
error reporting mechanisms to be less than adequate.

The problem is twofold: we first need a mechanism to associate errors with 
devices, and then a second mechanism to handle those errors.

I doubt the former could be made completely generic, but we should at least be 
able to implement those mechanisms in subsystem core code. For instance, in 
the IOMMU case, we will need to map I/O VAs to struct device, and I don't want 
to see that being scattered across individual IOMMU drivers or bus master 
drivers. Better locations would be either the IOMMU core or the DMA mapping 
implementation.
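
As a rough sketch of what a central I/O VA to device mapping could look like, assuming non-overlapping ranges and using a sorted list as a stand-in for whatever structure core code would really use (the class `IovaOwnerMap` and its methods are hypothetical, not an existing interface):

```python
import bisect


class IovaOwnerMap:
    """Hypothetical registry of which device owns which I/O virtual range.

    Ranges are assumed not to overlap.
    """

    def __init__(self):
        self._starts = []  # sorted range start addresses
        self._ranges = []  # parallel list of (start, end_exclusive, device)

    def add(self, start, size, device):
        """Record that [start, start + size) is mapped for `device`."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._ranges.insert(i, (start, start + size, device))

    def owner(self, iova):
        """Return the device owning `iova`, or None if nothing maps it."""
        i = bisect.bisect_right(self._starts, iova) - 1
        if i >= 0:
            start, end, device = self._ranges[i]
            if start <= iova < end:
                return device
        return None
```

A fault handler in shared code could then call `owner(faulting_address)` once, instead of each IOMMU or bus master driver keeping its own bookkeeping.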

The latter will likely require a mix of generic code for device isolation 
and/or reset (when possible) and driver-specific code for proper recovery. A 
fast reaction to prevent more faults from being generated should be coupled 
with a slower reaction to fix the actual cause of the problem. I expect the 
problem to be fatal in most cases, and, for IOMMUs again, usually caused by a 
software bug rather than a hardware misbehaviour (although the latter can of 
course happen). From an overall system point of view preventing the denial of 
service that follows such errors (caused by kernel log flooding for instance, 
or by the IOMMU being unable to serve other bus masters) could be our first 
priority.

I'd be interested in taking part in this discussion.

> And as I said, once we start looking at it I suspect we'll end up
> finding other offences that need to be taken into consideration. Which
> is why I think this warrants a wider discussion rather than the IOMMU
> owners sitting in a darkened room doing it amongst themselves.
> 
> > > But I absolutely don't want us to be implementing policies like that in
> > > an individual IOMMU driver; this needs to be handled by generic device
> > > code. Once upon a time I might have said PCI code, but this is actually
> > > relevant for non-PCI devices too.
> > 
> > Right, with my PARISC hat on, our IOMMUs sit adjacent to the CPUs.  The
> > PCI busses (if we have any) are a couple of layers down.
> 
> Even the Intel IOMMU can do mappings (and take faults) for ACPI devices,
> these days.
> 
> > > I want the IOMMU to report errors, and let the system do the appropriate
> > > thing. Which requires some discussion about what the "appropriate thing"
> > > can be in various circumstances, and indeed what options are available
> > > to us on various platforms.
> > > 
> > > Participants would be those working with IOMMUs on various platforms,
> > > including Jörg Rödel, myself, and hopefully someone with a fairly
> > > intimate knowledge of EEH as used on POWER systems.
> 
> I note that Jörg isn't actually on the nominations list. I think he
> should be...

-- 
Regards,

Laurent Pinchart

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 12:37 [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation David Woodhouse
  2014-05-08 18:03 ` Bjorn Helgaas
  2014-05-08 19:56 ` James Bottomley
@ 2014-05-09 17:48 ` Roland Dreier
  2014-05-09 17:58   ` Matthew Wilcox
  2014-05-14  1:40   ` Benjamin Herrenschmidt
  2014-05-09 18:05 ` Will Deacon
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Roland Dreier @ 2014-05-09 17:48 UTC (permalink / raw)
  To: David Woodhouse; +Cc: ksummit-discuss

On Thu, May 8, 2014 at 5:37 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> I'd like to have a discussion about handling device errors.
>
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.
>
> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).

I think there's a more general problem that's worth talking about
here.  In addition to IOMMU faults, there are lots of other PCI errors
that can happen, and we have some small number of drivers that have
been "hardened" to try and recover from these errors.  However even
for these "hardened" drivers it seems pretty easy to hit deadlocks
when the driver tries to tear down and reinitialize things.

So I wonder if we can do better without proliferating error handling
tentacles into all sorts of low-level drivers ("did we just read
0xffffffff here?  how about here?  are we in the middle of error
recovery?  how about now?").
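
The kind of check being described looks roughly like the sketch below. It relies on the fact that a read from a PCI device that has gone away typically returns all ones; the wrapper `read_reg_checked`, the `DeviceGone` exception and the `read32` callback are hypothetical names used only for illustration. Note that all ones can also be a legitimate register value, which is part of why sprinkling such checks through a driver is unattractive.

```python
ALL_ONES_32 = 0xFFFFFFFF


class DeviceGone(Exception):
    """Raised when a register read suggests the device is no longer there."""


def read_reg_checked(read32, offset):
    """Read a 32-bit register via `read32` and treat all ones as a dead device.

    `read32` is any callable taking a register offset and returning an int.
    """
    value = read32(offset)
    if value == ALL_ONES_32:
        raise DeviceGone(f"read of offset {offset:#x} returned all ones")
    return value
```

Every register access in the driver would need to go through a wrapper like this (or an equivalent inline test) for the approach to be complete, which is the proliferation the paragraph above is worried about.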

One context where this is becoming a real concern is with NVMe drives.
 These are SSDs that (may) look like normal 2.5" drives, but use PCIe
rather than SATA or SAS to connect to the host.  Since they look like
normal drives, it's natural to put them into hot-pluggable JBODs, but
it turns out we react much worse to PCIe surprise removal than, say,
SAS hotplug.

 - R.

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 17:48 ` Roland Dreier
@ 2014-05-09 17:58   ` Matthew Wilcox
  2014-05-09 18:08     ` Roland Dreier
  2014-05-14  1:40   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 42+ messages in thread
From: Matthew Wilcox @ 2014-05-09 17:58 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-nvme, ksummit-discuss

I'm hearing a bunch of FUD around NVMe hotplug but precious little in the
way of bug reports! Keith Busch has been doing a stellar job of fixing up
the bugs that he's found, but I have seen precisely zero hotplug bugs
reported to the NVMe mailing list. So put up or shut up.
 On 2014-05-09 1:49 PM, "Roland Dreier" <roland@kernel.org> wrote:

> On Thu, May 8, 2014 at 5:37 AM, David Woodhouse <dwmw2@infradead.org>
> wrote:
> > I'd like to have a discussion about handling device errors.
> >
> > IOMMUs are becoming more common, and we've seen some failure modes where
> > we just end up with an endless stream of fault reports from a given
> > device, and the kernel can do nothing else.
> >
> > We may have various options for shutting it up — a PCI function level
> > reset, power cycling the offending device, or maybe just configuring the
> > IOMMU to *ignore* further errors from it, which would at least let the
> > system get on with doing something useful (and if we do, when do we
> > re-enable reporting?).
>
> I think there's a more general problem that's worth talking about
> here.  In addition to IOMMU faults, there are lots of other PCI errors
> that can happen, and we have some small number of drivers that have
> been "hardened" to try and recover from these errors.  However even
> for these "hardened" drivers it seems pretty easy to hit deadlocks
> when the driver tries to tear down and reinitialize things.
>
> So I wonder if we can do better without proliferating error handling
> tentacles into all sorts of low-level drivers ("did we just read
> 0xffffffff here?  how about here?  are we in the middle of error
> recovery?  how about now?").
>
> One context where this is becoming a real concern is with NVMe drives.
>  These are SSDs that (may) look like normal 2.5" drives, but use PCIe
> rather than SATA or SAS to connect to the host.  Since they look like
> normal drives, it's natural to put them into hot-pluggable JBODs, but
> it turns out we react much worse to PCIe surprise removal than, say,
> SAS hotplug.
>
>  - R.

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 12:37 [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation David Woodhouse
                   ` (2 preceding siblings ...)
  2014-05-09 17:48 ` Roland Dreier
@ 2014-05-09 18:05 ` Will Deacon
  2014-05-12 15:03   ` Joerg Roedel
  2014-05-09 19:37 ` Josh Triplett
  2014-05-14  1:24 ` Benjamin Herrenschmidt
  5 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2014-05-09 18:05 UTC (permalink / raw)
  To: David Woodhouse; +Cc: ksummit-discuss

Hi David,

On Thu, May 08, 2014 at 01:37:03PM +0100, David Woodhouse wrote:
> I'd like to have a discussion about handling device errors.
> 
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.
> 
> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).

There's also the fun of non-PCI devices, where even if you can kill the
offending device, there's not a specified way to ensure that it no longer
has transactions in flight. Also, the fault reports have to go somewhere,
so queues can fill up etc. etc.

> But I absolutely don't want us to be implementing policies like that in
> an individual IOMMU driver; this needs to be handled by generic device
> code. Once upon a time I might have said PCI code, but this is actually
> relevant for non-PCI devices too.
> 
> I want the IOMMU to report errors, and let the system do the appropriate
> thing. Which requires some discussion about what the "appropriate thing"
> can be in various circumstances, and indeed what options are available
> to us on various platforms.
> 
> Participants would be those working with IOMMUs on various platforms,
> including Jörg Rödel, myself, and hopefully someone with a fairly
> intimate knowledge of EEH as used on POWER systems.

I'd certainly be interested in this from the ARM side (I'm involved in the
architecture of our next SMMU and we've discussed this a lot internally).

Will

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 17:58   ` Matthew Wilcox
@ 2014-05-09 18:08     ` Roland Dreier
  0 siblings, 0 replies; 42+ messages in thread
From: Roland Dreier @ 2014-05-09 18:08 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-nvme, ksummit-discuss

On Fri, May 9, 2014 at 10:58 AM, Matthew Wilcox <willy6545@gmail.com> wrote:
> I'm hearing a bunch of FUD around NVMe hotplug but precious little in the
> way of bug reports! Keith Busch has been doing a stellar job of fixing up
> the bugs that he's found, but I have seen precisely zero hotplug bugs
> reported to the NVMe mailing list. So put up or shut up.

Fair enough, not trying to spread FUD here.  The issues we've seen so
far have been more around platforms not configured to report PCIe
errors properly vs. the NVMe driver itself.

 - R.

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 12:37 [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation David Woodhouse
                   ` (3 preceding siblings ...)
  2014-05-09 18:05 ` Will Deacon
@ 2014-05-09 19:37 ` Josh Triplett
  2014-05-09 19:44   ` David Woodhouse
                     ` (2 more replies)
  2014-05-14  1:24 ` Benjamin Herrenschmidt
  5 siblings, 3 replies; 42+ messages in thread
From: Josh Triplett @ 2014-05-09 19:37 UTC (permalink / raw)
  To: David Woodhouse; +Cc: ksummit-discuss

On Thu, May 08, 2014 at 01:37:03PM +0100, David Woodhouse wrote:
> But I absolutely don't want us to be implementing policies like that in
> an individual IOMMU driver; this needs to be handled by generic device
> code. Once upon a time I might have said PCI code, but this is actually
> relevant for non-PCI devices too.
[...]
> I strongly suspect that once we start looking at it, we'll find other
> triggers than "IOMMU faults" for starting to isolate and reset
> misbehaving devices. Interrupt storms perhaps being one of them — we've
> never been particularly robust to those, either.

I'm interested in a related topic: we should systematically use IOMMUs
and similar hardware features to protect against buggy or *malicious*
hardware devices.  Consider a laptop with an ExpressCard port: plug in a
device and you have full PCIe access.  (The same goes for other systems
if you open up the case.)  We should ensure that devices with no device
driver have zero privileges, and devices with a device driver have
carefully whitelisted privileges.

- Josh Triplett

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 19:37 ` Josh Triplett
@ 2014-05-09 19:44   ` David Woodhouse
  2014-05-09 19:53   ` Roland Dreier
  2014-05-14  1:42   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 42+ messages in thread
From: David Woodhouse @ 2014-05-09 19:44 UTC (permalink / raw)
  To: Josh Triplett; +Cc: ksummit-discuss


> I'm interested in a related topic: we should systematically use IOMMUs
> and similar hardware features to protect against buggy or *malicious*
> hardware devices.  Consider a laptop with an ExpressCard port: plug in a
> device and you have full PCIe access.  (The same goes for other systems
> if you open up the case.)  We should ensure that devices with no device
> driver have zero privileges, and devices with a device driver have
> carefully whitelisted privileges.

That is precisely what we do by default when an IOMMU is enabled.

-- 
dwmw2

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 19:37 ` Josh Triplett
  2014-05-09 19:44   ` David Woodhouse
@ 2014-05-09 19:53   ` Roland Dreier
  2014-05-09 20:13     ` Luck, Tony
  2014-05-14  1:43     ` Benjamin Herrenschmidt
  2014-05-14  1:42   ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 42+ messages in thread
From: Roland Dreier @ 2014-05-09 19:53 UTC (permalink / raw)
  To: Josh Triplett; +Cc: ksummit-discuss

On Fri, May 9, 2014 at 12:37 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> I'm interested in a related topic: we should systematically use IOMMUs
> and similar hardware features to protect against buggy or *malicious*
> hardware devices.  Consider a laptop with an ExpressCard port: plug in a
> device and you have full PCIe access.  (The same goes for other systems
> if you open up the case.)  We should ensure that devices with no device
> driver have zero privileges, and devices with a device driver have
> carefully whitelisted privileges.

Stuff without a device driver should be OK, since we don't turn on any
bits in the PCI command register until pci_enable_device().  So the
device can't be a bus master until someone claims it.
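
For reference, the bit in question is Bus Master Enable, bit 2 of the 16-bit command register at offset 0x04 in PCI configuration space. The sketch below checks it in a raw configuration space dump; the function name is hypothetical, and it assumes the caller already has the bytes (on Linux these can be read from the device's `config` file in sysfs).

```python
import struct

PCI_COMMAND = 0x04            # offset of the command register
PCI_COMMAND_MASTER = 0x0004   # Bus Master Enable bit


def bus_master_enabled(config_space):
    """Return True if Bus Master Enable is set in a config space dump.

    `config_space` is a bytes-like object holding at least the first
    6 bytes of the device's PCI configuration space (little-endian).
    """
    (command,) = struct.unpack_from("<H", config_space, PCI_COMMAND)
    return bool(command & PCI_COMMAND_MASTER)
```

A device whose command register has this bit clear should not be able to initiate DMA, which is the protection being relied on here for unclaimed devices.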

For devices with a driver, I guess it couldn't hurt.  But my wifi
adapter can already sniff and modify all my network traffic, etc.

I do agree that it's a bit sad that the current state of VT-d is such
that distros don't use it by default.

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 19:53   ` Roland Dreier
@ 2014-05-09 20:13     ` Luck, Tony
  2014-05-09 20:19       ` James Bottomley
  2014-05-14  1:43     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 42+ messages in thread
From: Luck, Tony @ 2014-05-09 20:13 UTC (permalink / raw)
  To: Roland Dreier, Josh Triplett; +Cc: ksummit-discuss

On Fri, May 9, 2014 at 12:37 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> I'm interested in a related topic: we should systematically use IOMMUs
> and similar hardware features to protect against buggy or *malicious*
> hardware devices

Defending against buggy hardware is interesting from a RAS perspective.
You don't want a card with a stuck address line scribbling on memory
that you didn't want it to touch.

-Tony
 

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 20:13     ` Luck, Tony
@ 2014-05-09 20:19       ` James Bottomley
  2014-05-10  1:09         ` Laurent Pinchart
                           ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: James Bottomley @ 2014-05-09 20:19 UTC (permalink / raw)
  To: Luck, Tony; +Cc: ksummit-discuss

On Fri, 2014-05-09 at 20:13 +0000, Luck, Tony wrote:
> On Fri, May 9, 2014 at 12:37 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> > I'm interested in a related topic: we should systematically use IOMMUs
> > and similar hardware features to protect against buggy or *malicious*
> > hardware devices
> 
> Defending against buggy hardware is interesting from a RAS perspective.
> You don't want a card with a stuck address line scribbling on memory
> that you didn't want it to touch.

But for a laptop or desktop kernel, how far do we want to go?  In
theory, once the iommu is turned on, it corrals the device, since access
to non programmed addresses (those without IOTLB entries) produces a
fault.  Is there anything extra we need to do beyond turning on the
IOMMU?

James

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 20:19       ` James Bottomley
@ 2014-05-10  1:09         ` Laurent Pinchart
  2014-05-11 22:43           ` Daniel Vetter
  2014-05-12 14:58         ` Joerg Roedel
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Laurent Pinchart @ 2014-05-10  1:09 UTC (permalink / raw)
  To: ksummit-discuss; +Cc: James Bottomley

On Friday 09 May 2014 13:19:08 James Bottomley wrote:
> On Fri, 2014-05-09 at 20:13 +0000, Luck, Tony wrote:
> > On Fri, May 9, 2014 at 12:37 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> > > I'm interested in a related topic: we should systematically use IOMMUs
> > > and similar hardware features to protect against buggy or *malicious*
> > > hardware devices
> > 
> > Defending against buggy hardware is interesting from a RAS perspective.
> > You don't want a card with a stuck address line scribbling on memory
> > that you didn't want it to touch.
> 
> But for a laptop or desktop kernel, how far do we want to go?  In
> theory, once the iommu is turned on, it corrals the device, since access
> to non programmed addresses (those without IOTLB entries) produces a
> fault.  Is there anything extra we need to do beyond turning on the
> IOMMU?

We need a mechanism to correctly report and handle the IOMMU faults, otherwise 
a misbehaving device could generate interrupt storms and cause a denial of 
service.

-- 
Regards,

Laurent Pinchart

* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-10  1:09         ` Laurent Pinchart
@ 2014-05-11 22:43           ` Daniel Vetter
  2014-05-12 15:07             ` Joerg Roedel
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2014-05-11 22:43 UTC (permalink / raw)
  To: Laurent Pinchart; +Cc: James Bottomley, ksummit-discuss

On Sat, May 10, 2014 at 3:09 AM, Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
>
> We need a mechanism to correctly report and handle the IOMMU faults, otherwise
> a misbehaving device could generate interrupt storms and cause a denial of
> service.

Lack of this is a big pain for development since, at least in my
experience hacking around on gpu drivers, iommu fault storms happen very
often. And often the load is so severe that you can't reload the driver
even if that would recover. Which in practice means that none of my
development systems have the iommu enabled because it's too often too
much pain. Which means regressions often slip into -rc or even release
kernels, reinforcing distros' decision to just not enable iommus by
default.

So I think having some iommu storm handling (like we have for
interrupts in general and a lot of other things) would go a long way
towards the goal of enabling iommus everywhere.
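
A minimal sketch of what such storm handling might look like, assuming a
fixed accounting window and a per-device fault counter. Nothing here is
existing kernel API: struct fault_storm, the two tunables and both
functions are hypothetical names, and the threshold values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables, picked only for illustration. */
#define FAULT_STORM_WINDOW_NS 1000000000ull /* 1 s accounting window */
#define FAULT_STORM_THRESHOLD 100u          /* faults tolerated per window */

/* Hypothetical per-device accounting state. */
struct fault_storm {
	uint64_t window_start_ns; /* start of the current window */
	unsigned int count;       /* faults seen in the current window */
	bool muted;               /* true once the device is being ignored */
};

/*
 * Account for one fault at time now_ns. Returns true if the fault should
 * still be reported, false once the device is considered to be storming.
 */
static bool fault_storm_note(struct fault_storm *fs, uint64_t now_ns)
{
	assert(fs);

	if (fs->muted)
		return false;

	/* Start a fresh window once the old one has expired. */
	if (now_ns - fs->window_start_ns >= FAULT_STORM_WINDOW_NS) {
		fs->window_start_ns = now_ns;
		fs->count = 0;
	}

	if (++fs->count > FAULT_STORM_THRESHOLD) {
		fs->muted = true; /* here the IOMMU would be told to stay quiet */
		return false;
	}
	return true;
}

/* "Unignore" the device, e.g. when a driver is bound to it again. */
static void fault_storm_rearm(struct fault_storm *fs, uint64_t now_ns)
{
	assert(fs);
	fs->window_start_ns = now_ns;
	fs->count = 0;
	fs->muted = false;
}
```

The muted state is sticky on purpose: once a device has stormed, it stays
ignored until something explicitly re-arms it, so a device that keeps
faulting cannot flood the log again on its own.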
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 20:19       ` James Bottomley
  2014-05-10  1:09         ` Laurent Pinchart
@ 2014-05-12 14:58         ` Joerg Roedel
  2014-05-13 14:37         ` David Woodhouse
  2014-05-14  1:46         ` Benjamin Herrenschmidt
  3 siblings, 0 replies; 42+ messages in thread
From: Joerg Roedel @ 2014-05-12 14:58 UTC (permalink / raw)
  To: James Bottomley; +Cc: ksummit-discuss

On Fri, May 09, 2014 at 01:19:08PM -0700, James Bottomley wrote:
> On Fri, 2014-05-09 at 20:13 +0000, Luck, Tony wrote:
> > Defending against buggy hardware is interesting from a RAS perspective.
> > You don't want a card with a stuck address line scribbling on memory
> > that you didn't want it to touch.
> 
> But for a laptop or desktop kernel, how far do we want to go?  In
> theory, once the iommu is turned on, it corrals the device, since access
> to non programmed addresses (those without IOTLB entries) produces a
> fault.  Is there anything extra we need to do beyond turning on the
> IOMMU?

Especially for laptops and desktops proper fault handling is important.
Newer GPUs can use the IOMMU to directly access process address spaces
and support demand paging and CPU page-table layouts. Support for these
features in Linux is already being worked on, so handling faults in a
meaningful way is important there too.


	Joerg


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 18:05 ` Will Deacon
@ 2014-05-12 15:03   ` Joerg Roedel
  0 siblings, 0 replies; 42+ messages in thread
From: Joerg Roedel @ 2014-05-12 15:03 UTC (permalink / raw)
  To: Will Deacon; +Cc: ksummit-discuss

On Fri, May 09, 2014 at 07:05:10PM +0100, Will Deacon wrote:
> On Thu, May 08, 2014 at 01:37:03PM +0100, David Woodhouse wrote:
> > We may have various options for shutting it up — a PCI function level
> > reset, power cycling the offending device, or maybe just configuring the
> > IOMMU to *ignore* further errors from it, which would at least let the
> > system get on with doing something useful (and if we do, when do we
> > re-enable reporting?).
> 
> There's also the fun of non-PCI devices, where even if you can kill the
> offending device, there's not a specified way to ensure that it no longer
> has transactions in flight. Also, the fault reports have to go somewhere,
> so queues can fill up etc. etc.

I am of course also interested in this discussion. Fault handling is
currently implemented per IOMMU driver. There is no reason we should not
unify the way we report faults and handle misbehaving devices.

> I'd certainly be interested in this from the ARM side (I'm involved in the
> architecture of our next SMMU and we've discussed this a lot internally).

Interesting. I strongly hope the next SMMU will still work with the
current in-kernel SMMU driver :)


	Joerg


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-11 22:43           ` Daniel Vetter
@ 2014-05-12 15:07             ` Joerg Roedel
  2014-05-12 15:35               ` Daniel Vetter
  0 siblings, 1 reply; 42+ messages in thread
From: Joerg Roedel @ 2014-05-12 15:07 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 12:43:09AM +0200, Daniel Vetter wrote:
> So I think having some iommu storm handling (like we have for
> interrupts in general and a lot of other things) would go a long way
> towards the goal of enabling iommus everywhere.

Right, the developer use-case also needs to be taken into account. We
could easily ignore a device after it did something wrong to get rid of
io-page-fault or interrupt storms. But we also need a way to tell the
kernel to unignore the device later :)

	
	Joerg


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 15:07             ` Joerg Roedel
@ 2014-05-12 15:35               ` Daniel Vetter
  2014-05-12 16:16                 ` Andy Lutomirski
  2014-05-12 16:26                 ` Joerg Roedel
  0 siblings, 2 replies; 42+ messages in thread
From: Daniel Vetter @ 2014-05-12 15:35 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 5:07 PM, Joerg Roedel <joro@8bytes.org> wrote:
> On Mon, May 12, 2014 at 12:43:09AM +0200, Daniel Vetter wrote:
>> So I think having some iommu storm handling (like we have for
>> interrupts in general and a lot of other things) would go a long way
>> towards the goal of enabling iommus everywhere.
>
> Right, the developer use-case also needs to be taken into account. We
> could easily ignore a device after it did something wrong to get rid of
> io-page-fault or interrupt storms. But we also need a way to tell the
> kernel to unignore the device later :)

A disable/enable cycle of the pci bus master setting should be a good
enough signal? Presuming you can say for sure which device is doing
the offending dma transactions ofc ... Or maybe we should just be
optimists and re-enable the IOMMU if _any_ child device gets
re-enabled (or bus master re-enabled for pci) in the hopes that the
developers just reloaded the driver. Worst case the storm handling
will kick in again shortly.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 15:35               ` Daniel Vetter
@ 2014-05-12 16:16                 ` Andy Lutomirski
  2014-05-12 16:28                   ` Joerg Roedel
  2014-05-12 17:04                   ` Daniel Vetter
  2014-05-12 16:26                 ` Joerg Roedel
  1 sibling, 2 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-05-12 16:16 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 8:35 AM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Mon, May 12, 2014 at 5:07 PM, Joerg Roedel <joro@8bytes.org> wrote:
>> On Mon, May 12, 2014 at 12:43:09AM +0200, Daniel Vetter wrote:
>>> So I think having some iommu storm handling (like we have for
>>> interrupts in general and a lot of other things) would go a long way
>>> towards the goal of enabling iommus everywhere.
>>
>> Right, the developer use-case also needs to be taken into account. We
>> could easily ignore a device after it did something wrong to get rid of
>> io-page-fault or interrupt storms. But we also need a way to tell the
>> kernel to unignore the device later :)
>
> A disable/enable cycle of the pci bus master setting should be a good
> enough signal? Presuming you can say for sure which device is doing
> the offending dma transactions ofc ... Or maybe we should just be
> optimists and re-enable the IOMMU if _any_ child device gets
> re-enabled (or bus master re-enabled for pci) in the hopes that the
> developers just reloaded the driver. Worst case the storm handling
> will kick in again shortly.

Just to check: are you talking about disabling the IOMMU if there's a
fault storm or disabling reporting of IOMMU faults?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 15:35               ` Daniel Vetter
  2014-05-12 16:16                 ` Andy Lutomirski
@ 2014-05-12 16:26                 ` Joerg Roedel
  1 sibling, 0 replies; 42+ messages in thread
From: Joerg Roedel @ 2014-05-12 16:26 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 05:35:15PM +0200, Daniel Vetter wrote:
> A disable/enable cycle of the pci bus master setting should be a good
> enough signal? Presuming you can say for sure which device is doing
> the offending dma transactions ofc ... Or maybe we should just be
> optimists and re-enable the IOMMU if _any_ child device gets
> re-enabled (or bus master re-enabled for pci) in the hopes that the
> developers just reloaded the driver. Worst case the storm handling
> will kick in again shortly.

The PCI bus master setting is specific to the PCI bus, and not all IOMMUs
Linux supports are for PCI. So a new driver-bind event for a device is
probably a more generic signal.

Back to PCI, the right way to handle faulty legacy 32 bit PCI devices
needs to be discussed. If any of those devices goes crazy the isolation
will hit all devices on the same bus. A re-bind signal for a single
device on that bus is not a good enough signal so we have to keep it
isolated even then.
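
To illustrate the shared-bus case, here is a small sketch in which
isolation applies to a whole group of devices and a single re-bind does
not lift it. The policy of lifting isolation only once every device in
the group has been re-bound is an assumption made for the example, not
something proposed in this thread, and struct iso_group and its helpers
are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ISO_GROUP_MAX_DEVS 32u

/* Hypothetical: devices the IOMMU cannot tell apart (e.g. one legacy bus). */
struct iso_group {
	unsigned int ndevs;    /* number of devices in the group */
	uint32_t rebound_mask; /* bit i set: device i re-bound since isolation */
	bool isolated;         /* DMA from the whole group is blocked */
};

/* Isolate the whole group; all earlier re-bind events are forgotten. */
static void iso_group_isolate(struct iso_group *g)
{
	assert(g && g->ndevs > 0 && g->ndevs <= ISO_GROUP_MAX_DEVS);
	g->isolated = true;
	g->rebound_mask = 0;
}

/*
 * Note a driver-bind event for device dev. Returns true only if this
 * event lifted the isolation, i.e. every device has now been re-bound.
 */
static bool iso_group_driver_bound(struct iso_group *g, unsigned int dev)
{
	uint32_t all;

	assert(g && dev < g->ndevs);

	if (!g->isolated)
		return false;

	g->rebound_mask |= UINT32_C(1) << dev;

	/* Avoid shifting by 32, which is undefined for a 32-bit type. */
	all = g->ndevs == 32 ? UINT32_MAX : (UINT32_C(1) << g->ndevs) - 1;
	if (g->rebound_mask != all)
		return false; /* some device may still be the faulty one */

	g->isolated = false;
	return true;
}
```

With a single-device group this degenerates to the simple case where one
driver-bind event is enough to re-enable the device.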

	Joerg


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 16:16                 ` Andy Lutomirski
@ 2014-05-12 16:28                   ` Joerg Roedel
  2014-05-12 16:59                     ` Laurent Pinchart
  2014-05-12 17:11                     ` Daniel Vetter
  2014-05-12 17:04                   ` Daniel Vetter
  1 sibling, 2 replies; 42+ messages in thread
From: Joerg Roedel @ 2014-05-12 16:28 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 09:16:11AM -0700, Andy Lutomirski wrote:
> On Mon, May 12, 2014 at 8:35 AM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> Just to check: are you talking about disabling the IOMMU if there's a
> fault storm or disabling reporting of IOMMU faults?

Probably about disabling the reporting of IOMMU faults. An IOMMU that is
used for DMA-API mappings can not be disabled at runtime in a safe way.


	Joerg


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 16:28                   ` Joerg Roedel
@ 2014-05-12 16:59                     ` Laurent Pinchart
  2014-05-12 17:15                       ` Joerg Roedel
  2014-05-12 17:11                     ` Daniel Vetter
  1 sibling, 1 reply; 42+ messages in thread
From: Laurent Pinchart @ 2014-05-12 16:59 UTC (permalink / raw)
  To: ksummit-discuss; +Cc: James Bottomley

On Monday 12 May 2014 18:28:14 Joerg Roedel wrote:
> On Mon, May 12, 2014 at 09:16:11AM -0700, Andy Lutomirski wrote:
> > On Mon, May 12, 2014 at 8:35 AM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > Just to check: are you talking about disabling the IOMMU if there's a
> > fault storm or disabling reporting of IOMMU faults?
> 
> Probably about disabling the reporting of IOMMU faults. An IOMMU that is
> used for DMA-API mappings can not be disabled at runtime in a safe way.

If possible I'd like to prevent the fault from being generated instead of
just not reporting it. As long as the bus master generates bad requests,
even if we don't report the fault to upper layers, the IOMMU fault
interrupts that will occur will hurt. When the hardware supports ignoring
requests (either globally or from a particular bus master) that feature
should be used, much like we disable interrupts at the IRQ controller when
an interrupt storm is detected.

-- 
Regards,

Laurent Pinchart


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 16:16                 ` Andy Lutomirski
  2014-05-12 16:28                   ` Joerg Roedel
@ 2014-05-12 17:04                   ` Daniel Vetter
  2014-05-13 11:27                     ` David Woodhouse
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2014-05-12 17:04 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 6:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, May 12, 2014 at 8:35 AM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>> On Mon, May 12, 2014 at 5:07 PM, Joerg Roedel <joro@8bytes.org> wrote:
>>> On Mon, May 12, 2014 at 12:43:09AM +0200, Daniel Vetter wrote:
>>>> So I think having some iommu storm handling (like we have for
>>>> interrupts in general and a lot of other things) would go a long way
>>>> towards the goal of enabling iommus everywhere.
>>>
>>> Right, the developer use-case also needs to be taken into account. We
>>> could easily ignore a device after it did something wrong to get rid of
>>> io-page-fault or interrupt storms. But we also need a way to tell the
>>> kernel to unignore the device later :)
>>
>> A disable/enable cycle of the pci bus master setting should be a good
>> enough signal? Presuming you can say for sure which device is doing
>> the offending dma transactions ofc ... Or maybe we should just be
>> optimists and re-enable the IOMMU if _any_ child device gets
>> re-enabled (or bus master re-enabled for pci) in the hopes that the
>> developers just reloaded the driver. Worst case the storm handling
>> will kick in again shortly.
>
> Just to check: are you talking about disabling the IOMMU if there's a
> fault storm or disabling reporting of IOMMU faults?

Re-enabling of the IOMMU after it was completely shut off to isolate a
fault storm from a rogue device. Since if I as a developer still have
to reboot if I wreak havoc in my driver it's only marginally better
than a box that went down in an iommu page fault storm. But if I can
just reload the driver (with the bug fixed) and get back a working
device because the IOMMU was re-enabled then that would help. Not
sure yet how feasible this really is.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 16:28                   ` Joerg Roedel
  2014-05-12 16:59                     ` Laurent Pinchart
@ 2014-05-12 17:11                     ` Daniel Vetter
  2014-05-12 17:40                       ` Joerg Roedel
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2014-05-12 17:11 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 6:28 PM, Joerg Roedel <joro@8bytes.org> wrote:
> On Mon, May 12, 2014 at 09:16:11AM -0700, Andy Lutomirski wrote:
>> On Mon, May 12, 2014 at 8:35 AM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>> Just to check: are you talking about disabling the IOMMU if there's a
>> fault storm or disabling reporting of IOMMU faults?
>
> Probably about disabling the reporting of IOMMU faults. An IOMMU that is
> used for DMA-API mappings can not be disabled at runtime in a safe way.

I was actually thinking of fully disabling the IOMMU if it only has
one child device to isolate the possible damage. But maybe we need a
bit more cleverness and a driver notifier. In drm/i915 we could use
that to declare the gpu wedged, which should be about the optimal
outcome:
- We can do that from any atomic context.
- It will stop userspace from submitting more commands, and userspace
falls back to software rendering if this happens.
- Kernel modeset should keep on working, increasing chances that the
user/developer can grab crucial information from the live system.

I think we'd need to play around with some real bugs to know what will
actually work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 16:59                     ` Laurent Pinchart
@ 2014-05-12 17:15                       ` Joerg Roedel
  0 siblings, 0 replies; 42+ messages in thread
From: Joerg Roedel @ 2014-05-12 17:15 UTC (permalink / raw)
  To: Laurent Pinchart; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 06:59:34PM +0200, Laurent Pinchart wrote:
> On Monday 12 May 2014 18:28:14 Joerg Roedel wrote:
> > Probably about disabling the reporting of IOMMU faults. An IOMMU that is
> > used for DMA-API mappings can not be disabled at runtime in a safe way.
> 
> If possible I'd like to prevent the fault from being generated instead of
> just not reporting it. As long as the bus master generates bad requests,
> even if we don't report the fault to upper layers, the IOMMU fault
> interrupts that will occur will hurt. When the hardware supports ignoring
> requests (either globally or from a particular bus master) that feature
> should be used, much like we disable interrupts at the IRQ controller when
> an interrupt storm is detected.

Yes, when a device is isolated by the IOMMU it might create a
lot of faults. These should be blocked in the IOMMU hardware already. If
this is not supported the whole isolation is useless because the
CPU is still busy handling IOMMU interrupts.


	Joerg


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 17:11                     ` Daniel Vetter
@ 2014-05-12 17:40                       ` Joerg Roedel
  2014-05-13 10:06                         ` Daniel Vetter
  0 siblings, 1 reply; 42+ messages in thread
From: Joerg Roedel @ 2014-05-12 17:40 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 07:11:46PM +0200, Daniel Vetter wrote:
> I was actually thinking of fully disabling the IOMMU if it only has
> one child device to isolate the possible damage.

If you disable the IOMMU you also disable the protection from the child.
This also changes the address space of the device by disabling the IOTLB
and might make the device overwrite random memory.

> But maybe we need a bit more cleverness and a driver notifier. In
> drm/i915 we could use that to declare the gpu wedged, which should be
> about the optimal outcome:
> - We can do that from any atomic context.
> - It will stop userspace from submitting more commands, and userspace
> falls back to software rendering if this happens.
> - Kernel modeset should keep on working, increasing chances that the
> user/developer can grab crucial information from the live system.
> 
> I think we'd need to play around with some real bugs to know what will
> actually work.

Sure. What we can provide from the IOMMU side is to disable the faults
and/or isolate the device so that it can't harm the system anymore.


	Joerg


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 17:40                       ` Joerg Roedel
@ 2014-05-13 10:06                         ` Daniel Vetter
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2014-05-13 10:06 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: James Bottomley, ksummit-discuss

On Mon, May 12, 2014 at 7:40 PM, Joerg Roedel <joro@8bytes.org> wrote:
> On Mon, May 12, 2014 at 07:11:46PM +0200, Daniel Vetter wrote:
>> I was actually thinking of fully disabling the IOMMU if it only has
>> one child device to isolate the possible damage.
>
> If you disable the IOMMU you also disable the protection from the child.
> This also changes the address space of the device by disabling the IOTLB
> and might make the device overwrite random memory.

Oh, I think I'm using confusing language here. By disable I mean fully
isolate the device by dropping all dma silently onto the floor.
Disabling the iommu as in allowing the device direct access to the
unremapped memory so it can scribble all over is ofc not what I want,
ever ;-) So maybe I should say "disable all DMA by isolating the
device".
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-12 17:04                   ` Daniel Vetter
@ 2014-05-13 11:27                     ` David Woodhouse
  2014-05-13 17:25                       ` Daniel Vetter
  2014-05-14  1:50                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 42+ messages in thread
From: David Woodhouse @ 2014-05-13 11:27 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: James Bottomley, ksummit-discuss

On Mon, 2014-05-12 at 19:04 +0200, Daniel Vetter wrote:
> On Mon, May 12, 2014 at 6:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > Just to check: are you talking about disabling the IOMMU if there's a
> > fault storm or disabling reporting of IOMMU faults?
> 
> Re-enabling of the IOMMU after it was completely shut off to isolate a
> fault storm from a rogue device. Since if I as a developer still have
> to reboot if I wreak havoc in my driver it's only marginally better
> than a box that went down in an iommu page fault storm. But if I can
> just reload the driver (with the bug fixed) and get back a working
> device because the IOMMU was re-enabled then that would help. Not
> sure yet how feasible this really is.

You probably don't want to completely isolate it in that case. If it's
doing some bad DMA *and* it's also doing some good DMA to display its
framebuffer, why stop the latter?

The Intel IOMMU at least can be configured to avoid reporting faults for
a given device (well, requester-id). So valid transactions still happen,
while invalid transactions are still blocked. But silently, without
bothering the host with the details and causing a fault-IRQ storm.
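
The difference between "blocked and reported" and "blocked silently" can
be shown with a toy model. This is only a sketch of the behaviour
described above, not the VT-d programming interface: struct toy_domain,
its quiet flag and toy_dma_access() are hypothetical, and a flat array of
page frame numbers stands in for real page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum dma_result { DMA_OK, DMA_BLOCKED };

/* Toy per-requester state; all names here are hypothetical. */
struct toy_domain {
	const uint64_t *mapped_pfns;   /* page frames with valid mappings */
	size_t nr_mapped;
	bool quiet;                    /* block bad DMA without reporting it */
	unsigned long faults_reported; /* stands in for fault interrupts */
};

/* Model one DMA access to iova, assuming 4 KiB pages. */
static enum dma_result toy_dma_access(struct toy_domain *d, uint64_t iova)
{
	uint64_t pfn = iova >> 12;
	size_t i;

	assert(d);

	for (i = 0; i < d->nr_mapped; i++)
		if (d->mapped_pfns[i] == pfn)
			return DMA_OK; /* valid transactions still happen */

	if (!d->quiet)
		d->faults_reported++; /* would bother the host with a fault */

	return DMA_BLOCKED; /* invalid transactions are blocked either way */
}
```

Setting quiet changes only what the host hears about; the set of accesses
that succeed is identical in both modes, which is why the framebuffer scan
in the example above would keep working.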

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 20:19       ` James Bottomley
  2014-05-10  1:09         ` Laurent Pinchart
  2014-05-12 14:58         ` Joerg Roedel
@ 2014-05-13 14:37         ` David Woodhouse
  2014-05-14  1:46         ` Benjamin Herrenschmidt
  3 siblings, 0 replies; 42+ messages in thread
From: David Woodhouse @ 2014-05-13 14:37 UTC (permalink / raw)
  To: James Bottomley; +Cc: ksummit-discuss

On Fri, 2014-05-09 at 13:19 -0700, James Bottomley wrote:
> But for a laptop or desktop kernel, how far do we want to go?  In
> theory, once the iommu is turned on, it corrals the device, since access
> to non programmed addresses (those without IOTLB entries) produces a
> fault.  Is there anything extra we need to do beyond turning on the
> IOMMU?

Well, if it persists in misbehaving, we can try a function level reset.
Or perhaps power cycle it — we've gained the facility for power
management reasons, but we can probably use it for beating the device on
the head when it's naughty too. And some platforms can electrically
isolate misbehaving devices completely, rather than just ignoring their
DMA attempts.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-13 11:27                     ` David Woodhouse
@ 2014-05-13 17:25                       ` Daniel Vetter
  2014-05-14  1:50                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 42+ messages in thread
From: Daniel Vetter @ 2014-05-13 17:25 UTC (permalink / raw)
  To: David Woodhouse; +Cc: James Bottomley, ksummit-discuss

On Tue, May 13, 2014 at 1:27 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Mon, 2014-05-12 at 19:04 +0200, Daniel Vetter wrote:
>> On Mon, May 12, 2014 at 6:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > Just to check: are you talking about disabling the IOMMU if there's a
>> > fault storm or disabling reporting of IOMMU faults?
>>
>> Re-enabling of the IOMMU after it was completely shut off to isolate a
>> fault storm from a rogue device. Since if I as a developer still have
>> to reboot if I wreak havoc in my driver it's only marginally better
>> than a box that went down in an iommu page fault storm. But if I can
>> just reload the driver (with the bug fixed) and get back a working
>> device because the IOMMU was re-enabled then that would help. Not
>> sure yet how feasible this really is.
>
> You probably don't want to completely isolate it in that case. If it's
> doing some bad DMA *and* it's also doing some good DMA to display its
> framebuffer, why stop the latter?

Yeah, I think some coordination between driver and iommu subsystem
when bad things happen would be useful. One example is that i915 could
block further command submission once a storm happens to prevent more
damage. And if the IOMMU can disable fault reporting while everything
else keeps on working as best as possible that's indeed useful. But
imo the first line should be damage control, if we can save a few bits
that's just the icing on the cake.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-08 12:37 [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation David Woodhouse
                   ` (4 preceding siblings ...)
  2014-05-09 19:37 ` Josh Triplett
@ 2014-05-14  1:24 ` Benjamin Herrenschmidt
  5 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-14  1:24 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Gavin Shan, ksummit-discuss

On Thu, 2014-05-08 at 13:37 +0100, David Woodhouse wrote:
> I'd like to have a discussion about handling device errors.
> 
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.

 .../...

I'm definitely interested in this, and would nominate Gavin Shan from
IBM as well who is our EEH expert for the kernel.

To cut a long story short, we have an extensive set of HW facilities
in our PCI host bridges to detect errors and freeze all operations
in and out of devices upon detection of errors, in order to prevent
propagation of bad data.

In addition, we have a recovery process involving the few drivers
that support the corresponding hooks. We could describe the process;
it can be fairly convoluted.

We fall back to simulating an unplug of the device (unbind the driver),
a reset and a re-bind for devices that don't have the hooks.
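
The two paths can be outlined as follows. This is a greatly simplified
sketch that only records the order of steps; the real process has more
stages and failure handling, and every name in it (enum recovery_step,
struct recovery_trace, recover_device) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum recovery_step {
	STEP_FREEZE,        /* hardware has frozen traffic to/from the device */
	STEP_NOTIFY_DRIVER, /* driver's error hook is told about the error */
	STEP_UNBIND,        /* fallback: pretend the device was unplugged */
	STEP_RESET,         /* reset the device */
	STEP_RESUME_DRIVER, /* driver's resume hook restarts I/O */
	STEP_REBIND,        /* fallback: bind the driver again */
};

#define RECOVERY_MAX_STEPS 8

/* Records which steps were taken, in order. */
struct recovery_trace {
	enum recovery_step steps[RECOVERY_MAX_STEPS];
	size_t nr_steps;
};

static void trace_step(struct recovery_trace *t, enum recovery_step s)
{
	assert(t && t->nr_steps < RECOVERY_MAX_STEPS);
	t->steps[t->nr_steps++] = s;
}

/* Pick the recovery path depending on whether the driver has the hooks. */
static void recover_device(bool driver_has_hooks, struct recovery_trace *t)
{
	trace_step(t, STEP_FREEZE);

	if (driver_has_hooks) {
		trace_step(t, STEP_NOTIFY_DRIVER);
		trace_step(t, STEP_RESET);
		trace_step(t, STEP_RESUME_DRIVER);
	} else {
		trace_step(t, STEP_UNBIND);
		trace_step(t, STEP_RESET);
		trace_step(t, STEP_REBIND);
	}
}
```

Both paths share the freeze and the reset; what differs is whether the
driver keeps its state across the reset or is torn down and started over.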

Cheers,
Ben.


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 11:31     ` Laurent Pinchart
@ 2014-05-14  1:28       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-14  1:28 UTC (permalink / raw)
  To: Laurent Pinchart; +Cc: James Bottomley, ksummit-discuss

On Fri, 2014-05-09 at 13:31 +0200, Laurent Pinchart wrote:
> 
> The latter will likely require a mix of generic code for device isolation 
> and/or reset (when possible) and driver-specific code for proper recovery.

We already have some amount of hooks that drivers can implement for that
but most of the core and policy are a mixture of HW facilities and
platform specific code, at least for PowerPC EEH. But ACPI/AER somewhat
piggy-backs on the same hooks today so I think the driver side interface
is a good start.

We do want to improve reporting if possible (ie, some IOMMUs will tell
us more about the actual error than others).
>  A
> fast reaction to prevent more faults from being generated should be coupled 
> with a slower reaction to fix the actual cause of the problem. I expect the 
> problem to be fatal in most cases, and, for IOMMUs again, usually caused by a 
> software bug rather than a hardware misbehaviour (although the latter can of 
> course happen). From an overall system point of view preventing the denial of 
> service that follows such errors (caused by kernel log flooding for instance, 
> or by the IOMMU being unable to serve other bus masters) could be our first 
> priority.

Don't discount HW issues and the effect of random bit flips on crappy HW
using cheap latches and no ECC or even parity on its internal busses inside
very noisy environments such as ... your computer :-)

Cheers,
Ben.


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 17:48 ` Roland Dreier
  2014-05-09 17:58   ` Matthew Wilcox
@ 2014-05-14  1:40   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-14  1:40 UTC (permalink / raw)
  To: Roland Dreier; +Cc: ksummit-discuss

On Fri, 2014-05-09 at 10:48 -0700, Roland Dreier wrote:
> 
> I think there's a more general problem that's worth talking about
> here.  In addition to IOMMU faults, there are lots of other PCI errors
> that can happen, and we have some small number of drivers that have
> been "hardened" to try and recover from these errors.  However even
> for these "hardened" drivers it seems pretty easy to hit deadlocks
> when the driver tries to tear down and reinitialize things.

Right. We are hitting that every time we test a new round of machines /
FW / Distro on power when testing EEH. The error paths in the drivers are
very badly tested.

For example, when our HW "isolates" a device, all reads start returning
ff's on MMIOs. Plenty of drivers will have either infinite or
very-long-timeout loops waiting for a bit to clear...

Also, when our HW decides to fence the entire PCI Express controller
(which can happen for example if it took a parity error in an internal
cache), subsequent MMIOs return ff's but also take a long time (hundreds
of microseconds or more).

We had issues where drivers implement timeouts like this:

	for (i = 0; i < 10000; i++) {
		foo = readl(bar);
		if ((foo & my_bit) == 0)
			break;
		udelay(1);
	}

And expect this to be a 10ms timeout ... in fenced situations, it ends
up being a 100ms or 1s timeout (we've seen much longer ones).
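
A sketch of one way to avoid that, not taken from any existing driver:
bound the loop by elapsed time rather than by iteration count, and treat
an all-ones read as "device gone" instead of "bit still set". Everything
below (sim_dev, sim_readl, sim_udelay, poll_bit_clear, sim_poll) is
hypothetical and simulates the clock and the register in user space so the
arithmetic can be checked. It also assumes that all-ones is never a
legitimate value for the register being polled.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated device and clock: hypothetical stand-ins for readl()/udelay(). */
struct sim_dev {
	uint64_t now_us;       /* simulated monotonic time, in microseconds */
	uint64_t read_cost_us; /* latency of a single MMIO read */
	uint32_t reg;          /* value that every read returns */
};

/* Each read advances the simulated clock by the configured latency. */
static uint32_t sim_readl(struct sim_dev *d)
{
	d->now_us += d->read_cost_us;
	return d->reg;
}

static void sim_udelay(struct sim_dev *d, uint64_t us)
{
	d->now_us += us;
}

enum poll_status { POLL_OK, POLL_TIMEOUT, POLL_DEVICE_GONE };

/* Wait for 'bit' to clear, bounded by elapsed time, not by loop count. */
static enum poll_status poll_bit_clear(struct sim_dev *d, uint32_t bit,
				       uint64_t timeout_us)
{
	uint64_t deadline;

	assert(timeout_us > 0);
	deadline = d->now_us + timeout_us;

	for (;;) {
		uint32_t val = sim_readl(d);

		/* Assumption: all-ones means isolated/fenced/unplugged. */
		if (val == UINT32_MAX)
			return POLL_DEVICE_GONE;
		if ((val & bit) == 0)
			return POLL_OK;
		/* Slow reads eat into the budget instead of stretching it. */
		if (d->now_us >= deadline)
			return POLL_TIMEOUT;
		sim_udelay(d, 1);
	}
}

struct poll_outcome {
	enum poll_status status;
	uint64_t elapsed_us;
};

/* Convenience wrapper: run one poll against a fresh simulated device. */
static struct poll_outcome sim_poll(uint32_t reg, uint64_t read_cost_us,
				    uint32_t bit, uint64_t timeout_us)
{
	struct sim_dev d = { .now_us = 0, .read_cost_us = read_cost_us, .reg = reg };
	struct poll_outcome out;

	out.status = poll_bit_clear(&d, bit, timeout_us);
	out.elapsed_us = d.now_us;
	return out;
}
```

With a 500us read latency and the bit stuck at 1,
sim_poll(0x1, 500, 0x1, 10000) gives up after roughly 10ms of simulated
time, where the counted loop above would spin for about 5 seconds
(10000 iterations of 501us each).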

One way to help find/fix these would be a better error injection
capability to "isolate" devices, for example by remapping their
MMIOs to something that returns ff's :-)
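
To make the idea concrete, here is a toy user-space model of such an
injection switch. It is only a sketch: inj_dev, inj_readl, inj_writel and
inj_set_isolated are made-up names, and a real facility would have to
remap the actual MMIO mapping rather than go through an accessor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a device whose MMIO window can be "isolated". */
struct inj_dev {
	uint32_t *regs;  /* backing store standing in for the MMIO window */
	size_t nregs;    /* number of 32-bit registers in the window */
	bool isolated;   /* injection switch: true = behave as if fenced */
};

/* Flip the injection switch; the register contents are left untouched. */
static void inj_set_isolated(struct inj_dev *d, bool isolated)
{
	d->isolated = isolated;
}

/* Reads from an isolated (or out-of-range) register return all ones. */
static uint32_t inj_readl(const struct inj_dev *d, size_t idx)
{
	if (d->isolated || idx >= d->nregs)
		return UINT32_MAX;
	return d->regs[idx];
}

/* Writes to an isolated (or out-of-range) register are silently dropped. */
static void inj_writel(struct inj_dev *d, size_t idx, uint32_t val)
{
	if (d->isolated || idx >= d->nregs)
		return;
	d->regs[idx] = val;
}
```

A driver test would flip inj_set_isolated(dev, true) in the middle of an
operation and check that every wait loop and teardown path still
terminates.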

> So I wonder if we can do better without proliferating error handling
> tentacles into all sorts of low-level drivers ("did we just read
> 0xffffffff here?  how about here?  are we in the middle of error
> recovery?  how about now?").

We can't because ultimately, that is what HW will return when it's
broken, disconnected, lost a link, or EEH.

> One context where this is becoming a real concern is with NVMe drives.
>  These are SSDs that (may) look like normal 2.5" drives, but use PCIe
> rather than SATA or SAS to connect to the host.  Since they look like
> normal drives, it's natural to put them into hot-pluggable JBODs, but
> it turns out we react much worse to PCIe surprise removal than, say,
> SAS hotplug.

Cheers,
Ben.


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 19:37 ` Josh Triplett
  2014-05-09 19:44   ` David Woodhouse
  2014-05-09 19:53   ` Roland Dreier
@ 2014-05-14  1:42   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-14  1:42 UTC (permalink / raw)
  To: Josh Triplett; +Cc: ksummit-discuss

On Fri, 2014-05-09 at 12:37 -0700, Josh Triplett wrote:
> I'm interested in a related topic: we should systematically use IOMMUs
> and similar hardware features to protect against buggy or *malicious*
> hardware devices.  Consider a laptop with an ExpressCard port: plug in a
> device and you have full PCIe access.  (The same goes for other systems
> if you open up the case.)  We should ensure that devices with no device
> driver have zero privileges, and devices with a device driver have
> carefully whitelisted privileges.

On the other hand, we have been going backward implementing iommu bypass
on power for non-virtualized systems because of the performance cost of
the IOMMU which can be non-trivial, especially for network devices.

It becomes a policy decision, which is fine. However, having a "generic"
way to configure that policy, possibly per-adapter, rather than each IOMMU
implementation doing its own, would make it a lot more palatable in the field.

Cheers,
Ben.


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 19:53   ` Roland Dreier
  2014-05-09 20:13     ` Luck, Tony
@ 2014-05-14  1:43     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-14  1:43 UTC (permalink / raw)
  To: Roland Dreier; +Cc: ksummit-discuss

On Fri, 2014-05-09 at 12:53 -0700, Roland Dreier wrote:
> 
> Stuff without a device driver should be OK, since we don't turn on any
> bits in the PCI command register until pci_enable_device().  So the
> device can't be a bus master until someone claims it.

Provided you trust the device not to utterly ignore that bit ...

> For devices with a driver, I guess it couldn't hurt.  But my wifi
> adapter can already sniff and modify all my network traffic, etc.
> 
> I do agree that it's a bit sad that the current state of VT-d is such
> that distros don't use it by default.

Cheers,
Ben.


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-09 20:19       ` James Bottomley
                           ` (2 preceding siblings ...)
  2014-05-13 14:37         ` David Woodhouse
@ 2014-05-14  1:46         ` Benjamin Herrenschmidt
  3 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-14  1:46 UTC (permalink / raw)
  To: James Bottomley; +Cc: ksummit-discuss

On Fri, 2014-05-09 at 13:19 -0700, James Bottomley wrote:
> > Defending against buggy hardware is interesting from a RAS perspective.
> > You don't want a card with a stuck address line scribbling on memory
> > that you didn't want it to touch.
> 
> But for a laptop or desktop kernel, how far do we want to go?  In
> theory, once the iommu is turned on, it corrals the device, since access
> to non programmed addresses (those without IOTLB entries) produces a
> fault.  Is there anything extra we need to do beyond turning on the
> IOMMU?

These tend to be the ones with the cheapest/buggiest hardware and less
prone to be affected by a 100ns overhead on PCIe transactions.

So they are the prime candidate for defaulting to a stricter protection
model.

They are also the ones likely to have an evil express card or whatever
that new PCIe-on-the-wire pushed by Apple & Intel or similar plugged into
them that tries to steal their data.

Conversely, servers in controlled data centers using high end adapters
might wish to disable the IOMMU for performance reasons.

Cheers,
Ben.


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-13 11:27                     ` David Woodhouse
  2014-05-13 17:25                       ` Daniel Vetter
@ 2014-05-14  1:50                       ` Benjamin Herrenschmidt
  2014-05-14 20:09                         ` Daniel Vetter
  1 sibling, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-14  1:50 UTC (permalink / raw)
  To: David Woodhouse; +Cc: James Bottomley, ksummit-discuss

On Tue, 2014-05-13 at 12:27 +0100, David Woodhouse wrote:
> You probably don't want to completely isolate it in that case. If it's
> doing some bad DMA *and* it's also doing some good DMA to display its
> framebuffer, why stop the latter?

I don't think you can go to that level of granularity. We certainly
can't on power.

Propagation of bad data due to faulty adapters or simple bit flips
is a real big issue on servers and the policy for us is simple: on the
first "hint" of an error, block *all* traffic to and from the adapter.

Then the driver can get into the dance to figure out what's up (we can
selectively enable MMIO under driver control to try to get at diagnostic
registers for example) and reset / reconfigure things.
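
Roughly, that policy is a small state machine. The sketch below is only
an illustration of the sequence described above, with made-up names
(adapter, adapter_report_error, adapter_thaw_mmio, adapter_reset_done);
it is not the actual EEH interface.

```c
#include <stdbool.h>

/* Hypothetical isolation states for one adapter. */
enum adapter_state {
	ADAPTER_NORMAL,      /* MMIO and DMA both flow */
	ADAPTER_FROZEN,      /* everything blocked after the first error */
	ADAPTER_MMIO_THAWED  /* MMIO re-enabled for diagnostics, DMA still blocked */
};

struct adapter {
	enum adapter_state state;
};

/* Any error, in any state, freezes all traffic to and from the adapter. */
static void adapter_report_error(struct adapter *a)
{
	a->state = ADAPTER_FROZEN;
}

/* The driver may ask for MMIO back, but only while the adapter is frozen. */
static bool adapter_thaw_mmio(struct adapter *a)
{
	if (a->state != ADAPTER_FROZEN)
		return false;
	a->state = ADAPTER_MMIO_THAWED;
	return true;
}

/* Called once the driver has reset and reconfigured the adapter. */
static void adapter_reset_done(struct adapter *a)
{
	a->state = ADAPTER_NORMAL;
}

static bool adapter_mmio_allowed(const struct adapter *a)
{
	return a->state != ADAPTER_FROZEN;
}

static bool adapter_dma_allowed(const struct adapter *a)
{
	return a->state == ADAPTER_NORMAL;
}
```

The point of the ordering is that DMA never comes back before a reset,
while MMIO can be reopened early so the driver can read diagnostic
registers.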

> The Intel IOMMU at least can be configured to avoid reporting faults for
> a given device (well, requester-id). So valid transactions still happen,
> while invalid transactions are still blocked. But silently, without
> bothering the host with the details and causing a fault-IRQ storm.

I would argue against that sort of policy. At least in server contexts.

It could well be that this is appropriate for laptops/desktops, I don't know,
but once an adapter starts doing bad DMAs, I think you can't really trust
much out of it anymore at all.

Cheers,
Ben.


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-14  1:50                       ` Benjamin Herrenschmidt
@ 2014-05-14 20:09                         ` Daniel Vetter
  2014-05-15  1:08                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2014-05-14 20:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: James Bottomley, ksummit-discuss

On Wed, May 14, 2014 at 3:50 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>> The Intel IOMMU at least can be configured to avoid reporting faults for
>> a given device (well, requester-id). So valid transactions still happen,
>> while invalid transactions are still blocked. But silently, without
>> bothering the host with the details and causing a fault-IRQ storm.
>
> I would argue against that sort of policy. At least in server contexts.
>
> It could well be that this is appropriate for laptops/desktops, I don't know,
> but once an adapter starts doing bad DMAs, I think you can't really trust
> much out of it anymore at all.

I'm not sure we really need to make a server/desktop distinction here
but more whether the driver (and all the stuff relying on it) care
about data integrity all that much. With gpus we can forward such
information to userspace and through some opengl extensions to
applications, and the expectation is very much that if you want robust
opengl, you need to be able to cope. The extension essentially tells
you "oops, sorry something bad happened, please throw away all your
gpu buffers".

Of course if a gpu reset does not fix the situation the driver should
be able to tell the iommu to give up and fully isolate it. Also, to
really make this work we'd need a way to tell the iommu to re-allow
everything again and track faults again. Otherwise we can't tell
whether the gpu reset worked in resolving the fault storm.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
  2014-05-14 20:09                         ` Daniel Vetter
@ 2014-05-15  1:08                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-15  1:08 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: James Bottomley, ksummit-discuss

On Wed, 2014-05-14 at 22:09 +0200, Daniel Vetter wrote:

> I'm not sure we really need to make a server/desktop distinction here
> but more whether the driver (and all the stuff relying on it) care
> about data integrity all that much. With gpus we can forward such
> information to userspace and through some opengl extensions to
> applications, and the expectation is very much that if you want robust
> opengl, you need to be able to cope. The extension essentially tells
> you "oops, sorry something bad happened, please throw away all your
> gpu buffers".
> 
> Of course if a gpu reset does not fix the situation the driver should
> be able to tell the iommu to give up and fully isolate it. Also, to
> really make this work we'd need a way to tell the iommu to re-allow
> everything again and track faults again. Otherwise we can't tell
> whether the gpu reset worked in resolving the fault storm.

Right, though arguably in that context, doing an unconditional freeze
on error is still perfectly fine as long as the driver has the option
to unfreeze selectively.

Cheers,
Ben.


Thread overview: 42+ messages
2014-05-08 12:37 [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation David Woodhouse
2014-05-08 18:03 ` Bjorn Helgaas
2014-05-08 20:00   ` Rafael J. Wysocki
2014-05-08 19:56 ` James Bottomley
2014-05-09  8:55   ` David Woodhouse
2014-05-09 11:31     ` Laurent Pinchart
2014-05-14  1:28       ` Benjamin Herrenschmidt
2014-05-09 17:48 ` Roland Dreier
2014-05-09 17:58   ` Matthew Wilcox
2014-05-09 18:08     ` Roland Dreier
2014-05-14  1:40   ` Benjamin Herrenschmidt
2014-05-09 18:05 ` Will Deacon
2014-05-12 15:03   ` Joerg Roedel
2014-05-09 19:37 ` Josh Triplett
2014-05-09 19:44   ` David Woodhouse
2014-05-09 19:53   ` Roland Dreier
2014-05-09 20:13     ` Luck, Tony
2014-05-09 20:19       ` James Bottomley
2014-05-10  1:09         ` Laurent Pinchart
2014-05-11 22:43           ` Daniel Vetter
2014-05-12 15:07             ` Joerg Roedel
2014-05-12 15:35               ` Daniel Vetter
2014-05-12 16:16                 ` Andy Lutomirski
2014-05-12 16:28                   ` Joerg Roedel
2014-05-12 16:59                     ` Laurent Pinchart
2014-05-12 17:15                       ` Joerg Roedel
2014-05-12 17:11                     ` Daniel Vetter
2014-05-12 17:40                       ` Joerg Roedel
2014-05-13 10:06                         ` Daniel Vetter
2014-05-12 17:04                   ` Daniel Vetter
2014-05-13 11:27                     ` David Woodhouse
2014-05-13 17:25                       ` Daniel Vetter
2014-05-14  1:50                       ` Benjamin Herrenschmidt
2014-05-14 20:09                         ` Daniel Vetter
2014-05-15  1:08                           ` Benjamin Herrenschmidt
2014-05-12 16:26                 ` Joerg Roedel
2014-05-12 14:58         ` Joerg Roedel
2014-05-13 14:37         ` David Woodhouse
2014-05-14  1:46         ` Benjamin Herrenschmidt
2014-05-14  1:43     ` Benjamin Herrenschmidt
2014-05-14  1:42   ` Benjamin Herrenschmidt
2014-05-14  1:24 ` Benjamin Herrenschmidt
