From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org [172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTP id 170ED9A0
	for ; Fri, 9 May 2014 17:49:17 +0000 (UTC)
Received: from mail-qc0-f170.google.com (mail-qc0-f170.google.com [209.85.216.170])
	by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 77416203AE
	for ; Fri, 9 May 2014 17:49:16 +0000 (UTC)
Received: by mail-qc0-f170.google.com with SMTP id i8so5009040qcq.15
	for ; Fri, 09 May 2014 10:49:15 -0700 (PDT)
MIME-Version: 1.0
Sender: roland.dreier@gmail.com
In-Reply-To: <1399552623.17118.22.camel@i7.infradead.org>
References: <1399552623.17118.22.camel@i7.infradead.org>
From: Roland Dreier
Date: Fri, 9 May 2014 10:48:55 -0700
Message-ID:
To: David Woodhouse
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: "ksummit-discuss@lists.linuxfoundation.org"
Subject: Re: [Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,

On Thu, May 8, 2014 at 5:37 AM, David Woodhouse wrote:
> I'd like to have a discussion about handling device errors.
>
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.
>
> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).

I think there's a more general problem that's worth talking about here.
In addition to IOMMU faults, there are lots of other PCI errors that
can happen, and we have some small number of drivers that have been
"hardened" to try to recover from these errors. However, even for
these "hardened" drivers it seems pretty easy to hit deadlocks when
the driver tries to tear down and reinitialize things. So I wonder if
we can do better without proliferating error handling tentacles into
all sorts of low-level drivers ("did we just read 0xffffffff here? how
about here? are we in the middle of error recovery? how about now?").

One context where this is becoming a real concern is with NVMe drives.
These are SSDs that (may) look like normal 2.5" drives, but use PCIe
rather than SATA or SAS to connect to the host. Since they look like
normal drives, it's natural to put them into hot-pluggable JBODs, but
it turns out we react much worse to PCIe surprise removal than, say,
SAS hotplug.

 - R.
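
To make the "did we just read 0xffffffff here?" concern concrete, here
is a minimal userspace sketch of the alternative: route every register
read through one accessor that notices the all-ones pattern and latches
a "dead" flag, so callers check a single return value instead of
sprinkling ad hoc tests. Everything in it (struct fake_dev, raw_read32,
dev_read32) is a hypothetical stand-in invented for illustration, not
kernel API, and it simply assumes that a removed device reads back as
all ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; not a kernel structure. */
struct fake_dev {
	uint32_t regs[4];
	bool present;	/* simulated physical presence */
	bool dead;	/* latched once a read looks like a removed device */
};

/* Simulated MMIO read: assume a removed device reads back as all ones. */
static uint32_t raw_read32(const struct fake_dev *d, unsigned int idx)
{
	return d->present ? d->regs[idx] : 0xffffffffu;
}

/*
 * Single choke point for reads. Returns false once the device looks
 * gone, and keeps returning false afterwards without touching it.
 * Caveat: all ones can also be a legitimate register value, so a real
 * driver would need to confirm removal some other way before giving up.
 */
static bool dev_read32(struct fake_dev *d, unsigned int idx, uint32_t *val)
{
	if (d->dead)
		return false;

	*val = raw_read32(d, idx);
	if (*val == 0xffffffffu) {
		d->dead = true;
		return false;
	}
	return true;
}
```

A caller would do something like `if (!dev_read32(&d, 1, &v)) bail_out();`
at each access. That moves the check into one place, but it does nothing
about the teardown deadlocks mentioned above.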