* [Ksummit-discuss] [TECH TOPIC] Project Banbury
@ 2018-09-14 17:28 Matthew Wilcox
2018-09-16 10:53 ` Hannes Reinecke
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Matthew Wilcox @ 2018-09-14 17:28 UTC (permalink / raw)
To: ksummit-discuss
We've all pulled the wrong drive out of a machine or unplugged a USB key
before the write back has completely finished. You try to plug it back in,
but the damage is done. The pending writes are lost, the filesystem is
damaged and full of errors and you are having a Bad Day. What if ...
plugging the drive back in could be made to work?
This session would be more of a discussion than a presentation since I've
not written a single line of code towards fixing the problem. I have
written a web page sketching out an architecture for how we might make this
work:
http://www.wil.cx/~willy/banbury.html
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
@ 2018-09-16 10:53 ` Hannes Reinecke
2018-09-16 12:45 ` Matthew Wilcox
2018-09-16 16:03 ` Laurent Pinchart
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2018-09-16 10:53 UTC (permalink / raw)
To: ksummit-discuss

On 09/14/2018 07:28 PM, Matthew Wilcox wrote:
> We've all pulled the wrong drive out of a machine or unplugged a USB key
> before the write back has completely finished. You try to plug it back
> in, but the damage is done. The pending writes are lost, the filesystem
> is damaged and full of errors and you are having a Bad Day. What if ...
> plugging the drive back in could be made to work?
>
> This session would be more of a discussion than a presentation since
> I've not written a single line of code towards fixing the problem. I
> have written a web page sketching out an architecture for how we might
> make this work:
>
I'd be all for it.
Maybe we can have a session in Edinburgh to discuss things further

Cheers,

Hannes
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-16 10:53 ` Hannes Reinecke
@ 2018-09-16 12:45 ` Matthew Wilcox
2018-09-18 8:17 ` Hannes Reinecke
0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2018-09-16 12:45 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: ksummit-discuss

Unfortunately I won't be in Edinburgh. My understanding was that the
tech topics are for Plumbers. Since the user space component is
probably the more interesting and complex component, Plumbers is
probably the better conference for a discussion anyway.

On Sun., Sep. 16, 2018, 11:53 Hannes Reinecke, <hare@suse.com> wrote:
> On 09/14/2018 07:28 PM, Matthew Wilcox wrote:
> > We've all pulled the wrong drive out of a machine or unplugged a USB key
> > before the write back has completely finished. You try to plug it back
> > in, but the damage is done. The pending writes are lost, the filesystem
> > is damaged and full of errors and you are having a Bad Day. What if ...
> > plugging the drive back in could be made to work?
> >
> > This session would be more of a discussion than a presentation since
> > I've not written a single line of code towards fixing the problem. I
> > have written a web page sketching out an architecture for how we might
> > make this work:
> >
> I'd be all for it.
> Maybe we can have a session in Edinburgh to discuss things further
>
> Cheers,
>
> Hannes
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-16 12:45 ` Matthew Wilcox
@ 2018-09-18 8:17 ` Hannes Reinecke
0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2018-09-18 8:17 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: ksummit-discuss

On 09/16/2018 02:45 PM, Matthew Wilcox wrote:
> Unfortunately I won't be in Edinburgh. My understanding was that the
> tech topics are for Plumbers. Since the user space component is
> probably the more interesting and complex component, Plumbers is
> probably the better conference for a discussion anyway.
>
Au contraire.

At the moment most filesystems are not able to report an I/O error
directly to the application because of internal caching. An I/O error
is only ever seen once the cached data is written to disk, but by then
the fs has already acknowledged the write to the application, so there
is literally no way we could signal the I/O error (other than setting
the fs read-only).

Given that, it might be possible to switch to an asynchronous model of
signalling I/O errors, much like DAX does nowadays. I.e. we could signal
errors to the filesystem via asynchronous methods, and do away with the
current synchronous model.

I guess some further discussion is warranted here. Sadly I can't go to
Vancouver, so we'll need to find another venue; LSF next year?

Cheers,

Hannes
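[Editor's sketch] Seen from the application, the window Hannes describes looks roughly like the following. This is a minimal illustration, not anything proposed in the thread: the helper name write_and_sync is hypothetical, and it assumes a kernel that reports a failed writeback through fsync(). write() only copies into the page cache, so its success says nothing about the device; an error can only surface at a later fsync(), and never if the application does not call it.

```python
import os
from typing import Optional


def write_and_sync(path: str, data: bytes) -> Optional[int]:
    """Write data, then fsync. Return None on success, or the errno that
    fsync() reported (for example EIO after a failed writeback).

    Hypothetical helper, for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # write() normally just copies into the page cache; returning
        # successfully here does not mean the data reached the device.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        try:
            # fsync() forces writeback. A device error shows up here,
            # after write() has already been acknowledged.
            os.fsync(fd)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        os.close(fd)
```

An application that never calls fsync() has no point at which the error could be handed back to it, which is the gap an asynchronous signalling model would have to fill.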
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
2018-09-16 10:53 ` Hannes Reinecke
@ 2018-09-16 16:03 ` Laurent Pinchart
2018-09-16 16:25 ` Linus Torvalds
2018-09-16 16:37 ` James Bottomley
2018-09-16 23:58 ` David Howells
3 siblings, 1 reply; 9+ messages in thread
From: Laurent Pinchart @ 2018-09-16 16:03 UTC (permalink / raw)
To: ksummit-discuss

Hi Matthew,

On Friday, 14 September 2018 20:28:01 EEST Matthew Wilcox wrote:
> We've all pulled the wrong drive out of a machine or unplugged a USB key
> before the write back has completely finished. You try to plug it back in,
> but the damage is done. The pending writes are lost, the filesystem is
> damaged and full of errors and you are having a Bad Day. What if ...
> plugging the drive back in could be made to work?
>
> This session would be more of a discussion than a presentation since I've
> not written a single line of code towards fixing the problem. I have
> written a web page sketching out an architecture for how we might make this
> work:
>
> http://www.wil.cx/~willy/banbury.html

Having lost a server due to a DDoS attack that rendered the link between
CPU and storage unusable for too long, I think this would be an amazing
improvement.

--
Regards,

Laurent Pinchart
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-16 16:03 ` Laurent Pinchart
@ 2018-09-16 16:25 ` Linus Torvalds
0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2018-09-16 16:25 UTC (permalink / raw)
To: Laurent Pinchart; +Cc: ksummit

On Sun, Sep 16, 2018 at 9:03 AM Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
> >
> > http://www.wil.cx/~willy/banbury.html
>
> Having lost a server due to a DDoS attack that rendered the link between CPU
> and storage unusable for a too long time, I think this would be an amazing
> improvement.

I agree on the "amazing", but in a more literal sense.

I don't think it's all that realistic. Pausing IO will basically hang
the machine, and you'll run out of memory in not too long too.

It's probably doable with a mount option and filesystem help (aka
"intr" for NFS). But people should be aware that one reason "intr"
worked as well as it did for NFS was that it

 (a) broke POSIX rules

 (b) NFS traditionally did almost synchronous writes

 (c) the metadata is/was on the disconnected side

and even then, you really really didn't want NFS "intr" to be on a
core filesystem.

With hotplug devices, you have some "interesting" issues in addition,
namely making sure you really connect it back to the right disk, and
don't re-use *anything* in case it turns out it's not the same one.

Even for the "simple" USB case, you'll have serious issues with the
serial numbers not being reliable (maybe things are better now, but it
used to be that the supposedly "unique" USB serial number wasn't
unique at all).

So I think it would be a good addition, but people should realize that
the current behavior is there for some pretty fundamental reasons.

Linus
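[Editor's sketch] One way to picture the identity problem Linus raises: a serial number alone is not trusted, so the host also records a fingerprint of some on-disk blocks and refuses to reattach on any mismatch. Nothing in the thread specifies such a scheme. The names remember, fingerprint and is_same_device, and the idea of sampling blocks, are assumptions made purely for illustration. The sampled blocks would have to be ones not written since the last acknowledged flush, since the contents of later writes are unknown after an unplug.

```python
import hashlib
from typing import Callable, Dict, List

ReadBlock = Callable[[int], bytes]  # hypothetical: returns the bytes at an LBA


def fingerprint(read_block: ReadBlock, sample_lbas: List[int]) -> str:
    """Hash the contents of a few sampled blocks together with their LBAs."""
    digest = hashlib.sha256()
    for lba in sample_lbas:
        digest.update(lba.to_bytes(8, "little"))
        digest.update(read_block(lba))
    return digest.hexdigest()


def remember(serial: str, read_block: ReadBlock, sample_lbas: List[int]) -> Dict:
    """Record what the device looked like while it was still attached."""
    return {
        "serial": serial,
        "sample_lbas": list(sample_lbas),
        "fingerprint": fingerprint(read_block, sample_lbas),
    }


def is_same_device(saved: Dict, serial: str, read_block: ReadBlock) -> bool:
    """Accept a reconnect only if the serial AND the sampled contents match."""
    if serial != saved["serial"]:
        return False
    return fingerprint(read_block, saved["sample_lbas"]) == saved["fingerprint"]
```

A matching serial with different sampled contents is rejected, which covers the case of two sticks reporting the same supposedly unique serial number.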
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
2018-09-16 10:53 ` Hannes Reinecke
2018-09-16 16:03 ` Laurent Pinchart
@ 2018-09-16 16:37 ` James Bottomley
2018-09-16 19:25 ` Theodore Y. Ts'o
2018-09-16 23:58 ` David Howells
3 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2018-09-16 16:37 UTC (permalink / raw)
To: Matthew Wilcox, ksummit-discuss

On Fri, 2018-09-14 at 18:28 +0100, Matthew Wilcox wrote:
> We've all pulled the wrong drive out of a machine or unplugged a USB
> key before the write back has completely finished. You try to plug it
> back in, but the damage is done. The pending writes are lost, the
> filesystem is damaged and full of errors and you are having a Bad
> Day. What if ... plugging the drive back in could be made to work?

For a lot of modern external storage devices this simply can't be made
to work. The reason is they all have an internal write back cache to
make operations faster and if they're SATA they may lie about it and if
they're USB they always lie about it. For these devices we have a set
of writes that we think are completed but in fact only hit the device
cache. When you pulled it out, the cache was lost and so were these
writes. This is unfixable on the host side unless there's some way we
can get the device to tell us it has a write back cache and behave
correctly with regard to flushes.

Even for devices that behave correctly, we currently have no real way
to repeat the I/O that was lost in the powered down cache, unless you
have a way to cope with this case (it doesn't seem to be accounted for
in your plan)? The reason is we use barrier type caches which assume
everything behind them is available to the device (either on disk or in
the cache). The block layer would need some way to replay I/Os (in
order) from the last barrier because some of them might have been lost
from the cache.

Provided we have write through caches (not a given), the lower layer
error handling will mostly take care of repeating the lost but
unacknowledged I/O provided you preserve the queue, so I agree that
part can work, but the big thing is having a write through cache.

James
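[Editor's sketch] The failure James describes can be modelled with a toy device. This is a simulation under stated assumptions, not kernel code: WriteBackDevice and replay_since_flush are hypothetical names, and the device here honours flushes, which James notes real USB devices may not. Writes are acknowledged as soon as they reach the volatile cache, an unplug discards the cache, and the host can only recover if it still holds every write issued since the last flush and resends them in order.

```python
class WriteBackDevice:
    """Toy model of a drive with a volatile write-back cache (hypothetical)."""

    def __init__(self):
        self.media = {}  # blocks that are durable on the medium
        self.cache = {}  # acknowledged blocks held only in volatile cache

    def write(self, lba, data):
        self.cache[lba] = data  # acknowledged immediately, not yet durable
        return True

    def flush(self):
        self.media.update(self.cache)  # cache flush: contents reach the medium
        self.cache.clear()

    def power_loss(self):
        self.cache.clear()  # the unplug: everything still cached is gone

    def read(self, lba):
        return self.cache.get(lba, self.media.get(lba))


def replay_since_flush(device, log):
    """Resend, in order, every write issued after the last flush."""
    for lba, data in log:
        device.write(lba, data)
    device.flush()
```

In this model, a block written and acknowledged after the last flush reads back as missing after power_loss(), and reappears only once the host replays its own log of (lba, data) pairs.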
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-16 16:37 ` James Bottomley
@ 2018-09-16 19:25 ` Theodore Y. Ts'o
0 siblings, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2018-09-16 19:25 UTC (permalink / raw)
To: James Bottomley; +Cc: ksummit-discuss

On Sun, Sep 16, 2018 at 09:37:50AM -0700, James Bottomley wrote:
>
> For a lot of modern external storage devices this simply can't be made
> to work. The reason is they all have an internal write back cache to
> make operations faster and if they're SATA they may lie about it and if
> they're USB they always lie about it. For these devices we have a set
> of writes that we think are completed but in-fact only hit the device
> cache. When you pulled it out, the cache was lost and so were these
> writes. This is unfixable on the host side unless there's some way we
> can get the device to tell us it has a write back cache and behave
> correctly with regard to flushes.
>
> Even for devices that behave correctly, we currently have no real way
> to repeat the I/O that was lost in the powered down cache, unless you
> have a way to cope with this case (it doesn't seem to be accounted for
> in your plan)? The reason is we use barrier type caches which assume
> everything behind them is available to the device (either on disk or in
> the cache). The block layer would need some way to replay I/Os (in
> order) from the last barrier because some of them might have been lost
> from the cache.
>
> Provided we have write through caches (not a given), the lower layer
> error handling will mostly take care of repeating the lost but
> unacknowledged I/O provided you preserve the queue, so I agree that
> part can work, but the big thing is having a write through cache.

The way I'd suggest approaching this is not by making any changes in
the block layer at all. Instead, I'd suggest putting all of the magic
into a device-mapper device.

The device mapper driver would be responsible for keeping a copy of
all blocks written to the removable device in kernel memory. This
would work much like the TCP retransmit buffers; which is to say,
until we are *sure* that we don't need to retransmit the writes to the
device, we have to keep a copy in non-swappable kernel memory. To
avoid overflowing all available memory, there must be a configurable
cap on the maximum memory that can be used for retransmit buffers, and
the driver must periodically send a CACHE FLUSH command to the block
device so that buffer space can be freed once the device has
acknowledged the CACHE FLUSH command.

If the real device ever gets yanked, this ends up disconnecting the
block device from the device-mapper pseudo-device. When the USB thumb
drive gets plugged back in, userspace would be responsible for
determining that it is the previously attached external device, and
reconnecting it to the device-mapper device, which can then replay all
of the blocks in retransmit buffers.

Advantages of this strategy:

* No need to make changes to the block device layer

* The overhead of dealing with removable devices can be avoided for
  devices which are non-removable.

* It deals with devices that don't support a write through cache; it
  only requires devices that don't lie about supporting CACHE FLUSH
  correctly.

- Ted
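[Editor's sketch] The retransmit-buffer scheme Ted outlines can be sketched in user-space Python. This is an illustration under stated assumptions, not device-mapper code: RetransmitBuffer, FakeDrive and every method name are hypothetical, the drive is assumed to honour flushes, and a real driver would block writers or flush periodically on a timer rather than raise. The host keeps a copy of every write until a flush is acknowledged, caps the memory used, and replays the retained writes in order on reconnect.

```python
class FakeDrive:
    """Stand-in for a removable drive that honours flushes (hypothetical)."""

    def __init__(self):
        self.media = {}  # durable blocks
        self.cache = {}  # volatile write-back cache

    def write(self, lba, data):
        self.cache[lba] = data

    def flush(self):
        self.media.update(self.cache)
        self.cache.clear()

    def unplug(self):
        self.cache.clear()  # cached writes are lost with the power


class RetransmitBuffer:
    """Host-side copy of every write issued since the last acknowledged flush."""

    def __init__(self, device, max_bytes):
        self.device = device  # None while the drive is unplugged
        self.max_bytes = max_bytes  # configurable cap on retained data
        self.pending = []  # (lba, data) in issue order
        self.used = 0

    def write(self, lba, data):
        if len(data) > self.max_bytes:
            raise ValueError("write larger than the retransmit cap")
        if self.used + len(data) > self.max_bytes:
            self.flush()  # free space before accepting more
        self.pending.append((lba, data))
        self.used += len(data)
        if self.device is not None:
            self.device.write(lba, data)

    def flush(self):
        if self.device is None:
            raise BufferError("cannot flush: device is unplugged")
        self.device.flush()  # once acknowledged, the copies can be dropped
        self.pending.clear()
        self.used = 0

    def disconnect(self):
        self.device = None

    def reconnect(self, device):
        """Attach a device already verified to be the same one, then replay."""
        self.device = device
        for lba, data in self.pending:
            device.write(lba, data)
        self.flush()
```

Keeping the writes as an ordered list rather than a per-block map preserves the issue order for replay, at the cost of holding overwritten blocks until the next flush.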
* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
` (2 preceding siblings ...)
2018-09-16 16:37 ` James Bottomley
@ 2018-09-16 23:58 ` David Howells
3 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2018-09-16 23:58 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: ksummit-discuss

Matthew Wilcox <willy6545@gmail.com> wrote:
> We've all pulled the wrong drive out of a machine or unplugged a USB key
> before the write back has completely finished. You try to plug it back in,
> but the damage is done. The pending writes are lost, the filesystem is
> damaged and full of errors and you are having a Bad Day. What if
> ... plugging the drive back in could be made to work?

Is this something fscache could be made to help with? Though that might
be more at a filesystem level than a blockdev level.

David