ksummit.lists.linux.dev archive mirror
* [Ksummit-discuss] [TECH TOPIC] Project Banbury
@ 2018-09-14 17:28 Matthew Wilcox
  2018-09-16 10:53 ` Hannes Reinecke
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Matthew Wilcox @ 2018-09-14 17:28 UTC (permalink / raw)
  To: ksummit-discuss

We've all pulled the wrong drive out of a machine or unplugged a USB key
before the write back has completely finished. You try to plug it back in,
but the damage is done. The pending writes are lost, the filesystem is
damaged and full of errors and you are having a Bad Day. What if ...
plugging the drive back in could be made to work?

This session would be more of a discussion than a presentation since I've
not written a single line of code towards fixing the problem. I have
written a web page sketching out an architecture for how we might make this
work:

http://www.wil.cx/~willy/banbury.html

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
@ 2018-09-16 10:53 ` Hannes Reinecke
  2018-09-16 12:45   ` Matthew Wilcox
  2018-09-16 16:03 ` Laurent Pinchart
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2018-09-16 10:53 UTC (permalink / raw)
  To: ksummit-discuss

On 09/14/2018 07:28 PM, Matthew Wilcox wrote:
> We've all pulled the wrong drive out of a machine or unplugged a USB key
> before the write back has completely finished. You try to plug it back
> in, but the damage is done. The pending writes are lost, the filesystem
> is damaged and full of errors and you are having a Bad Day. What if ...
> plugging the drive back in could be made to work?
> 
> This session would be more of a discussion than a presentation since
> I've not written a single line of code towards fixing the problem. I
> have written a web page sketching out an architecture for how we might
> make this work:
> 
I'd be all for it.
Maybe we can have a session in Edinburgh to discuss things further.

Cheers,

Hannes

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-16 10:53 ` Hannes Reinecke
@ 2018-09-16 12:45   ` Matthew Wilcox
  2018-09-18  8:17     ` Hannes Reinecke
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2018-09-16 12:45 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: ksummit-discuss

Unfortunately I won't be in Edinburgh. My understanding was that the tech
topics are for Plumbers. Since the user space component is probably the
more interesting and complex component, Plumbers is probably the better
conference for a discussion anyway.

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
  2018-09-16 10:53 ` Hannes Reinecke
@ 2018-09-16 16:03 ` Laurent Pinchart
  2018-09-16 16:25   ` Linus Torvalds
  2018-09-16 16:37 ` James Bottomley
  2018-09-16 23:58 ` David Howells
  3 siblings, 1 reply; 9+ messages in thread
From: Laurent Pinchart @ 2018-09-16 16:03 UTC (permalink / raw)
  To: ksummit-discuss

Hi Matthew,

On Friday, 14 September 2018 20:28:01 EEST Matthew Wilcox wrote:
> We've all pulled the wrong drive out of a machine or unplugged a USB key
> before the write back has completely finished. You try to plug it back in,
> but the damage is done. The pending writes are lost, the filesystem is
> damaged and full of errors and you are having a Bad Day. What if ...
> plugging the drive back in could be made to work?
> 
> This session would be more of a discussion than a presentation since I've
> not written a single line of code towards fixing the problem. I have
> written a web page sketching out an architecture for how we might make this
> work:
> 
> http://www.wil.cx/~willy/banbury.html

Having lost a server due to a DDoS attack that rendered the link between CPU
and storage unusable for too long, I think this would be an amazing
improvement.

-- 
Regards,

Laurent Pinchart

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-16 16:03 ` Laurent Pinchart
@ 2018-09-16 16:25   ` Linus Torvalds
  0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2018-09-16 16:25 UTC (permalink / raw)
  To: Laurent Pinchart; +Cc: ksummit

On Sun, Sep 16, 2018 at 9:03 AM Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
> >
> > http://www.wil.cx/~willy/banbury.html
>
> Having lost a server due to a DDoS attack that rendered the link between CPU
> and storage unusable for too long, I think this would be an amazing
> improvement.

I agree on the "amazing", but in a more literal sense. I don't think
it's all that realistic. Pausing IO will basically hang the machine,
and you'll run out of memory before long, too.

It's probably doable with a mount option and filesystem help (aka
"intr" for NFS). But people should be aware that one reason "intr"
worked as well as it did for NFS was that it

 (a) broke POSIX rules

 (b) NFS traditionally did almost synchronous writes

 (c) the metadata is/was on the disconnected side

and even then, you really really didn't want NFS "intr" to be on a
core filesystem.

With hotplug devices, you have some "interesting" issues in addition,
namely making sure you really connect it back to the right disk, and
don't re-use *anything* in case it turns out it's not the same one.
Even for the "simple" USB case, you'll have serious issues with the
serial numbers not being reliable (maybe things are better now, but it
used to be that the supposedly "unique" USB serial number wasn't
unique at all).
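
As a rough illustration of the "right disk" problem, the sketch below
refuses to replay anything unless several independent attributes all
match, rather than trusting the serial number alone. The `DeviceIdentity`
fields and the `safe_to_replay` helper are hypothetical names invented for
this example; they are not an existing kernel or udev interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceIdentity:
    """Attributes recorded when the device was last seen (all hypothetical)."""
    usb_serial: str
    capacity_bytes: int
    fs_uuid: str


def safe_to_replay(before: DeviceIdentity, after: DeviceIdentity) -> bool:
    # A matching serial is not enough: every recorded attribute must agree,
    # otherwise nothing cached for the old device may be reused.
    return before == after


original = DeviceIdentity("0123456789AB", 16_000_000_000, "uuid-A")
same_stick = DeviceIdentity("0123456789AB", 16_000_000_000, "uuid-A")
clone_serial = DeviceIdentity("0123456789AB", 16_000_000_000, "uuid-B")
```

Here `safe_to_replay(original, clone_serial)` is false even though the
serial numbers are identical. Even a full match is only evidence, not
proof, so a real design would still need a policy for what to throw away
when in doubt.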

So I think it would be a good addition, but people should realize that
the current behavior is there for some pretty fundamental reasons.

               Linus

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
  2018-09-16 10:53 ` Hannes Reinecke
  2018-09-16 16:03 ` Laurent Pinchart
@ 2018-09-16 16:37 ` James Bottomley
  2018-09-16 19:25   ` Theodore Y. Ts'o
  2018-09-16 23:58 ` David Howells
  3 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2018-09-16 16:37 UTC (permalink / raw)
  To: Matthew Wilcox, ksummit-discuss

On Fri, 2018-09-14 at 18:28 +0100, Matthew Wilcox wrote:
> We've all pulled the wrong drive out of a machine or unplugged a USB
> key before the write back has completely finished. You try to plug it
> back in, but the damage is done. The pending writes are lost, the
> filesystem is damaged and full of errors and you are having a Bad
> Day. What if ... plugging the drive back in could be made to work?

For a lot of modern external storage devices this simply can't be made
to work.  The reason is they all have an internal write back cache to
make operations faster and if they're SATA they may lie about it and if
they're USB they always lie about it.  For these devices we have a set
of writes that we think are completed but in fact only hit the device
cache.  When you pulled it out, the cache was lost and so were these
writes.  This is unfixable on the host side unless there's some way we
can get the device to tell us it has a write back cache and behave
correctly with regard to flushes.

Even for devices that behave correctly, we currently have no real way
to repeat the I/O that was lost in the powered down cache, unless you
have a way to cope with this case (it doesn't seem to be accounted for
in your plan)?  The reason is we use barrier type caches which assume
everything behind them is available to the device (either on disk or in
the cache).  The block layer would need some way to replay I/Os (in
order) from the last barrier because some of them might have been lost
from the cache.

Provided we have write through caches (not a given), the lower layer
error handling will mostly take care of repeating the lost but
unacknowledged I/O provided you preserve the queue, so I agree that
part can work, but the big thing is having a write through cache.
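
To make the write back versus write through distinction concrete, here is
a toy model, not kernel code. The `Device` class, its methods and the
`lost_after_unplug` helper are invented for illustration, and the model
assumes the device acknowledges every write immediately.

```python
class Device:
    """Toy disk with an optional volatile write-back cache (hypothetical model)."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.media = {}   # blocks that survive power loss
        self.cache = {}   # blocks that disappear on power loss

    def write(self, lba, data):
        # The host sees success in both modes; only the destination differs.
        if self.write_back:
            self.cache[lba] = data
        else:
            self.media[lba] = data
        return True

    def flush(self):
        # An honest CACHE FLUSH moves cached blocks to stable media.
        self.media.update(self.cache)
        self.cache.clear()

    def unplug(self):
        # Pulling the device powers down the cache and drops its contents.
        self.cache.clear()


def lost_after_unplug(write_back):
    """Return the LBAs that were acknowledged but are missing from the media."""
    dev = Device(write_back)
    acked = {}
    for lba in range(4):
        data = f"block-{lba}"
        if dev.write(lba, data):
            acked[lba] = data
    dev.unplug()
    return sorted(lba for lba in acked if dev.media.get(lba) != acked[lba])
```

With `write_back=True` every acknowledged block is reported as lost, and
the host has no record that it needs to resend them; with
`write_back=False` nothing acknowledged is lost.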

James

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-16 16:37 ` James Bottomley
@ 2018-09-16 19:25   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2018-09-16 19:25 UTC (permalink / raw)
  To: James Bottomley; +Cc: ksummit-discuss

On Sun, Sep 16, 2018 at 09:37:50AM -0700, James Bottomley wrote:
> 
> For a lot of modern external storage devices this simply can't be made
> to work.  The reason is they all have an internal write back cache to
> make operations faster and if they're SATA they may lie about it and if
> they're USB they always lie about it.  For these devices we have a set
> of writes that we think are completed but in fact only hit the device
> cache.  When you pulled it out, the cache was lost and so were these
> writes.  This is unfixable on the host side unless there's some way we
> can get the device to tell us it has a write back cache and behave
> correctly with regard to flushes.
> 
> Even for devices that behave correctly, we currently have no real way
> to repeat the I/O that was lost in the powered down cache, unless you
> have a way to cope with this case (it doesn't seem to be accounted for
> in your plan)?  The reason is we use barrier type caches which assume
> everything behind them is available to the device (either on disk or in
> the cache).  The block layer would need some way to replay I/Os (in
> order) from the last barrier because some of them might have been lost
> from the cache.
> 
> Provided we have write through caches (not a given), the lower layer
> error handling will mostly take care of repeating the lost but
> unacknowledged I/O provided you preserve the queue, so I agree that
> part can work, but the big thing is having a write through cache.

The way I'd suggest approaching this is not by making any changes in
the block layer at all.  Instead, I'd suggest putting all of the magic
into a device-mapper device.  The device mapper driver would be
responsible for keeping a copy of all blocks written to the removable
device in kernel memory.  This would work much like the TCP retransmit
buffers; which is to say, until we are *sure* that we don't need to
retransmit the writes to the device, we have to keep a copy in
non-swappable kernel memory.

To avoid overflowing all available memory, there must be a
configurable cap on the maximum memory that can be used for retransmit
buffers, and the driver must periodically send a CACHE FLUSH command to
the block device; buffer space can be freed once the device has
acknowledged the CACHE FLUSH command.

If the real device ever gets yanked, this ends up disconnecting the
block device from the device-mapper pseudo-device.  When the USB thumb
drive gets plugged back in, userspace would be responsible for
determining that it is the previously attached external device, and
reconnecting it to the device-mapper device, which can then replay all
of the blocks in the retransmit buffers.

Advantages of this strategy:

* No need to make changes to the block device layer

* The overhead of dealing with removable devices can be avoided for
  devices which are non-removable.

* It deals with devices that don't support a write through cache; it
  only requires devices that don't lie about supporting CACHE FLUSH
  correctly.
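
The retransmit-buffer idea above can be sketched as a small userspace
simulation. This is not device-mapper code: `Disk`, `RetransmitTarget`,
`max_buffered` and `demo` are hypothetical names, the cap counts blocks
rather than bytes, and the disk is assumed to honour flushes.

```python
class Disk:
    """Toy removable disk whose volatile cache honours flush (assumption)."""

    def __init__(self):
        self.media = {}   # durable blocks
        self.cache = {}   # lost when the drive is yanked

    def write(self, lba, data):
        self.cache[lba] = data

    def flush(self):
        self.media.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        self.cache.clear()


class RetransmitTarget:
    """Keeps a copy of every write until a flush proves it is durable."""

    def __init__(self, disk, max_buffered=8):
        self.disk = disk                  # None while the device is detached
        self.max_buffered = max_buffered  # configurable cap on buffered writes
        self.pending = []                 # (lba, data) in submission order

    def write(self, lba, data):
        # At the cap, try a flush to free space; if that fails, push back.
        if len(self.pending) >= self.max_buffered and not self.flush():
            raise BufferError("retransmit buffer full; writer must wait")
        self.pending.append((lba, data))
        if self.disk is not None:
            self.disk.write(lba, data)

    def flush(self):
        if self.disk is None:
            return False
        self.disk.flush()
        self.pending.clear()   # acknowledged flush: copies no longer needed
        return True

    def disconnect(self):
        self.disk = None

    def reconnect(self, disk):
        # Caller (userspace) has already decided this is the same device.
        self.disk = disk
        for lba, data in self.pending:
            disk.write(lba, data)   # replay in the original order
        return self.flush()


def demo():
    disk = Disk()
    dm = RetransmitTarget(disk, max_buffered=4)
    dm.write(0, "a")
    dm.write(1, "b")
    dm.flush()            # blocks 0 and 1 are durable, buffer is empty
    dm.write(2, "c")      # so far only in the device cache
    disk.power_loss()     # drive yanked: cache contents are gone
    dm.disconnect()
    dm.write(3, "d")      # accepted and buffered while detached
    dm.reconnect(disk)
    return disk.media
```

In `demo()`, block 2 is lost from the device cache when the drive is
pulled, yet all four blocks end up on the media after `reconnect()`
replays the buffer. The `BufferError` path stands in for blocking the
writer once the cap is reached with no device to flush to.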

						- Ted

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-14 17:28 [Ksummit-discuss] [TECH TOPIC] Project Banbury Matthew Wilcox
                   ` (2 preceding siblings ...)
  2018-09-16 16:37 ` James Bottomley
@ 2018-09-16 23:58 ` David Howells
  3 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2018-09-16 23:58 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: ksummit-discuss

Matthew Wilcox <willy6545@gmail.com> wrote:

> We've all pulled the wrong drive out of a machine or unplugged a USB key
> before the write back has completely finished. You try to plug it back in,
> but the damage is done.  The pending writes are lost, the filesystem is
> damaged and full of errors and you are having a Bad Day. What if
> ... plugging the drive back in could be made to work?

Is this something fscache could be made to help with?  Though that might be
more at a filesystem level than a blockdev level.

David

* Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
  2018-09-16 12:45   ` Matthew Wilcox
@ 2018-09-18  8:17     ` Hannes Reinecke
  0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2018-09-18  8:17 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: ksummit-discuss

On 09/16/2018 02:45 PM, Matthew Wilcox wrote:
> Unfortunately I won't be in Edinburgh. My understanding was that the
> tech topics are for Plumbers. Since the user space component is
> probably the more interesting and complex component, Plumbers is
> probably the better conference for a discussion anyway.
> 
Au contraire.

ATM most filesystems are not able to report any I/O error directly to
the application due to internal caching. An I/O error will only ever be
seen once the cached data is written to disk, but by then the fs has
already acknowledged the write to the application, so there is literally
no way we could signal the I/O error (other than setting the fs
read-only).

So from that it might be possible to switch to an asynchronous model of
signalling I/O errors much like DAX does nowadays.
I.e., we could signal errors to the filesystem via asynchronous methods,
and do away with the current synchronous model.
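
A minimal sketch of what such asynchronous signalling might look like,
as a userspace toy rather than a kernel interface: the write is
acknowledged from the cache, and a failure during later writeback is
delivered through a callback. `CachedFile`, `on_error` and `demo` are
hypothetical names chosen for this example.

```python
import errno


class CachedFile:
    """Toy page cache: writes are acknowledged before any device I/O."""

    def __init__(self, backend_write, on_error):
        self.backend_write = backend_write  # callable(offset, data); may raise OSError
        self.on_error = on_error            # hypothetical asynchronous error hook
        self.dirty = {}

    def write(self, offset, data):
        self.dirty[offset] = data
        return len(data)   # success is reported to the application right here

    def writeback(self):
        for offset in sorted(self.dirty):
            data = self.dirty[offset]
            try:
                self.backend_write(offset, data)
            except OSError as exc:
                # Too late to fail write(); notify asynchronously, keep data dirty.
                self.on_error(offset, exc)
            else:
                del self.dirty[offset]


def demo():
    errors = []

    def failing_backend(offset, data):
        raise OSError(errno.EIO, "Input/output error")

    f = CachedFile(failing_backend, lambda off, exc: errors.append((off, exc.errno)))
    acked = f.write(0, b"hello")
    f.writeback()
    return acked, errors, dict(f.dirty)
```

In `demo()` the application is told that five bytes were written, the
error arrives only when `writeback()` runs, and the data stays dirty so
it could be retried once the device returns.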

I guess some further discussion is warranted here.
Sadly I can't go to Vancouver, so we'll need to find another venue; LSF
next year?

Cheers,

Hannes
