ksummit.lists.linux.dev archive mirror
From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: ksummit-discuss@lists.linuxfoundation.org
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Project Banbury
Date: Sun, 16 Sep 2018 15:25:13 -0400
Message-ID: <20180916192513.GA3575@thunk.org>
In-Reply-To: <1537115870.3056.1.camel@HansenPartnership.com>

On Sun, Sep 16, 2018 at 09:37:50AM -0700, James Bottomley wrote:
> 
> For a lot of modern external storage devices this simply can't be made
> to work.  The reason is they all have an internal write back cache to
> make operations faster and if they're SATA they may lie about it and if
> they're USB they always lie about it.  For these devices we have a set
> of writes that we think are completed but in-fact only hit the device
> cache.  When you pulled it out, the cache was lost and so were these
> writes.  This is unfixable on the host side unless there's some way we
> can get the device to tell us it has a write back cache and behave
> correctly with regard to flushes.
> 
> Even for devices that behave correctly, we currently have no real way
> to repeat the I/O that was lost in the powered down cache, unless you
> have a way to cope with this case (it doesn't seem to be accounted for
> in your plan)?  The reason is we use barrier type caches which assume
> everything behind them is available to the device (either on disk or in
> the cache).  The block layer would need some way to replay I/Os (in
> order) from the last barrier because some of them might have been lost
> from the cache.
> 
> Provided we have write through caches (not a given), the lower layer
> error handling will mostly take care of repeating the lost but
> unacknowledged I/O provided you preserve the queue, so I agree that
> part can work, but the big thing is having a write through cache.

The way I'd suggest approaching this is not by making any changes in
the block layer at all.  Instead, I'd suggest putting all of the magic
into a device-mapper device.  The device-mapper driver would be
responsible for keeping a copy of all blocks written to the removable
device in kernel memory.  This would work much like the TCP retransmit
buffers: until we are *sure* that we don't need to retransmit the
writes to the device, we have to keep a copy in non-swappable kernel
memory.

To avoid exhausting all available memory, there must be a
configurable cap on the maximum memory that can be used for retransmit
buffers, and the driver must periodically send a CACHE FLUSH command
to the block device; buffer space can be freed once the device has
acknowledged the CACHE FLUSH command.
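
To make the buffering and the cap concrete, here is a rough user-space
sketch in Python.  It is not kernel or device-mapper code, and every
name in it (FakeDevice, RetransmitBuffer, max_bytes, BLOCK_SIZE) is
made up for illustration.  It assumes the device honors CACHE FLUSH,
and it flushes only when the next write would exceed the cap; a real
driver would presumably also flush on a timer.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class FakeDevice:
    """Stand-in for a removable device with a volatile write-back cache."""

    def __init__(self):
        self.cache = {}  # writes acknowledged but not yet durable
        self.media = {}  # writes made durable by a CACHE FLUSH

    def write(self, lba, data):
        self.cache[lba] = data

    def flush(self):
        # Models an honest CACHE FLUSH: cached writes become durable.
        self.media.update(self.cache)
        self.cache.clear()


class RetransmitBuffer:
    """Keeps a copy of every write until a flush has been acknowledged."""

    def __init__(self, device, max_bytes):
        self.device = device
        self.max_bytes = max_bytes      # configurable cap on buffered data
        self.pending = OrderedDict()    # lba -> data, in write order
        self.flushes = 0

    def pending_bytes(self):
        return sum(len(data) for data in self.pending.values())

    def write(self, lba, data):
        # If this write would exceed the cap, flush first to free space.
        if self.pending_bytes() + len(data) > self.max_bytes:
            self.flush()
        self.pending.pop(lba, None)     # a rewrite supersedes the old copy
        self.pending[lba] = data
        self.device.write(lba, data)

    def flush(self):
        # Copies may be dropped only after the flush has completed.
        self.device.flush()
        self.pending.clear()
        self.flushes += 1
```

With a cap of two blocks, a third write forces one flush, after which
only that third block is still held in the buffer.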

If the real device ever gets yanked, this ends up disconnecting the
block device from the device-mapper pseudo-device.  When the USB thumb
drive gets plugged back in, userspace would be responsible for
determining that it is the previously attached external device, and
reconnecting it to the device-mapper device, which can then replay all
of the blocks in the retransmit buffers.
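
The yank-and-replay sequence could look roughly like the following
user-space Python sketch.  Again, this is not device-mapper code and
the names (Device, ReplayTarget, serial, reattach) are hypothetical.
The serial-number comparison merely stands in for whatever check
userspace would use to recognize the device, and queueing writes while
the device is detached is an assumption of the sketch.

```python
class Device:
    """Stand-in for a removable device with a volatile write-back cache."""

    def __init__(self, serial):
        self.serial = serial  # hypothetical identity used for matching
        self.cache = {}       # acknowledged but not yet durable
        self.media = {}       # durable contents

    def write(self, lba, data):
        self.cache[lba] = data

    def flush(self):
        self.media.update(self.cache)
        self.cache.clear()

    def yank(self):
        # Power loss: anything that only reached the cache is gone.
        self.cache.clear()


class ReplayTarget:
    """Pseudo-device that can replay unflushed writes after a reconnect."""

    def __init__(self, device):
        self.device = device
        self.expected_serial = device.serial
        self.pending = []     # (lba, data) in order, since the last flush

    def write(self, lba, data):
        self.pending.append((lba, data))
        # Assumption: while detached, writes are simply queued.
        if self.device is not None:
            self.device.write(lba, data)

    def flush(self):
        if self.device is None:
            raise IOError("device is detached")
        self.device.flush()
        self.pending.clear()

    def detach(self):
        self.device = None

    def reattach(self, device):
        # Stand-in for the userspace check that this is the same device.
        if device.serial != self.expected_serial:
            raise ValueError("not the previously attached device")
        self.device = device
        for lba, data in self.pending:  # replay in original order
            device.write(lba, data)
        self.flush()
```

Writes issued after the last flush are lost by the device when it is
yanked, but the pseudo-device still holds them and rewrites them in
order once the same device is reattached.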

Advantages of this strategy:

* No need to make changes to the block device layer

* The overhead of dealing with removable devices can be avoided for
  devices which are non-removable.

* It deals with devices that don't support a write-through cache; it
  only requires that the device not lie about supporting CACHE FLUSH.

						- Ted
