Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()

From: Theodore Ts'o <tytso@mit.edu>
To: Trond Myklebust <trondmy@primarydata.com>
Cc: "kwolf@redhat.com" <kwolf@redhat.com>,
	"jlayton@poochiereds.net" <jlayton@poochiereds.net>,
	"neilb@suse.com" <neilb@suse.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"riel@redhat.com" <riel@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"rwheeler@redhat.com" <rwheeler@redhat.com>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
Date: Wed, 25 Jan 2017 13:35:42 -0500
Message-ID: <20170125183542.557drncuktc5wgzy@thunk.org>
In-Reply-To: <1485228841.8987.1.camel@primarydata.com>

On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
> The reason why I'm thinking open() is because it has to be a contract
> between a specific application and the kernel. If the application
> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
> nasty non-POSIX timeout errors, even if there is another process that
> is using that flag on the same file.
> 
> The only place where that is difficult to manage is when the file is
> mmap()ed (no file descriptor), so you'd presumably have to disallow
> mixing mmap and O_TIMEOUT.

Well, technically there *is* a file descriptor when you do an mmap.
You can close the fd after you call mmap(), but the mmap bumps the
refcount on the struct file while the memory map is active.
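
To illustrate the point above, here is a minimal userspace sketch showing
that a mapping stays usable after the fd is closed.  The function name and
the scratch path under /tmp are assumptions made for the example, not
anything proposed in this thread.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if data written to a scratch file is still readable through
 * a shared mapping after the fd has been closed, 0 if it is not, and -1
 * if the setup itself failed.  The name and path are illustrative only.
 */
int mapping_survives_close(void)
{
	char path[] = "/tmp/mmap-demo-XXXXXX";	/* hypothetical scratch file */
	const char msg[] = "still mapped";
	char *map;
	int fd, ok;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the open file keeps the data alive */

	if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	close(fd);		/* the mapping holds its own reference */

	ok = (memcmp(map, msg, sizeof(msg)) == 0);
	munmap(map, sizeof(msg));
	return ok;
}
```

Calling mapping_survives_close() should return 1 on a normal Linux system,
since the mapping keeps the underlying open file alive until munmap().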

I would argue though that at least for buffered writes, the timeout
has to be a property of the underlying inode, and if there is an attempt
to set a timeout on an inode that already has a timeout set to some
other non-zero value, the "set timeout" operation should fail with a
"timeout already set" error.  That's because we really don't want to have to
keep track, on a per-page basis, which struct file was responsible for
dirtying a page --- and what if it is dirtied by two different file
descriptors?
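
The "set timeout" rule described above could be modelled roughly as
follows.  This is only a sketch under stated assumptions: struct demo_inode,
demo_set_timeout() and the choice of -EBUSY for "timeout already set" are
all hypothetical, and how a timeout would be cleared is left open.

```c
#include <errno.h>

/* Hypothetical stand-in for per-inode state; not a real kernel structure. */
struct demo_inode {
	unsigned int timeout_ms;	/* 0 means "no timeout set" */
};

/*
 * Accept the request if no timeout is set yet, or if the caller asks for
 * the value that is already in place.  Any other request on an inode with
 * a non-zero timeout is refused.  -EBUSY is an assumed error code.
 */
int demo_set_timeout(struct demo_inode *inode, unsigned int timeout_ms)
{
	if (inode->timeout_ms != 0 && inode->timeout_ms != timeout_ms)
		return -EBUSY;

	inode->timeout_ms = timeout_ms;
	return 0;
}
```

With this model, two openers that ask for the same value both succeed, and
a third that asks for a different value gets the error without disturbing
the existing setting.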

That being said, I suspect that for many applications, the timeout is
going to be *much* more interesting for O_DIRECT writes, and there we
can certainly have different timeouts on a per-fd basis.  This is
especially true for cases where the timeout is implemented in the storage
device, using multi-media extensions, and where the timeout might be
measured in milliseconds (e.g., no point reading a video frame if it's
been delayed too long).  That being said, the block layer would need to
know about this as well, since the timeout needs to be relative to
when the read(2) system call is issued, not to when it is finally
submitted to the storage device.
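
The arithmetic behind "relative to when the read(2) system call is issued"
can be sketched as below.  The function and parameter names are
hypothetical, and all times are assumed to be nanoseconds from the same
monotonic clock.

```c
#include <stdint.h>

/*
 * Budget left for the device at the moment the request is finally
 * submitted.  The deadline is anchored at syscall entry (issued_ns), so
 * time spent queued in the kernel is charged against the timeout.
 * Returns 0 if the deadline has already passed.
 */
uint64_t demo_remaining_ns(uint64_t issued_ns, uint64_t timeout_ns,
			   uint64_t submit_ns)
{
	uint64_t deadline_ns = issued_ns + timeout_ns;

	if (submit_ns >= deadline_ns)
		return 0;
	return deadline_ns - submit_ns;
}
```

For example, a read issued at t=1000 with a 5000 ns timeout that reaches
the device at t=4000 has only 2000 ns left to hand to the device, not the
full 5000.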

And if the process has suitable privileges, perhaps the I/O scheduler
should take the timeout into account, so that reads with a timeout
attached are submitted immediately, on the presumption that reads w/o a
timeout can afford to be queued.  If the process doesn't have suitable
privileges, or if its cgroup has exceeded its I/O quota, perhaps the right
answer would be to fail the read right away.  In the case of a cluster
file system, if a particular server knows it can't serve a
particular low latency read within the SLO, it might be worthwhile to
signal to the cluster file system client that it should start doing an
erasure code reconstruction right away (or read from one of the
mirrors if the file is stored with n=3 replication, etc.)
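
One possible reading of the scheduling policy floated above, written as a
small decision function.  Everything here (the struct, the enum, and the
exact ordering of the checks) is a hypothetical sketch rather than an
existing block-layer interface.

```c
#include <stdbool.h>

enum demo_action {
	DEMO_SUBMIT_NOW,	/* jump ahead of reads that can wait */
	DEMO_QUEUE,		/* normal queueing, no deadline attached */
	DEMO_FAIL_FAST		/* tell the caller right away */
};

/* Hypothetical summary of what the scheduler knows about one read. */
struct demo_read {
	bool has_timeout;
	bool privileged;
	bool cgroup_over_quota;
};

enum demo_action demo_classify(const struct demo_read *r)
{
	/* Reads without a timeout can afford to be queued. */
	if (!r->has_timeout)
		return DEMO_QUEUE;

	/* A deadline we are not allowed to honour: fail immediately. */
	if (!r->privileged || r->cgroup_over_quota)
		return DEMO_FAIL_FAST;

	return DEMO_SUBMIT_NOW;
}
```

A cluster file system client that sees the fail-fast result could use it as
the trigger to start reconstruction or to read from another replica.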

So depending on what the goals of userspace are, there are a number of
different kernel policies that might be the best match for the
particular application in question.  In particular, if you are trying
to provide low latency reads to assure decent response time for web
applications, it may be *reads* that are much more interesting for
timeout purposes rather than *writes*.

(Especially in a distributed system, you're going to be using some
kind of encoding with redundancy, so as long as enough of the writes
have completed, it doesn't matter if the other writes take a long time
--- although if you eventually decide that the write's never going to
make it, it's ideal if you can reshard the chunk more aggressively,
instead of waiting for the scrubbing pass to notice that some of the
redundant copies of the chunk had gotten corrupted or were never
written out.)
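
The "enough of the writes have completed" idea might look like this in a
client.  The names and the notion of a fixed "needed" count (for instance
k fragments out of n) are assumptions made for the sketch.

```c
enum demo_write_state {
	DEMO_WRITE_PENDING,	/* not enough fragments durable yet */
	DEMO_WRITE_DONE,	/* enough durable; stragglers may still land */
	DEMO_WRITE_DONE_REPAIR	/* enough durable, but a fragment was given up on */
};

/*
 * completed: fragment writes known to be durable
 * given_up:  fragment writes we have decided will never make it
 * needed:    minimum durable fragments for the chunk to be recoverable
 */
enum demo_write_state demo_write_status(unsigned int completed,
					unsigned int given_up,
					unsigned int needed)
{
	if (completed < needed)
		return DEMO_WRITE_PENDING;

	/* Durable already; a lost fragment means reshard now, not at scrub time. */
	return given_up > 0 ? DEMO_WRITE_DONE_REPAIR : DEMO_WRITE_DONE;
}
```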

Cheers,

					- Ted
