linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jeff Moyer <jmoyer@redhat.com>,
	"Verma, Vishal L" <vishal.l.verma@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"hch@infradead.org" <hch@infradead.org>,
	"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
	"linux-nvdimm@ml01.01.org" <linux-nvdimm@ml01.01.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"axboe@fb.com" <axboe@fb.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"Wilcox, Matthew R" <matthew.r.wilcox@intel.com>,
	"jack@suse.cz" <jack@suse.cz>
Subject: Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io
Date: Mon, 2 May 2016 16:25:51 -0700	[thread overview]
Message-ID: <CAPcyv4jDTvSUDGTBZb0MaK_gKxMxWtMecnR_OjLzim1Sdg5Y9g@mail.gmail.com> (raw)
In-Reply-To: <20160502230422.GQ26977@dastard>

On Mon, May 2, 2016 at 4:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>>
>> > On Mon, Apr 25, 2016 at 11:53:13PM +0000, Verma, Vishal L wrote:
>> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
>> > You're assuming that only the DAX aware application accesses it's
>> > files.  users, backup programs, data replicators, fileystem
>> > re-organisers (e.g.  defragmenters) etc all may access the files and
>> > they may throw errors. What then?
>>
>> I'm not sure how this is any different from regular storage.  If an
>> application gets EIO, it's up to the app to decide what to do with that.
>
> Sure - they'll fail. But the question I'm asking is that if the
> application that owns the data is supposed to do error recovery,
> what happens when a 3rd party application hits an error? If that
> consumes the error, the the app that owns the data won't ever get a
> chance to correct the error.
>
> This is a minefield - a 3rd party app that swallows and clears DAX
> based IO errors is a data corruption vector. can yo imagine if
> *grep* did this? The model that is being promoted here effectively
> allows this sort of behaviour - I don't really think we
> should be architecting an error recovery strategy that has the
> capability to go this wrong....

Since when does grep write to a file on error?

>
>> >> > Where does the application find the data that was lost to be able to
>> >> > rewrite it?
>> >>
>> >> The data that was lost is gone -- this assumes the application has some
>> >> ability to recover using a journal/log or other redundancy - yes, at the
>> >> application layer. If it doesn't have this sort of capability, the only
>> >> option is to restore files from a backup/mirror.
>> >
>> > So the architecture has a built in assumption that only userspace
>> > can handle data loss?
>>
>> Remember that the proposed programming model completely bypasses the
>> kernel, so yes, it is expected that user-space will have to deal with
>> the problem.
>
> No, it doesn't completely bypass the kernel - the kernel is the
> infrastructure that catches the errors in the first place, and it
> owns and controls all the metadata that corresponds to the physical
> location of that error. The only thing the kernel doesn't own is the
> *contents* of that location.
>
>> > What about filesytsems like NOVA, that use log structured design to
>> > provide DAX w/ update atomicity and can potentially also provide
>> > redundancy/repair through the same mechanisms? Won't pmem native
>> > filesystems with built in data protection features like this remove
>> > the need for adding all this to userspace applications?
>>
>> I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
>> with this for existing file systems, right?
>
> Yes, but that misses my point that it seems that the design is only
> focussed on userspace and existing filesystems and there is no
> consideration of kernel side functionality that could do transparent
> recovery....
>
>> > If so, shouldn't that be the focus of development rahter than
>> > placing the burden on userspace apps to handle storage repair
>> > situations?
>>
>> It really depends on the programming model.  In the model Vishal is
>> talking about, either the applications themselves or the libraries they
>> link to are expected to implement the redundancies where necessary.
>
> IOWs, filesystems no longer have any control over data integrity.
> Yet it's the filesystem developers who will still be responsible for
> data integrity and when the filesystem has a data coruption event
> we'll get blamed and the filesystem gets a bad name, even though
> it's entirely the applications fault. We've seen this time and time
> again - application developers cannot be trusted to guarantee data
> integrity. yes, some apps will be fine, but do you really expect
> application devs that refuse to use fsync because it's too slow are
> going to have a different approach to integrity when it comes to
> DAX?

Yes, completely agree.  The applications that will implement competent
error recovery with these mechanisms will be vanishingly small, and
there is definite room for a kernel data-redundancy solution that
builds on these patches.

>
>> >> > There's an implicit assumption that applications will keep redundant
>> >> > copies of their data at the /application layer/ and be able to
>> >> > automatically repair it?
>>
>> That's one way to do things.  It really depends on the application what
>> it will do for recovery.
>>
>> >> > And then there's the implicit assumption that it will unlink and
>> >> > free the entire file before writing a new copy
>>
>> I think Vishal was referring to restoring from backup.  cp itself will
>> truncate the file before overwriting, iirc.
>
> Which version of cp? what happens if they use --sparse and the error
> is in a zeroed region? There's so many assumptions about undefined userspace
> environment, application and user behaviour being made here, and
> it's all being handwaved away.
>
> I'm asking for this to be defined, demonstrated and documented as a
> working model that cannot be abused and doesn't have holes the size
> of trucks in it, not handwaving...

You lost me...  how are these patches abusing the existing semantics
of -EIO and write to clear?

>> >> To summarize, the two cases we want to handle are:
>> >> 1. Application has inbuilt recovery:
>> >>   - hits badblock
>> >>   - figures out it is able to recover the data
>> >>   - handles SIGBUS or EIO
>> >>   - does a (sector aligned) write() to restore the data
>> >
>> > The "figures out" step here is where >95% of the work we'd have to
>> > do is. And that's in filesystem and block layer code, not
>> > userspace, and userspace can't do that work in a signal handler.
>> > And it  can still fall down to the second case when the application
>> > doesn't have another copy of the data somewhere.
>>
>> I read that "figures out" step as the application determining whether or
>> not it had a redundant copy.
>
> Another undocumented assumption, that doesn't simplify what needs to
> be done. Indeed, userspace can't do that until it is in SIGBUS
> context, which tends to imply applications need to do a major amount
> of work from within the signal handler....
>
>> > FWIW, we don't have a DAX enabled filesystem that can do
>> > reverse block mapping, so we're a year or two away from this being a
>> > workable production solution from the filesystem perspective. And
>> > AFAICT, it's not even on the roadmap for dm/md layers.
>>
>> Do we even need that?  What if we added an FIEMAP flag for determining
>> bad blocks.
>
> So you're assuming that the filesystem has been informed of the bad
> blocks and has already marked the bad regions of the file in it's
> extent list?
>
> How does that happen? What mechanism is used for the underlying
> block device to inform the filesytem that theirs a bad LBA, and how
> does the filesytem the map that to a path/file/offset with reverse
> mapping? Or is there some other magic that hasn't been explained
> happening here?

In 4.5 we added this:

commit 99e6608c9e7414ae4f2168df8bf8fae3eb49e41f
Author: Vishal Verma <vishal.l.verma@intel.com>
Date:   Sat Jan 9 08:36:51 2016 -0800

    block: Add badblock management for gendisks

    NVDIMM devices, which can behave more like DRAM rather than block
    devices, may develop bad cache lines, or 'poison'. A block device
    exposed by the pmem driver can then consume poison via a read (or
    write), and cause a machine check. On platforms without machine
    check recovery features, this would mean a crash.

    The block device maintaining a runtime list of all known sectors that
    have poison can directly avoid this, and also provide a path forward
    to enable proper handling/recovery for DAX faults on such a device.

    Use the new badblock management interfaces to add a badblocks list to
    gendisks.

    Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2016-05-02 23:25 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-30  1:59 [PATCH v2 0/5] dax: handling of media errors Vishal Verma
2016-03-30  1:59 ` [PATCH v2 1/5] block, dax: pass blk_dax_ctl through to drivers Vishal Verma
2016-03-30  4:19   ` kbuild test robot
2016-04-15 14:55   ` Jeff Moyer
2016-03-30  1:59 ` [PATCH v2 2/5] dax: fallback from pmd to pte on error Vishal Verma
2016-04-15 14:55   ` Jeff Moyer
2016-03-30  1:59 ` [PATCH v2 3/5] dax: enable dax in the presence of known media errors (badblocks) Vishal Verma
2016-04-15 14:56   ` Jeff Moyer
2016-03-30  1:59 ` [PATCH v2 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors Vishal Verma
2016-04-15 15:18   ` Jeff Moyer
2016-03-30  1:59 ` [PATCH v2 5/5] dax: handle media errors in dax_do_io Vishal Verma
2016-03-30  3:00   ` kbuild test robot
2016-03-30  6:34   ` Christoph Hellwig
2016-03-30  6:54     ` Vishal Verma
2016-03-30  6:56       ` Christoph Hellwig
2016-04-15 16:11   ` Jeff Moyer
2016-04-15 16:54     ` Verma, Vishal L
2016-04-15 17:11       ` Jeff Moyer
2016-04-15 17:37         ` Verma, Vishal L
2016-04-15 17:57           ` Dan Williams
2016-04-15 18:06             ` Jeff Moyer
2016-04-15 18:17               ` Dan Williams
2016-04-15 18:24                 ` Jeff Moyer
2016-04-15 18:56                   ` Dan Williams
2016-04-15 19:13                     ` Jeff Moyer
2016-04-15 19:01                 ` Toshi Kani
2016-04-15 19:08                   ` Toshi Kani
2016-04-20 20:59     ` Christoph Hellwig
2016-04-23 18:08       ` Verma, Vishal L
2016-04-25  8:31         ` hch
2016-04-25 15:32           ` Jeff Moyer
2016-04-26  8:32             ` hch
2016-04-25 17:14           ` Verma, Vishal L
2016-04-25 17:21             ` Dan Williams
2016-04-25 23:25             ` Dave Chinner
2016-04-25 23:34               ` Darrick J. Wong
2016-04-25 23:43               ` Dan Williams
2016-04-26  0:11                 ` Dave Chinner
2016-04-26  1:45                   ` Dan Williams
2016-04-26  2:56                     ` Dave Chinner
2016-04-26  4:18                       ` Dan Williams
2016-04-26  8:27                         ` Dave Chinner
2016-04-26 14:59                           ` Dan Williams
2016-04-26 15:31                             ` Jan Kara
2016-04-26 17:16                               ` Dan Williams
2016-04-25 23:53               ` Verma, Vishal L
2016-04-26  0:41                 ` Dave Chinner
2016-04-26 14:58                   ` Vishal Verma
2016-05-02 15:18                   ` Jeff Moyer
2016-05-02 17:53                     ` Dan Williams
2016-05-03  0:42                       ` Dave Chinner
2016-05-03  1:26                         ` Rudoff, Andy
2016-05-03  2:49                           ` Dave Chinner
2016-05-03 18:30                             ` Rudoff, Andy
2016-05-04  1:36                               ` Dave Chinner
2016-05-02 23:04                     ` Dave Chinner
2016-05-02 23:17                       ` Verma, Vishal L
2016-05-02 23:25                       ` Dan Williams [this message]
2016-05-03  1:51                         ` Dave Chinner
2016-05-03 17:28                           ` Dan Williams
2016-05-04  3:18                             ` Dave Chinner
2016-05-04  5:05                               ` Dan Williams
2016-04-26  8:33             ` hch
2016-04-26 15:01               ` Vishal Verma

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4jDTvSUDGTBZb0MaK_gKxMxWtMecnR_OjLzim1Sdg5Y9g@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@fb.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=jmoyer@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=matthew.r.wilcox@intel.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=vishal.l.verma@intel.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox