From: "Rudoff, Andy" <andy.rudoff@intel.com>
To: Dave Chinner <david@fromorbit.com>,
"Williams, Dan J" <dan.j.williams@intel.com>
Cc: "hch@infradead.org" <hch@infradead.org>,
"jack@suse.cz" <jack@suse.cz>, "axboe@fb.com" <axboe@fb.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"linux-nvdimm@ml01.01.org" <linux-nvdimm@ml01.01.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
"Wilcox, Matthew R" <matthew.r.wilcox@intel.com>
Subject: Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io
Date: Tue, 3 May 2016 01:26:46 +0000
Message-ID: <D26BCF92-ED25-4ACA-9CC8-7B1C05A1D5FC@intel.com>
In-Reply-To: <20160503004226.GR26977@dastard>

>> The takeaway is that msync() is 9-10x slower than userspace cache management.
>
>An alternative viewpoint: that flushing clean cachelines is
>extremely expensive on Intel CPUs. ;)
>
>i.e. Same numbers, different analysis from a different PoV, and
>that gives a *completely different conclusion*.
>
>Think about it for the moment. The hardware inefficiency being
>demonstrated could be fixed/optimised in the next hardware product
>cycle(s) and so will eventually go away. OTOH, we'll be stuck with
>whatever programming model we come up with for the next 30-40 years,
>and we'll never be able to fix flaws in it because applications will
>be depending on them. Do we really want to be stuck with a pmem
>model that is designed around the flaws and deficiencies of ~1st
>generation hardware?

Hi Dave,

Not sure I agree with your completely different conclusion.  (Not sure
I completely disagree either, but please let me raise some practical
points.)

First of all, let's say you're completely right and flushing clean
cache lines is extremely expensive.  So your solution is to wait for
the chip to be fixed?  Remember, the model we're putting forward (which
we're working on documenting, because I fully agree with the point you
keep raising about the lack of documentation) requires the application
to ASK for the file system's permission before assuming that flushing
from user space is sufficient for persistence.  So that doesn't stick
us with 30-40 years of a flawed model.  I don't think the model is
wrong, having spent lots of research time on it, but if I'm full of
crap, all we have to do is stop telling the app that flushing from user
space is allowed, and apps go back to using msync().  This is my
understanding of what Dan suggested at LSF, and it is what I'm
currently writing up.  By the way, the NVM Libraries already contain
the logic to ask whether flushing from user space is allowed, falling
back to msync() if not; a minimal sketch of that pattern appears below.
Currently those libraries check for DAX mappings, but the points you
raised about metadata changes happening during page faults made us
realize we have to ask the file system to opt in to allowing user-space
flushing, so that's what we're changing the library to do.  See, we are
listening :-)
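
To make that concrete, here is a minimal sketch of the fallback pattern
using the libpmem API from the NVM Libraries (pmem_is_pmem(),
pmem_persist(), and pmem_msync() are real libpmem calls; mapping setup
and error handling are elided):

#include <libpmem.h>
#include <string.h>

/*
 * Copy data into a pmem-mapped file and make it persistent.  The
 * is_pmem flag is the answer from pmem_is_pmem() on the mapped range,
 * computed once, right after mapping.
 */
static void
store_and_persist(void *dst, const void *src, size_t len, int is_pmem)
{
	memcpy(dst, src, len);
	if (is_pmem)
		pmem_persist(dst, len);	/* flush CPU caches from user space */
	else
		pmem_msync(dst, len);	/* fall back to msync(2) semantics */
}

Today pmem_is_pmem() answers based on the mapping itself; the file
system opt-in described above changes what feeds that answer, not the
application-side pattern.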

Anyway, I doubt that flushing a clean cache line is extremely
expensive.  Remember, the code is building transactions to maintain a
consistent in-memory data structure in the face of sudden failure like
power loss.  So it is using the flushes to create store barriers: not
the block-based store barriers we're used to in the storage world, but
cache-line-sized ones (usually covering multiple cache lines, though
most commonly totaling less than 4k).  So I think when you turn a cache
line flush into an msync(), you're seeing some dirty data get flushed
before it is time to flush it.  I'm not sure, though; certainly we
could spend more time testing & measuring.
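
For reference, here is roughly what such a cache-line-granularity
barrier looks like in user space on current Intel CPUs.  This is a
sketch, not the exact NVM Library code, and it assumes CLFLUSHOPT
support (a real library would probe CPUID and fall back to CLFLUSH,
which is ordered against other CLFLUSHes and so needs no fence):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE	64

/*
 * Flush the cache lines covering [addr, addr + len) and fence, giving
 * a store barrier at cache-line rather than block granularity.
 * Compile with -mclflushopt.
 */
static void
flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE - 1);

	for (; p < (uintptr_t)addr + len; p += CACHELINE)
		_mm_clflushopt((void *)p);
	_mm_sfence();	/* order the flushes before later stores */
}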

More importantly, I think the interesting question is what we want the
pmem programming model to be long-term.  I think we want applications
to just map pmem, do normal stores to it, and assume they are
persistent.  This is quite different from the 30-year-old POSIX model
where msync() is required, but I think it is cleaner, easier to
understand, and less error-prone.  So why doesn't it work that way
right now?  Because we're finding it impractical: using write-through
caching for pmem simply doesn't perform well, and depending on the
platform to flush the CPU caches on shutdown/power failure is not
practical yet.  But I think the day will come when it is practical.

So given that long-term target, the idea is for an application to ask
whether msync() calls are required, or whether just flushing the CPU
caches is sufficient for persistence.  We're also adding an ACPI
property that allows software to discover whether the caches are
flushed automatically on shutdown/power loss.  Initially that will only
be true for custom platforms, but hopefully it will be available more
broadly in the future.  The result is that the programming model gets
simpler as more and more hardware requires less explicit flushing; a
sketch of the resulting decision cascade follows.
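
Putting the pieces together, an application (or the library on its
behalf) would pick a persistence method once per mapping, along these
lines.  Note platform_flushes_on_fail() is a hypothetical stand-in for
reading the ACPI property described above; it is not a real API today:

#include <libpmem.h>

/* Hypothetical: reports the ACPI flush-on-fail platform property. */
extern int platform_flushes_on_fail(void);

enum persist_method { PERSIST_NONE, PERSIST_CPU_FLUSH, PERSIST_MSYNC };

static enum persist_method
choose_persist_method(const void *addr, size_t len)
{
	if (platform_flushes_on_fail())
		return PERSIST_NONE;		/* stores alone are persistent */
	if (pmem_is_pmem(addr, len))
		return PERSIST_CPU_FLUSH;	/* user-space flush + fence */
	return PERSIST_MSYNC;			/* classic POSIX model */
}

As more hardware reports the flush-on-fail property, more applications
land in the PERSIST_NONE case and the explicit-flush code simply stops
running, which is exactly the simplification described above.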

Now I'll go back to writing up the big picture for this programming
model so I can ask you for comments on that as well...

-andy

Thread overview: 64+ messages in thread
2016-03-30 1:59 [PATCH v2 0/5] dax: handling of media errors Vishal Verma
2016-03-30 1:59 ` [PATCH v2 1/5] block, dax: pass blk_dax_ctl through to drivers Vishal Verma
2016-03-30 4:19 ` kbuild test robot
2016-04-15 14:55 ` Jeff Moyer
2016-03-30 1:59 ` [PATCH v2 2/5] dax: fallback from pmd to pte on error Vishal Verma
2016-04-15 14:55 ` Jeff Moyer
2016-03-30 1:59 ` [PATCH v2 3/5] dax: enable dax in the presence of known media errors (badblocks) Vishal Verma
2016-04-15 14:56 ` Jeff Moyer
2016-03-30 1:59 ` [PATCH v2 4/5] dax: use sb_issue_zerout instead of calling dax_clear_sectors Vishal Verma
2016-04-15 15:18 ` Jeff Moyer
2016-03-30 1:59 ` [PATCH v2 5/5] dax: handle media errors in dax_do_io Vishal Verma
2016-03-30 3:00 ` kbuild test robot
2016-03-30 6:34 ` Christoph Hellwig
2016-03-30 6:54 ` Vishal Verma
2016-03-30 6:56 ` Christoph Hellwig
2016-04-15 16:11 ` Jeff Moyer
2016-04-15 16:54 ` Verma, Vishal L
2016-04-15 17:11 ` Jeff Moyer
2016-04-15 17:37 ` Verma, Vishal L
2016-04-15 17:57 ` Dan Williams
2016-04-15 18:06 ` Jeff Moyer
2016-04-15 18:17 ` Dan Williams
2016-04-15 18:24 ` Jeff Moyer
2016-04-15 18:56 ` Dan Williams
2016-04-15 19:13 ` Jeff Moyer
2016-04-15 19:01 ` Toshi Kani
2016-04-15 19:08 ` Toshi Kani
2016-04-20 20:59 ` Christoph Hellwig
2016-04-23 18:08 ` Verma, Vishal L
2016-04-25 8:31 ` hch
2016-04-25 15:32 ` Jeff Moyer
2016-04-26 8:32 ` hch
2016-04-25 17:14 ` Verma, Vishal L
2016-04-25 17:21 ` Dan Williams
2016-04-25 23:25 ` Dave Chinner
2016-04-25 23:34 ` Darrick J. Wong
2016-04-25 23:43 ` Dan Williams
2016-04-26 0:11 ` Dave Chinner
2016-04-26 1:45 ` Dan Williams
2016-04-26 2:56 ` Dave Chinner
2016-04-26 4:18 ` Dan Williams
2016-04-26 8:27 ` Dave Chinner
2016-04-26 14:59 ` Dan Williams
2016-04-26 15:31 ` Jan Kara
2016-04-26 17:16 ` Dan Williams
2016-04-25 23:53 ` Verma, Vishal L
2016-04-26 0:41 ` Dave Chinner
2016-04-26 14:58 ` Vishal Verma
2016-05-02 15:18 ` Jeff Moyer
2016-05-02 17:53 ` Dan Williams
2016-05-03 0:42 ` Dave Chinner
2016-05-03 1:26 ` Rudoff, Andy [this message]
2016-05-03 2:49 ` Dave Chinner
2016-05-03 18:30 ` Rudoff, Andy
2016-05-04 1:36 ` Dave Chinner
2016-05-02 23:04 ` Dave Chinner
2016-05-02 23:17 ` Verma, Vishal L
2016-05-02 23:25 ` Dan Williams
2016-05-03 1:51 ` Dave Chinner
2016-05-03 17:28 ` Dan Williams
2016-05-04 3:18 ` Dave Chinner
2016-05-04 5:05 ` Dan Williams
2016-04-26 8:33 ` hch
2016-04-26 15:01 ` Vishal Verma