From: Andy Lutomirski <luto@kernel.org>
To: Boaz Harrosh <boaz@plexistor.com>,
Dan Williams <dan.j.williams@intel.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-nvdimm <linux-nvdimm@ml01.01.org>,
Matthew Wilcox <willy@linux.intel.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Dave Chinner <david@fromorbit.com>
Cc: Oleg Nesterov <oleg@redhat.com>, Mel Gorman <mgorman@suse.de>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-mm <linux-mm@kvack.org>, Arnd Bergmann <arnd@arndb.de>
Subject: Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag
Date: Thu, 10 Mar 2016 22:44:16 -0800
Message-ID: <56E26940.8020203@kernel.org>
In-Reply-To: <56C9EDCF.8010007@plexistor.com>

On 02/21/2016 09:03 AM, Boaz Harrosh wrote:
> Hi all
>
> Recent DAX code fixed the cl_flushing, i.e. durability, of mmap access
> of direct persistent-memory from applications. It uses the radix-tree
> per inode to track the indexes of a file that were page-faulted for
> write. Then at m/fsync time it would cl_flush these pages and clean
> the radix-tree, for the next round.
>
> Sigh, that is life, for legacy applications this is the price we must
> pay. But for NV-aware applications like the nvml library, we pay an extra
> price, even if we do not actually call m/fsync eventually. For these
> applications these extra resources, and especially the extra radix locking
> per page-fault, cost a lot, like x3 a lot.
>
> What we propose here is a way for those applications to enjoy the
> boost and still not sacrifice any correctness of legacy applications.
> Any concurrent access from legacy apps vs nv-aware apps even to the same
> file / same page, will work correctly.
>
> We do that by defining a new MMAP flag that is set by the nv-aware
> app. This flag is carried by the VMA. In the dax code we bypass any
> radix handling of the page if this flag is set. Those pages accessed *without*
> this flag will be added to the radix-tree, those with it will not.
> At m/fsync time if the radix tree is then empty nothing will happen.
>
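
For reference, the tracking scheme described in the quoted proposal can be modeled in user space roughly as follows. This is a toy sketch, not the kernel's DAX code: the per-inode radix tree is replaced by a plain array of dirty bits, and every name here (toy_inode, toy_write_fault, toy_fsync, the pmem_aware parameter) is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 64  /* toy file size in pages (arbitrary) */

/* Stand-in for the per-inode radix tree: one dirty bit per page index. */
struct toy_inode {
    bool dirty[NPAGES];
};

/* Model of a write fault on page idx. pmem_aware stands for the proposed
 * per-VMA flag: when it is set, the page is not tracked at all. */
static void toy_write_fault(struct toy_inode *ino, size_t idx, bool pmem_aware)
{
    if (idx >= NPAGES || pmem_aware)
        return;
    ino->dirty[idx] = true;
}

/* Model of m/fsync: visit every tracked page, "flush" it, and clear the
 * tracking state for the next round. Returns the number of pages flushed. */
static size_t toy_fsync(struct toy_inode *ino)
{
    size_t flushed = 0;
    for (size_t i = 0; i < NPAGES; i++) {
        if (ino->dirty[i]) {
            ino->dirty[i] = false;  /* real code would flush cache lines here */
            flushed++;
        }
    }
    return flushed;
}
```

In this model a page written only through a flagged mapping never shows up in toy_fsync(), which is the behavior the rest of this reply is concerned about.
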
I'm a little late to the party, but let me offer a variant that might be
considerably safer:

Add a flag MAP_DAX_WRITETHROUGH (name could be debated --
MAP_DAX_FASTFLUSH might be more architecture-neutral, but I'm only
familiar with the x86 semantics).

MAP_DAX_WRITETHROUGH does whatever is needed to ensure that writing
through the mapping and then calling fsync is both safe and fast. On
x86, it would (surprise, surprise!) map the pages writethrough and skip
adding them to the radix tree. fsync makes sure to do sfence before
pcommit.

This is totally safe. You *can't* abuse this to cause fsync to leave
non-persistent dirty cached data anywhere.
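
From user space, the intended usage might look something like the sketch below. MAP_DAX_WRITETHROUGH is hypothetical: it is only a name floated in this email, so the sketch defines it as 0 and the code runs as an ordinary MAP_SHARED mapping on any kernel. The helper name wt_store is also made up, and the comments about write-through stores and a cheap fsync describe the proposed semantics, not anything a current kernel does.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL flag from this thread. Defined as 0 so that the sketch
 * compiles and behaves like a plain MAP_SHARED mapping. */
#ifndef MAP_DAX_WRITETHROUGH
#define MAP_DAX_WRITETHROUGH 0
#endif

/* Store len (> 0) bytes from src at the start of the file at path through
 * a shared mapping, then fsync. Returns 0 on success, -1 on failure. */
static int wt_store(const char *path, const void *src, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_DAX_WRITETHROUGH, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(p, src, len);  /* proposal: these stores would go write-through */
    int rc = fsync(fd);   /* proposal: no per-page flushing left to do */

    munmap(p, len);
    close(fd);
    return rc;
}
```

The point of the sketch is that the application's code path stays the standard one (write through the mapping, then fsync), so nothing depends on the application getting cache flushing right.
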
It makes sufficiently DAX-aware applications very fast. Reads are
unaffected, and non-temporal writes should be the same speed as they are
under any other circumstances.

It makes applications that set it blindly very slow. Applications that
use standard writes (i.e. plain stores that are neither fast string
operations nor explicit non-temporal writes) will suffer. But they'll
still work correctly.
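
To make "explicit non-temporal writes" concrete, here is a minimal x86-only sketch of a copy routine built on the SSE2 streaming-store intrinsic. The function name nt_copy and its alignment restrictions are choices made for this example; a real library would also handle unaligned heads and tails.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst using non-temporal (streaming) stores.
 * For simplicity dst must be 16-byte aligned and len a multiple of 16.
 * Returns 0 on success, -1 if those preconditions are not met. */
static int nt_copy(void *dst, const void *src, size_t len)
{
    if (((uintptr_t)dst & 15) != 0 || (len & 15) != 0)
        return -1;

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

    /* Order the streaming stores before anything that follows. */
    _mm_sfence();
    return 0;
}
```

An application that writes this way should not care much whether the mapping is write-back or write-through, which is why it would not be slowed down by the flag. One that fills the mapping with ordinary byte or word stores is the case that would suffer.
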
Applications that want a WB mapping with manually-managed persistence
can still do it, but fsync will be slow. Adding an fmetadatasync() for
their benefit might be a decent idea, but it would just be icing on the
cake.
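
For comparison, "manually-managed persistence" on a write-back mapping roughly means the application flushes its own cache lines, along the lines of the x86-only sketch below. The function name flush_range and the 64-byte line size are assumptions for this example, and the pcommit step mentioned earlier is left out because the sketch does not assume hardware or toolchain support for it.

```c
#include <emmintrin.h>  /* SSE2: _mm_clflush, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64  /* assumed x86 cache line size */

/* Flush every cache line overlapping [addr, addr + len), then fence.
 * Returns the number of cache lines flushed. */
static size_t flush_range(const void *addr, size_t len)
{
    if (len == 0)
        return 0;

    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHELINE) {
        _mm_clflush((const void *)p);
        lines++;
    }

    /* Make sure the flushes are ordered before whatever comes next. */
    _mm_sfence();
    return lines;
}
```

An application doing this after every update has no use for the data-flushing part of fsync, only for the metadata part, which is where something like fmetadatasync() would come in.
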
Unlike with MAP_PMEM_AWARE, there's no issue with malicious users who map
the thing with the wrong flag, write, call fsync, and snicker because
now the other applications might read data and be surprised that the
data they just read isn't persistent even if they subsequently call fsync.

There would be details to be hashed out in case a page is mapped
normally and with MAP_DAX_WRITETHROUGH in separate mappings.

--Andy