From: Andy Lutomirski <luto@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Andy Lutomirski <luto@kernel.org>,
Dan Williams <dan.j.williams@intel.com>,
Ross Zwisler <ross.zwisler@linux.intel.com>,
andy.rudoff@intel.com, Andrew Morton <akpm@linux-foundation.org>,
Jan Kara <jack@suse.cz>, linux-nvdimm <linux-nvdimm@lists.01.org>,
Linux API <linux-api@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Jeff Moyer <jmoyer@redhat.com>,
Linux FS Devel <linux-fsdevel@vger.kernel.org>,
Christoph Hellwig <hch@lst.de>
Subject: Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
Date: Tue, 20 Jun 2017 22:18:24 -0700
Message-ID: <CALCETrVYmbyNS-btvsN_M-QyWPZA_Y_4JXOM893g7nhZA+WviQ@mail.gmail.com>
In-Reply-To: <20170621014032.GL17542@dastard>
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> You're mangling terminology here. We don't "break COW" - we *use*
> copy-on-write to break *extent sharing*. We can break extent sharing
> in page_mkwrite - that's exactly what we do for normal pagecache
> based mmap writes, and it's done in page_mkwrite.
Right, my bad.
>
> It hasn't been enabled for DAX yet because it simply hasn't been
> robustly tested yet.
>
>> A per-inode
>> count of the number of live DAX mappings or of the number of struct
>> file instances that have requested DAX would work here.
>
> For what purpose does this serve? The reflink invalidates all the
> existing mappings, so the next write access causes a fault and then
> page_mkwrite is called and the shared extent will get COWed....
The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
right), except that IMO an API that doesn't involve making a change to
an inode that sticks around would be nice. The inode flag has the
unfortunate property that, if two different programs each try to set
the flag, mmap, write, and clear the flag, they'll stomp on each other
and risk data corruption.
I admit I'm now thoroughly confused as to exactly what XFS does here
-- does FS_XFLAG_DAX persist across unmount/mount?
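To make the stomping concrete, here is a minimal, untested sketch of the
per-program sequence. It assumes the flag is toggled through the generic
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls from <linux/fs.h>;
set_dax_flag() is a hypothetical helper name, not an existing API.

```c
#include <errno.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FS{GET,SET}XATTR, FS_XFLAG_DAX */
#include <sys/ioctl.h>

/*
 * Hypothetical helper: set (enable != 0) or clear (enable == 0)
 * FS_XFLAG_DAX on the inode behind fd.
 * Returns 0 on success, -1 with errno set on failure.
 *
 * This is a read-modify-write of inode-wide state, so nothing here
 * coordinates with another process doing the same thing to the same inode.
 */
static int set_dax_flag(int fd, int enable)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;

    if (enable)
        fsx.fsx_xflags |= FS_XFLAG_DAX;
    else
        fsx.fsx_xflags &= ~FS_XFLAG_DAX;

    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Each program would call set_dax_flag(fd, 1), mmap, write, and then
set_dax_flag(fd, 0). If two programs interleave that sequence on the same
inode, one program's clear can land while the other still has the file
mapped and believes the flag is set.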
>
>> - Trying to use DAX on a file that is already reflinked. The order
>> of operations doesn't matter hugely, except that breaking COW for the
>> entire range in question all at once would be faster and result in
>> better allocation.
>
> We have COW extent size hints for that. i.e. if you want to COW a
> huge page at a time, set the COW extent size hint to the huge page
> size...
Nifty.
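If I'm reading <linux/fs.h> right, the hint is set with the same ioctl
pair, via fsx_cowextsize plus FS_XFLAG_COWEXTSIZE. A sketch under that
assumption (set_cow_extsize_hint() is a hypothetical helper name, and I'm
assuming the value is in bytes and has to be a multiple of the filesystem
block size):

```c
#include <errno.h>
#include <linux/fs.h>   /* struct fsxattr, FS_XFLAG_COWEXTSIZE */
#include <sys/ioctl.h>

/*
 * Hypothetical helper: ask the filesystem to allocate COW extents for
 * this inode in units of `bytes`.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_cow_extsize_hint(int fd, unsigned int bytes)
{
    struct fsxattr fsx;

    /* Fetch the current attributes so the other fields are preserved. */
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;

    fsx.fsx_xflags |= FS_XFLAG_COWEXTSIZE;
    fsx.fsx_cowextsize = bytes;

    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

For a 2 MiB huge page (the x86-64 size, which is an assumption about the
target machine) the call would be set_cow_extsize_hint(fd, 2u << 20).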
> Apparently it is. There are people telling us that mtime
> updates in page faults introduce too much unpredictable latency and
> that screws over their low latency real time applications.
I was one of those, and I even wrote patches. I should try to dust them off.
>
> Those same people are telling us that dirty tracking in page faults
> for msync/fsync on DAX is too heavyweight and calling msync is too
> onerous and has unpredictable latencies because it might result in
> having to sync tens of thousands of unrelated dirty objects. Hence
> they want to use userspace data sync primitives to avoid this
> overhead and so filesystems need to make it possible to provide this
> userspace data sync capability.
If I were using DAX in production, I'd have exactly this issue. Let
me quote myself:
On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 3. (Not strictly related to DAX.) A way to tell the kernel "I have
> this file mmapped for write. Please go out of your way to avoid
> future page faults." I've wanted this for ordinary files on ext4.
> The kernel could, but presently does not, use hardware dirty tracking
> instead of software dirty tracking to decide when to write the page
> back. The kernel could also, in principle, write back dirty pages
> without ever write protecting them. For DAX, this might change
> behavior to prevent any operation that would relocate blocks or to
> have the kernel go out of its way to only do such operations when
> absolutely necessary and to immediately update and unwriteprotect the
> relevant pages.
I agree that this is a real issue, but it's not limited to DAX. I've
wanted a mode where I tell the kernel "I'm a high-performance
application mmapping this file and I'm going to write to it a lot. Do
your best to avoid any page faults, even if it adversely affects the
performance of the system." This mode could do lots of things. It
could cause the system to leave the page writable even after writeback
and, if possible, to use hardware dirty tracking. It could cause the
system to make a copy of the page and write back from the copy if
there is anything in play that could need stable pages during
writeback. And, for DAX, it could tell the system to keep the page
pinned and disallow moving it and reflinking it.
(Of course, the above requires that we either deal with mtime like my
patches do or that this heavyweight mechanism disable mtime updates.
I prefer the former.)
Here's the overall point I'm trying to make: unprivileged programs
that want to write to DAX files with userspace commit mechanisms
(CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
privilege, and with reasonably clean APIs. Ideally they could do this
to any file they have write access to. Programs that want to write to
mmapped files, DAX or otherwise, without latency spikes due to
.page_mkwrite should be able to opt in to a heavier weight mechanism.
But these two issues are somewhat independent, and I think they should
be solved separately.
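
For reference, the kind of userspace commit I mean looks roughly like the
sketch below. It uses CLFLUSH rather than CLFLUSHOPT so that it builds on
any x86-64 compiler without extra flags; flush_range() is a hypothetical
helper name and the 64-byte line size is an assumption. It is only
meaningful if the filesystem guarantees that the blocks behind the mapping
stay put, which is the whole point of this thread.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence (x86 with SSE2) */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64u   /* assumed cache line size in bytes */

/*
 * Hypothetical helper: flush every cache line that overlaps
 * [addr, addr + len), then fence so that the flushes are ordered
 * before any later stores. Returns the number of lines flushed.
 */
static size_t flush_range(const void *addr, size_t len)
{
    uintptr_t p, end;
    size_t lines = 0;

    if (len == 0)
        return 0;

    /* Round the start down to a cache line boundary. */
    p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE, lines++)
        _mm_clflush((const void *)p);

    _mm_sfence();
    return lines;
}
```

A program would store into the mapping and then call
flush_range(ptr, nbytes) in place of msync().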