From: Matthew Wilcox <willy@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O
Date: Tue, 8 Apr 2014 16:21:02 -0400 [thread overview]
Message-ID: <20140408202102.GB5727@linux.intel.com> (raw)
In-Reply-To: <20140408175600.GE2713@quack.suse.cz>
On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote:
> > +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> > + loff_t offset, loff_t end, int rw)
> > +{
> > + loff_t final = end - offset + first; /* The final byte of the buffer */
> > + if (rw != WRITE) {
> > + memset(addr, 0, size);
> > + return;
> > + }
> It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do
> this for unwritten blocks) when reading from them. Presumably it could also
> have undesired effects on endurance of persistent memory. Instead I'd expect
> that you simply zero out user provided buffer the same way as you do it for
> holes.
I think we have to zero it here, because the second time we call
get_block() for a given block, it won't be BH_New any more, so we won't
know that it's supposed to be zeroed.
> > +/*
> > + * When ext4 encounters a hole, it likes to return without modifying the
> > + * buffer_head which means that we can't trust b_size. To cope with this,
> > + * we set b_state to 0 before calling get_block and, if any bit is set, we
> > + * know we can trust b_size. Unfortunate, really, since ext4 does know
> > + * precisely how long a hole is and would save us time calling get_block
> > + * repeatedly.
> Well, this is really a problem of get_blocks() returning the result in
> struct buffer_head which is used for input as well. I don't think it is
> actually ext4 specific.
Of course it's ext4 specific! It's the ext4_get_block() implementation
which is choosing not to return the length of the hole. XFS does return
the length of the hole. I think something like this would fix it:
+++ b/fs/ext4/inode.c
@@ -727,14 +727,14 @@ static int _ext4_get_block(struct inode *inode, sector_t i
}
ret = ext4_map_blocks(handle, inode, &map, flags);
+ map_bh(bh, inode->i_sb, map.m_pblk);
+ bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+ bh->b_size = inode->i_sb->s_blocksize * map.m_len;
if (ret > 0) {
ext4_io_end_t *io_end = ext4_inode_aio(inode);
- map_bh(bh, inode->i_sb, map.m_pblk);
- bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
set_buffer_defer_completion(bh);
- bh->b_size = inode->i_sb->s_blocksize * map.m_len;
ret = 0;
}
if (started)
(completely untested).
> > + while (offset < end) {
> > + void __user *buf = iov[seg].iov_base + copied;
> > +
> > + if (offset == max) {
> > + sector_t block = offset >> inode->i_blkbits;
> > + unsigned first = offset - (block << inode->i_blkbits);
> > + long size;
> > +
> > + if (offset == bh_max) {
> > + bh->b_size = PAGE_ALIGN(end - offset);
> > + bh->b_state = 0;
> > + retval = get_block(inode, block, bh,
> > + rw == WRITE);
> > + if (retval)
> > + break;
> > + if (!buffer_size_valid(bh))
> > + bh->b_size = 1 << inode->i_blkbits;
> > + bh_max = offset - first + bh->b_size;
> > + } else {
> > + unsigned done = bh->b_size - (bh_max -
> > + (offset - first));
> > + bh->b_blocknr += done >> inode->i_blkbits;
> > + bh->b_size -= done;
> It took me quite some time to figure out what this does and whether it is
> correct :). Why isn't this at the place where we advance all other
> iterators like offset, addr, etc.?
It'll be kind of tricky to move it because 'len' is not necessarily
a multiple of i_blkbits, so we can't necessarily maintain b_blocknr
accurately.
> > + if (rw == WRITE) {
> > + if (!buffer_mapped(bh)) {
> > + retval = -EIO;
> > + break;
> -EIO looks like a wrong error here. Or maybe it is the right one and it
> only needs some explanation? The thing is that for direct IO some
> filesystems choose not to fill holes for direct IO and fall back to
> buffered IO instead (to avoid exposure of uninitialized blocks if the
> system crashes after blocks have been added to a file but before they were
> written out). For DAX you are pretty much free to define what you ask from
> the get_blocks() (and this fallback behavior is somewhat disputed behavior
> in direct IO case so you might want to differ here) but you should document
> it somewhere.
Hmm ... I thought that calling get_block() with the create argument would
force the return of a bh with the Mapped bit set. Did I misunderstand that
aspect of the undocumented get_block() API too?
> > + if ((flags & DIO_LOCKING) && (rw == READ)) {
> > + struct address_space *mapping = inode->i_mapping;
> > + mutex_lock(&inode->i_mutex);
> > + retval = filemap_write_and_wait_range(mapping, offset, end - 1);
> > + if (retval) {
> > + mutex_unlock(&inode->i_mutex);
> > + goto out;
> > + }
> Is there a reason for this? I'd assume DAX has no pages in pagecache...
There will be pages in the page cache for holes that we page faulted on.
They must go! :-)
> > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> > struct inode *inode = mapping->host;
> > ssize_t ret;
> >
> > - ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> > + if (IS_DAX(inode))
> > + ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> > + ext2_get_block, NULL, DIO_LOCKING);
> > + else
> > + ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> > ext2_get_block);
> I'd somewhat prefer to have a ext2_direct_IO() as is and have
> ext2_dax_IO() call only dax_do_io() (and use that as .direct_io in
> ext2_aops_xip). Then there's no need to check IS_DAX() and the code would
> look more obvious to me. But I don't feel strongly about it.
I can look at that ... but I was hoping to not have separate aops for
XIP and non-XIP files.
> > @@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
> > extern void save_mount_options(struct super_block *sb, char *options);
> > extern void replace_mount_options(struct super_block *sb, char *options);
> >
> > +static inline bool io_is_direct(struct file *filp)
> > +{
> > + return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
> > +}
> > +
> BTW: It seems fs/open.c: open_check_o_direct() can be simplified to not
> check for get_xip_mem(), cannot it?
That's in a later patch
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-04-08 23:17 UTC|newest]
Thread overview: 90+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-23 19:08 [PATCH v7 00/22] Support ext4 on NV-DIMMs Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 01/22] Fix XIP fault vs truncate race Matthew Wilcox
2014-03-29 15:57 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 02/22] Allow page fault handlers to perform the COW Matthew Wilcox
2014-04-08 16:34 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 03/22] axonram: Fix bug in direct_access Matthew Wilcox
2014-03-29 16:22 ` Jan Kara
2014-04-02 19:24 ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 04/22] Change direct_access calling convention Matthew Wilcox
2014-03-29 16:30 ` Jan Kara
2014-04-02 19:27 ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 05/22] Introduce IS_DAX(inode) Matthew Wilcox
2014-04-08 15:32 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 06/22] Replace XIP read and write with DAX I/O Matthew Wilcox
2014-04-08 17:56 ` Jan Kara
2014-04-08 20:21 ` Matthew Wilcox [this message]
2014-04-09 9:14 ` Jan Kara
2014-04-09 15:19 ` Matthew Wilcox
2014-04-09 20:55 ` Jan Kara
2014-04-13 18:05 ` Matthew Wilcox
2014-04-09 12:04 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 07/22] Replace the XIP page fault handler with the DAX page fault handler Matthew Wilcox
2014-04-08 22:05 ` Jan Kara
2014-04-09 20:48 ` Matthew Wilcox
2014-04-09 21:12 ` Jan Kara
2014-04-13 11:21 ` Matthew Wilcox
2014-04-14 16:04 ` Jan Kara
2014-04-09 10:27 ` Jan Kara
2014-04-09 20:51 ` Matthew Wilcox
2014-04-09 21:43 ` Jan Kara
2014-04-13 18:03 ` Matthew Wilcox
2014-07-29 12:12 ` Matthew Wilcox
2014-07-29 21:04 ` Jan Kara
2014-07-29 21:23 ` Matthew Wilcox
2014-07-30 9:52 ` Jan Kara
2014-07-30 21:02 ` Matthew Wilcox
2014-08-09 11:00 ` Matthew Wilcox
2014-08-11 8:51 ` Jan Kara
2014-08-11 14:13 ` Matthew Wilcox
2014-08-11 14:35 ` Jan Kara
2014-08-11 15:02 ` Matthew Wilcox
2014-08-11 15:25 ` Jan Kara
2014-05-21 20:35 ` Toshi Kani
2014-06-05 22:38 ` Toshi Kani
2014-03-23 19:08 ` [PATCH v7 08/22] Replace xip_truncate_page with dax_truncate_page Matthew Wilcox
2014-04-08 22:17 ` Jan Kara
2014-04-09 9:26 ` Jan Kara
2014-04-13 19:07 ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 09/22] Remove mm/filemap_xip.c Matthew Wilcox
2014-04-08 18:21 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 10/22] Remove get_xip_mem Matthew Wilcox
2014-04-08 18:20 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 11/22] Replace ext2_clear_xip_target with dax_clear_blocks Matthew Wilcox
2014-04-09 9:46 ` Jan Kara
2014-04-10 14:16 ` Matthew Wilcox
2014-04-10 18:31 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 12/22] ext2: Remove ext2_xip_verify_sb() Matthew Wilcox
2014-04-09 9:52 ` Jan Kara
2014-04-10 14:22 ` Matthew Wilcox
2014-04-10 18:35 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 13/22] ext2: Remove ext2_use_xip Matthew Wilcox
2014-04-09 9:55 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 14/22] ext2: Remove xip.c and xip.h Matthew Wilcox
2014-04-09 9:59 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX Matthew Wilcox
2014-04-09 9:59 ` Jan Kara
2014-04-10 14:23 ` Matthew Wilcox
2014-03-23 19:08 ` [PATCH v7 16/22] ext2: Remove ext2_aops_xip Matthew Wilcox
2014-04-09 10:02 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 17/22] Get rid of most mentions of XIP in ext2 Matthew Wilcox
2014-04-09 10:04 ` Jan Kara
2014-04-10 14:26 ` Matthew Wilcox
2014-04-10 18:40 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 18/22] xip: Add xip_zero_page_range Matthew Wilcox
2014-04-09 10:15 ` Jan Kara
2014-04-10 14:27 ` Matthew Wilcox
2014-04-10 18:43 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static Matthew Wilcox
2014-03-24 19:11 ` tytso
2014-03-23 19:08 ` [PATCH v7 20/22] ext4: Add DAX functionality Matthew Wilcox
2014-04-09 12:17 ` Jan Kara
2014-03-23 19:08 ` [PATCH v7 21/22] ext4: Fix typos Matthew Wilcox
2014-03-24 19:16 ` tytso
2014-03-23 19:08 ` [PATCH v7 22/22] brd: Rename XIP to DAX Matthew Wilcox
2014-04-09 10:07 ` Jan Kara
2014-05-18 14:58 ` [PATCH v7 00/22] Support ext4 on NV-DIMMs Boaz Harrosh
2014-05-18 23:24 ` Matthew Wilcox
2014-06-17 18:11 ` Boaz Harrosh
2014-06-17 18:19 ` Matthew Wilcox
2014-06-17 18:39 ` Boaz Harrosh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140408202102.GB5727@linux.intel.com \
--to=willy@linux.intel.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=matthew.r.wilcox@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox