From: Jan Kara <jack@suse.cz>
To: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
lsf-pc@lists.linux-foundation.org
Subject: Re: [RFCv1][WIP] ext2: Move direct-io to use iomap
Date: Thu, 23 Mar 2023 12:30:30 +0100 [thread overview]
Message-ID: <20230323113030.ryne2oq27b6cx6xz@quack3> (raw)
In-Reply-To: <87zg85pa5i.fsf@doe.com>
On Wed 22-03-23 12:04:01, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> >> + pos += size;
> >> + if (pos > i_size_read(inode))
> >> + i_size_write(inode, pos);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static const struct iomap_dio_ops ext2_dio_write_ops = {
> >> + .end_io = ext2_dio_write_end_io,
> >> +};
> >> +
> >> +static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >> +{
> >> + struct file *file = iocb->ki_filp;
> >> + struct inode *inode = file->f_mapping->host;
> >> + ssize_t ret;
> >> + unsigned int flags;
> >> + unsigned long blocksize = inode->i_sb->s_blocksize;
> >> + loff_t offset = iocb->ki_pos;
> >> + loff_t count = iov_iter_count(from);
> >> +
> >> +
> >> + inode_lock(inode);
> >> + ret = generic_write_checks(iocb, from);
> >> + if (ret <= 0)
> >> + goto out_unlock;
> >> + ret = file_remove_privs(file);
> >> + if (ret)
> >> + goto out_unlock;
> >> + ret = file_update_time(file);
> >> + if (ret)
> >> + goto out_unlock;
> >> +
> >> + /*
> >> + * We pass IOMAP_DIO_NOSYNC because otherwise iomap_dio_rw()
> >> + * calls for generic_write_sync in iomap_dio_complete().
> >> + * Since ext2_fsync nmust be called w/o inode lock,
> >> + * hence we pass IOMAP_DIO_NOSYNC and handle generic_write_sync()
> >> + * ourselves.
> >> + */
> >> + flags = IOMAP_DIO_NOSYNC;
> >
> > Meh, this is kind of ugly and we should come up with something better for
> > simple filesystems so that they don't have to play these games. Frankly,
> > these days I doubt there's anybody really needing inode_lock in
> > __generic_file_fsync(). Neither sync_mapping_buffers() nor
> > sync_inode_metadata() need inode_lock for their self-consistency. So it is
> > only about flushing more consistent set of metadata to disk when fsync(2)
> > races with other write(2)s to the same file so after a crash we have higher
> > chances of seeing some real state of the file. But I'm not sure it's really
> > worth keeping for filesystems that are still using sync_mapping_buffers().
> > People that care about consistency after a crash have IMHO moved to other
> > filesystems long ago.
> >
>
> One way which hch is suggesting is to use __iomap_dio_rw() -> unlock
> inode -> call generic_write_sync(). I haven't yet worked on this part.
So I see two problems with what Christoph suggests:
a) It is unfortunate API design to require trivial (and low maintenance)
filesystem to do these relatively complex locking games. But this can
be solved by providing appropriate wrapper for them I guess.
b) When you unlock the inode, other stuff can happen with the inode. And
e.g. i_size update needs to happen after IO is completed so filesystems
would have to be taught to avoid say two racing expanding writes. That's
IMHO really too much to ask.
> Are you suggesting to rip of inode_lock from __generic_file_fsync()?
> Won't it have a much larger implications?
Yes and yes :). But inode writeback already happens from other paths
without inode_lock so there's hardly any surprise there.
sync_mapping_buffers() is impossible to "customize" by filesystems and the
generic code is fine without inode_lock. So I have hard time imagining how
any filesystem would really depend on inode_lock in this path (famous last
words ;)).
> >> + if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
> >> + (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
> >> + flags |= IOMAP_DIO_FORCE_WAIT;
> >> +
> >> + ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
> >> + flags, NULL, 0);
> >> +
> >> + if (ret == -ENOTBLK)
> >> + ret = 0;
> >
> > So iomap_dio_rw() doesn't have the DIO_SKIP_HOLES behavior of
> > blockdev_direct_IO(). Thus you have to implement that in your
> > ext2_iomap_ops, in particular in iomap_begin...
> >
>
> Aah yes. Thanks for pointing that out -
> ext2_iomap_begin() should have something like this -
> /*
> * We cannot fill holes in indirect tree based inodes as that could
> * expose stale data in the case of a crash. Use the magic error code
> * to fallback to buffered I/O.
> */
>
> Also I think ext2_iomap_end() should also handle a case like in ext4 -
>
> /*
> * Check to see whether an error occurred while writing out the data to
> * the allocated blocks. If so, return the magic error code so that we
> * fallback to buffered I/O and attempt to complete the remainder of
> * the I/O. Any blocks that may have been allocated in preparation for
> * the direct I/O will be reused during buffered I/O.
> */
> if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> return -ENOTBLK;
>
>
> I am wondering if we have testcases in xfstests which really tests these
> functionalities also or not? Let me give it a try...
> ... So I did and somehow couldn't find any testcase which fails w/o
> above changes.
I guess we don't. It isn't that simple (but certainly possible) to test for
stale data exposure...
> Another query -
>
> We have this function ext2_iomap_end() (pasted below)
> which calls ext2_write_failed().
>
> Here IMO two cases are possible -
>
> 1. written is 0. which means an error has occurred.
> In that case calling ext2_write_failed() make sense.
>
> 2. consider a case where written > 0 && written < length.
> (This is possible right?). In that case we still go and call
> ext2_write_failed(). This function will truncate the pagecache and disk
> blocks beyong i_size. Now we haven't yet updated inode->i_size (we do
> that in ->end_io which gets called in the end during completion)
> So that means it just removes everything.
>
> Then in ext2_dax_write_iter(), we might go and update inode->i_size
> to iocb->ki_pos including for short writes. This looks like it isn't
> consistent because earlier we had destroyed all the blocks for the short
> writes and we will be returning ret > 0 to the user saying these many
> bytes have been written.
> Again I haven't yet found a test case at least not in xfstests which
> can trigger this short writes. Let me know your thoughts on this.
> All of this lies on the fact that there can be a case where
> written > 0 && written < length. I will read more to see if this even
> happens or not. But I atleast wanted to capture this somewhere.
So as far as I remember, direct IO writes as implemented in iomap are
all-or-nothing (see iomap_dio_complete()). But it would be good to assert
that in ext4 code to avoid surprises if the generic code changes.
> Another thing -
> In dax while truncating the inode i_size in ext2_setsize(),
> I think we don't properly call dax_zero_blocks() when we are trying to
> zero the last block beyond EOF. i.e. for e.g. it can be called with len
> as 0 if newsize is page_aligned. It then will call ext2_get_blocks() with
> len = 0 and can bug_on at maxblocks == 0.
How will it call ext2_get_blocks() with len == 0? AFAICS iomap_iter() will
not call iomap_begin() at all if iter.len == 0.
> I think it should be this. I will spend some more time analyzing this
> and also test it once against DAX paths.
>
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 7ff669d0b6d2..cc264b1e288c 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1243,9 +1243,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
> inode_dio_wait(inode);
>
> if (IS_DAX(inode))
> - error = dax_zero_range(inode, newsize,
> - PAGE_ALIGN(newsize) - newsize, NULL,
> - &ext2_iomap_ops);
> + error = dax_truncate_page(inode, newsize, NULL,
> + &ext2_iomap_ops);
> else
> error = block_truncate_page(inode->i_mapping,
> newsize, ext2_get_block);
That being said this is indeed a nice cleanup.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2023-03-23 11:30 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-01-29 4:46 LSF/MM/BPF 2023 IOMAP conversion status update Luis Chamberlain
2023-01-29 5:06 ` Matthew Wilcox
2023-01-29 5:39 ` Luis Chamberlain
2023-02-08 16:04 ` Jan Kara
2023-02-24 7:01 ` Zhang Yi
2023-02-26 20:16 ` Ritesh Harjani
2023-03-16 14:40 ` [RFCv1][WIP] ext2: Move direct-io to use iomap Ritesh Harjani (IBM)
2023-03-16 15:41 ` Darrick J. Wong
2023-03-20 16:11 ` Ritesh Harjani
2023-03-20 13:15 ` Christoph Hellwig
2023-03-20 17:51 ` Jan Kara
2023-03-22 6:34 ` Ritesh Harjani
2023-03-23 11:30 ` Jan Kara [this message]
2023-03-23 13:19 ` Ritesh Harjani
2023-03-30 0:02 ` Christoph Hellwig
2023-02-27 19:26 ` LSF/MM/BPF 2023 IOMAP conversion status update Darrick J. Wong
2023-02-27 21:02 ` Matthew Wilcox
2023-02-27 19:47 ` Darrick J. Wong
2023-02-27 20:24 ` Luis Chamberlain
2023-02-27 19:06 ` Darrick J. Wong
2023-02-27 19:58 ` Luis Chamberlain
2023-03-01 16:59 ` Ritesh Harjani
2023-03-01 17:08 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230323113030.ryne2oq27b6cx6xz@quack3 \
--to=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lsf-pc@lists.linux-foundation.org \
--cc=ritesh.list@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox