From: Ojaswin Mujoo <ojaswin@linux.ibm.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
hch@lst.de, ritesh.list@gmail.com,
Luis Chamberlain <mcgrof@kernel.org>,
dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com,
andres@anarazel.de, brauner@kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
Date: Thu, 23 Apr 2026 15:38:12 +0530 [thread overview]
Message-ID: <aenvjOFl9_yI0uTK@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com> (raw)
In-Reply-To: <eqbodkjxxg32ryxsdngx35yhv5fy4p44fmvlnlnoyib6bstlmf@4hzxnfwleynb>
On Wed, Apr 22, 2026 at 12:00:34PM +0200, Jan Kara wrote:
> On Tue 21-04-26 23:37:01, Ojaswin Mujoo wrote:
> > On Mon, Apr 20, 2026 at 01:28:18PM +0200, Jan Kara wrote:
> > > On Sat 18-04-26 01:12:22, Ojaswin Mujoo wrote:
> > > > On Thu, Apr 16, 2026 at 02:34:15PM +0200, Jan Kara wrote:
> > > > > > @@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
> > > > > > +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> > > > > > + struct iomap_iter *iter, struct iov_iter *i,
> > > > > > + const struct iomap_writethrough_ops *wt_ops)
> > > > > > +
> > > > > > +{
> >
> > <...>
> >
> > > > your comment but) after this email, I started diggin a bit more into why
> > > > it is needed. As per my understanding, it tackles 2 things:
> > > >
> > > > Problem 1. mkclean's the old EOF folio so that the FS can fault again. This
> > > > allows us to allocate new blocks which previously might not be allocated
> > > > if bs < ps.
> > > >
> > > > Problem 2. Since mmap writes can dirty data beyond EOF, we zero the range from
> > > > old EOF to end of that folio so that readers dont read junk data after
> > > > isize extension.
> > >
> > > Correct.
> > >
> > > > Another thing I noticed is that most users of
> > > > iomap_file_buffered_write() do their own eof zeroing in the FS layer
> > > > (eg, xfs_file_write_zero_eof(), ext4's new changes,
> > > > ntfs_extend_initialized_size() etc).
> > > > I think this FS level zerooing should take care of mkcleaning the eof
> > > > folio (problem 1), as they call iomap_zero_range() which would flush the
> > > > eof range anyways. So am I right in assuming that for FSes that do their
> > > > own zeroing, 1. is already taken care of?
> > >
> > > Well, I don't see anything that would writeprotect the old tail page in
> > > iomap_zero_range(). I think iomap_zero_range() calls are there mostly to
> > > address 2. Not only due to mmap but also possibly to clear whatever junk
> > > there can be in the blocks after EOF.
> >
> > Well I was thinking more like if the EOF page was mmap'd it would be
> > dirty and blocks beyond EOF would be unmapped, so iomap_zero_range() will
> > write it back which shall mkclean() the folio.
> >
> > But I think the same race we discussed for problem 2 can also occur
> > here.
> >
> > Thread 1 (extending write) Thread 2 (mmap writer)
> >
> > iomap_zero_range()
> > filemap_write_and_wait_range()
> > // mmaps & writes EOF range
> > iomap_write_iter()
> > isize = new_size
> > // pagecache_isize_extended() is
> > needed to mkclean() old EOF page.
>
> Yes, this race exists and unlike in the case of zeroing where it is mostly
> harmless not guranteeing calling page_mkwrite() with updated i_size can
> lead to filesystem tripping on assertions, data loss or similar.
Right.
>
> > > > As for 2, I think after the EOF zeroing of the FS, there might be a
> > > > window before iomap_write_iter() where an mmap writer can still dirty
> > > > EOF blocks, hence the pagecache_isize_extended() would be needed here.
> > > > But doesn't that then make the eof zeroing in the FS layer redundant? Am
> > > > I missing something here?
> > >
> > > Hmm, I agree the zeroing looks duplicit (for some users of
> > > pagecache_isize_extended()). And yes, doing the zeroing from
> > > xfs_file_write_zero_eof() is somewhat racy (mmap writer can still come and
> > > write non-zeros before we update i_size) but I'd have hard time to argue it
> > > really practically matters - you are racing mmap writes with buffered
> > > writes so any kind of write atomicity guarantees are not there.
> >
> > Yeah, seems like it is not enough to take care of either 1 or 2 and
> > pagecache_isize_extended() should maybe be enough. I was just wondering
> > if we could optimize it away even for normal extend path (no racing mmap),
> > we can avoid the expensive folio_zero_range() calls.
> >
> > Regardless, Ive not looked at this more closely and its a separate issue
> > so we can revisit it later. For now I wanted some clarity around
> > pagecache_isize_extended() so thanks for that.
>
> Well, but pagecache_isize_extended() doesn't guarantee on disk blocks are
> zeroed out as well as it doesn't dirty the page. Also
> xfs_file_write_zero_eof() potentially handles zeroing of more than a tail
> page. So you cannot simply drop one of these.
Hmm yea true, xfs_file_write_zero_eof() does take of zeroing any mapped
eof blocks on disk as well.
>
> > > > Regardless, for our case I think we will also need to do the
> > > > pagecache_isize_extended(), mainly to take care of problem 2, but where
> > > > exactly should we do it now? We currently change the isize in endio()
> > > > but for aio, it can run outside inode or folio lock. I think this
> > > > function needs to be called under inode lock(). Hmm.. its a bit late here so
> > > > I'll revisit this tomorrow with a fresh mind :)
> > >
> > > I think mainly to take care of problem 1... You are correct about
> > > inode_lock but since we are updating i_size, we should be better holding
> > > it, shouldn't we?
> >
> > Yes you are correct. In the aio writethrough codepath, the inode update
> > is happening without the inode lock which is wrong. I overlooked the
> > fact that even aio dio uses IOMAP_DIO_FORCE_WAIT to force isize update
> > under inode lock, and we should do something similar as well.
>
> Yes.
>
> > So in v3, I make the change that for extending writes we shall always
> > finish them in "sync" fashion so ->endio() runs under inode lock. Then,
> > after ->endio() in iomap_dio_complete(), I will call
( I meant iomap_writethrough_complete(), here)
> > pagecache_isize_extended() to take care of this. Just like isize update
> > right now, the isize_extension only runs when the IO was successful
> > otherwise we return an error to the user. This gives us semantics like
> > dio while handling extension properly.
> >
> > Does that sound okay?
>
> Yep, sounds fine.
Got it, thanks for the review Jan!
Regards,
ojaswin
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
next prev parent reply other threads:[~2026-04-23 10:08 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io() Ojaswin Mujoo
2026-04-15 6:14 ` Christoph Hellwig
2026-04-23 10:12 ` Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-16 12:05 ` Jan Kara
2026-04-16 12:34 ` Jan Kara
2026-04-17 19:42 ` Ojaswin Mujoo
2026-04-20 11:28 ` Jan Kara
2026-04-21 18:07 ` Ojaswin Mujoo
2026-04-22 10:00 ` Jan Kara
2026-04-23 10:08 ` Ojaswin Mujoo [this message]
2026-04-17 4:13 ` Pankaj Raghav (Samsung)
2026-04-18 7:33 ` Ojaswin Mujoo
2026-04-20 11:56 ` Pankaj Raghav (Samsung)
2026-04-21 18:15 ` Ojaswin Mujoo
2026-04-22 6:19 ` Pankaj Raghav
2026-04-22 6:40 ` Ritesh Harjani
2026-04-08 18:45 ` [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough Ojaswin Mujoo
2026-04-17 3:54 ` [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Pankaj Raghav (Samsung)
2026-04-18 7:26 ` Ojaswin Mujoo
2026-04-23 12:25 ` Pankaj Raghav (Samsung)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aenvjOFl9_yI0uTK@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com \
--to=ojaswin@linux.ibm.com \
--cc=andres@anarazel.de \
--cc=brauner@kernel.org \
--cc=dgc@kernel.org \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=john.g.garry@oracle.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=p.raghav@samsung.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox