From: Jan Kara <jack@suse.cz>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
hch@lst.de, ritesh.list@gmail.com,
Luis Chamberlain <mcgrof@kernel.org>,
dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com,
andres@anarazel.de, brauner@kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
Date: Wed, 22 Apr 2026 12:00:34 +0200 [thread overview]
Message-ID: <eqbodkjxxg32ryxsdngx35yhv5fy4p44fmvlnlnoyib6bstlmf@4hzxnfwleynb> (raw)
In-Reply-To: <aee8xaz_5x-aQ5BK@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
On Tue 21-04-26 23:37:01, Ojaswin Mujoo wrote:
> On Mon, Apr 20, 2026 at 01:28:18PM +0200, Jan Kara wrote:
> > On Sat 18-04-26 01:12:22, Ojaswin Mujoo wrote:
> > > On Thu, Apr 16, 2026 at 02:34:15PM +0200, Jan Kara wrote:
> > > > > @@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
> > > > > +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> > > > > + struct iomap_iter *iter, struct iov_iter *i,
> > > > > + const struct iomap_writethrough_ops *wt_ops)
> > > > > +
> > > > > +{
>
> <...>
>
> > > your comment but) after this email, I started diggin a bit more into why
> > > it is needed. As per my understanding, it tackles 2 things:
> > >
> > > Problem 1. mkclean's the old EOF folio so that the FS can fault again. This
> > > allows us to allocate new blocks which previously might not be allocated
> > > if bs < ps.
> > >
> > > Problem 2. Since mmap writes can dirty data beyond EOF, we zero the range from
> > > old EOF to end of that folio so that readers dont read junk data after
> > > isize extension.
> >
> > Correct.
> >
> > > Another thing I noticed is that most users of
> > > iomap_file_buffered_write() do their own eof zeroing in the FS layer
> > > (eg, xfs_file_write_zero_eof(), ext4's new changes,
> > > ntfs_extend_initialized_size() etc).
> > > I think this FS level zerooing should take care of mkcleaning the eof
> > > folio (problem 1), as they call iomap_zero_range() which would flush the
> > > eof range anyways. So am I right in assuming that for FSes that do their
> > > own zeroing, 1. is already taken care of?
> >
> > Well, I don't see anything that would writeprotect the old tail page in
> > iomap_zero_range(). I think iomap_zero_range() calls are there mostly to
> > address 2. Not only due to mmap but also possibly to clear whatever junk
> > there can be in the blocks after EOF.
>
> Well I was thinking more like if the EOF page was mmap'd it would be
> dirty and blocks beyond EOF would be unmapped, so iomap_zero_range() will
> write it back which shall mkclean() the folio.
>
> But I think the same race we discussed for problem 2 can also occur
> here.
>
> Thread 1 (extending write) Thread 2 (mmap writer)
>
> iomap_zero_range()
> filemap_write_and_wait_range()
> // mmaps & writes EOF range
> iomap_write_iter()
> isize = new_size
> // pagecache_isize_extended() is
> needed to mkclean() old EOF page.
Yes, this race exists and unlike in the case of zeroing where it is mostly
harmless not guranteeing calling page_mkwrite() with updated i_size can
lead to filesystem tripping on assertions, data loss or similar.
> > > As for 2, I think after the EOF zeroing of the FS, there might be a
> > > window before iomap_write_iter() where an mmap writer can still dirty
> > > EOF blocks, hence the pagecache_isize_extended() would be needed here.
> > > But doesn't that then make the eof zeroing in the FS layer redundant? Am
> > > I missing something here?
> >
> > Hmm, I agree the zeroing looks duplicit (for some users of
> > pagecache_isize_extended()). And yes, doing the zeroing from
> > xfs_file_write_zero_eof() is somewhat racy (mmap writer can still come and
> > write non-zeros before we update i_size) but I'd have hard time to argue it
> > really practically matters - you are racing mmap writes with buffered
> > writes so any kind of write atomicity guarantees are not there.
>
> Yeah, seems like it is not enough to take care of either 1 or 2 and
> pagecache_isize_extended() should maybe be enough. I was just wondering
> if we could optimize it away even for normal extend path (no racing mmap),
> we can avoid the expensive folio_zero_range() calls.
>
> Regardless, Ive not looked at this more closely and its a separate issue
> so we can revisit it later. For now I wanted some clarity around
> pagecache_isize_extended() so thanks for that.
Well, but pagecache_isize_extended() doesn't guarantee on disk blocks are
zeroed out as well as it doesn't dirty the page. Also
xfs_file_write_zero_eof() potentially handles zeroing of more than a tail
page. So you cannot simply drop one of these.
> > > Regardless, for our case I think we will also need to do the
> > > pagecache_isize_extended(), mainly to take care of problem 2, but where
> > > exactly should we do it now? We currently change the isize in endio()
> > > but for aio, it can run outside inode or folio lock. I think this
> > > function needs to be called under inode lock(). Hmm.. its a bit late here so
> > > I'll revisit this tomorrow with a fresh mind :)
> >
> > I think mainly to take care of problem 1... You are correct about
> > inode_lock but since we are updating i_size, we should be better holding
> > it, shouldn't we?
>
> Yes you are correct. In the aio writethrough codepath, the inode update
> is happening without the inode lock which is wrong. I overlooked the
> fact that even aio dio uses IOMAP_DIO_FORCE_WAIT to force isize update
> under inode lock, and we should do something similar as well.
Yes.
> So in v3, I make the change that for extending writes we shall always
> finish them in "sync" fashion so ->endio() runs under inode lock. Then,
> after ->endio() in iomap_dio_complete(), I will call
> pagecache_isize_extended() to take care of this. Just like isize update
> right now, the isize_extension only runs when the IO was successful
> otherwise we return an error to the user. This gives us semantics like
> dio while handling extension properly.
>
> Does that sound okay?
Yep, sounds fine.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2026-04-22 10:00 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io() Ojaswin Mujoo
2026-04-15 6:14 ` Christoph Hellwig
2026-04-23 10:12 ` Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-16 12:05 ` Jan Kara
2026-04-16 12:34 ` Jan Kara
2026-04-17 19:42 ` Ojaswin Mujoo
2026-04-20 11:28 ` Jan Kara
2026-04-21 18:07 ` Ojaswin Mujoo
2026-04-22 10:00 ` Jan Kara [this message]
2026-04-23 10:08 ` Ojaswin Mujoo
2026-04-17 4:13 ` Pankaj Raghav (Samsung)
2026-04-18 7:33 ` Ojaswin Mujoo
2026-04-20 11:56 ` Pankaj Raghav (Samsung)
2026-04-21 18:15 ` Ojaswin Mujoo
2026-04-22 6:19 ` Pankaj Raghav
2026-04-22 6:40 ` Ritesh Harjani
2026-04-08 18:45 ` [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough Ojaswin Mujoo
2026-04-17 3:54 ` [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Pankaj Raghav (Samsung)
2026-04-18 7:26 ` Ojaswin Mujoo
2026-04-23 12:25 ` Pankaj Raghav (Samsung)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=eqbodkjxxg32ryxsdngx35yhv5fy4p44fmvlnlnoyib6bstlmf@4hzxnfwleynb \
--to=jack@suse.cz \
--cc=andres@anarazel.de \
--cc=brauner@kernel.org \
--cc=dgc@kernel.org \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=john.g.garry@oracle.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=p.raghav@samsung.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox