linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-xfs@vger.kernel.org,  linux-fsdevel@vger.kernel.org,
	djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	 hch@lst.de, ritesh.list@gmail.com,
	Luis Chamberlain <mcgrof@kernel.org>,
	 dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com,
	andres@anarazel.de,  brauner@kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
Date: Wed, 22 Apr 2026 12:00:34 +0200	[thread overview]
Message-ID: <eqbodkjxxg32ryxsdngx35yhv5fy4p44fmvlnlnoyib6bstlmf@4hzxnfwleynb> (raw)
In-Reply-To: <aee8xaz_5x-aQ5BK@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On Tue 21-04-26 23:37:01, Ojaswin Mujoo wrote:
> On Mon, Apr 20, 2026 at 01:28:18PM +0200, Jan Kara wrote:
> > On Sat 18-04-26 01:12:22, Ojaswin Mujoo wrote:
> > > On Thu, Apr 16, 2026 at 02:34:15PM +0200, Jan Kara wrote:
> > > > > @@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
> > > > > +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> > > > > +				   struct iomap_iter *iter, struct iov_iter *i,
> > > > > +				   const struct iomap_writethrough_ops *wt_ops)
> > > > > +
> > > > > +{
> 
> <...>
> 
> > > your comment but) after this email, I started diggin a bit more into why
> > > it is needed.  As per my understanding, it tackles 2 things:
> > > 
> > > Problem 1. mkclean's the old EOF folio so that the FS can fault again. This
> > > allows us to allocate new blocks which previously might not be allocated
> > > if bs < ps.
> > > 
> > > Problem 2. Since mmap writes can dirty data beyond EOF, we zero the range from
> > > old EOF to end of that folio so that readers dont read junk data after
> > > isize extension.
> > 
> > Correct.
> > 
> > > Another thing I noticed is that most users of
> > > iomap_file_buffered_write() do their own eof zeroing in the FS layer
> > > (eg, xfs_file_write_zero_eof(), ext4's new changes,
> > > ntfs_extend_initialized_size() etc). 
> > > I think this FS level zerooing should take care of mkcleaning the eof
> > > folio (problem 1), as they call iomap_zero_range() which would flush the
> > > eof range anyways. So am I right in assuming that for FSes that do their
> > > own zeroing, 1. is already taken care of?
> > 
> > Well, I don't see anything that would writeprotect the old tail page in
> > iomap_zero_range(). I think iomap_zero_range() calls are there mostly to
> > address 2. Not only due to mmap but also possibly to clear whatever junk
> > there can be in the blocks after EOF.
> 
> Well I was thinking more like if the EOF page was mmap'd it would be
> dirty and blocks beyond EOF would be unmapped, so iomap_zero_range() will
> write it back which shall mkclean() the folio.
> 
> But I think the same race we discussed for problem 2 can also occur
> here.
> 
> Thread 1 (extending write)             Thread 2 (mmap writer)
> 
> iomap_zero_range()
> filemap_write_and_wait_range()
>                                        // mmaps & writes EOF range
> iomap_write_iter()
> isize = new_size
> // pagecache_isize_extended() is
>    needed to mkclean() old EOF page.

Yes, this race exists and unlike in the case of zeroing where it is mostly
harmless not guranteeing calling page_mkwrite() with updated i_size can
lead to filesystem tripping on assertions, data loss or similar.

> > > As for 2, I think after the EOF zeroing of the FS, there might be a
> > > window before iomap_write_iter() where an mmap writer can still dirty
> > > EOF blocks, hence the pagecache_isize_extended() would be needed here.
> > > But doesn't that then make the eof zeroing in the FS layer redundant? Am
> > > I missing something here?
> > 
> > Hmm, I agree the zeroing looks duplicit (for some users of
> > pagecache_isize_extended()). And yes, doing the zeroing from
> > xfs_file_write_zero_eof() is somewhat racy (mmap writer can still come and
> > write non-zeros before we update i_size) but I'd have hard time to argue it
> > really practically matters - you are racing mmap writes with buffered
> > writes so any kind of write atomicity guarantees are not there. 
> 
> Yeah, seems like it is not enough to take care of either 1 or 2 and
> pagecache_isize_extended() should maybe be enough. I was just wondering
> if we could optimize it away even for normal extend path (no racing mmap),
> we can avoid the expensive folio_zero_range() calls.
> 
> Regardless, Ive not looked at this more closely and its a separate issue
> so we can revisit it later. For now I wanted some clarity around
> pagecache_isize_extended() so thanks for that.

Well, but pagecache_isize_extended() doesn't guarantee on disk blocks are
zeroed out as well as it doesn't dirty the page. Also
xfs_file_write_zero_eof() potentially handles zeroing of more than a tail
page. So you cannot simply drop one of these.

> > > Regardless, for our case I think we will also need to do the
> > > pagecache_isize_extended(), mainly to take care of problem 2, but where
> > > exactly should we do it now?  We currently change the isize in endio()
> > > but for aio, it can run outside inode or folio lock. I think this
> > > function needs to be called under inode lock(). Hmm.. its a bit late here so
> > > I'll revisit this tomorrow with a fresh mind :)
> > 
> > I think mainly to take care of problem 1... You are correct about
> > inode_lock but since we are updating i_size, we should be better holding
> > it, shouldn't we?
> 
> Yes you are correct. In the aio writethrough codepath, the inode update
> is happening without the inode lock which is wrong. I overlooked the
> fact that even aio dio uses IOMAP_DIO_FORCE_WAIT to force isize update
> under inode lock, and we should do something similar as well.

Yes.

> So in v3, I make the change that for extending writes we shall always
> finish them in "sync" fashion so ->endio() runs under inode lock. Then,
> after ->endio() in iomap_dio_complete(), I will call
> pagecache_isize_extended() to take care of this. Just like isize update
> right now, the isize_extension only runs when the IO was successful
> otherwise we return an error to the user. This gives us semantics like
> dio while handling extension properly.
> 
> Does that sound okay?

Yep, sounds fine.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


  reply	other threads:[~2026-04-22 10:00 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io() Ojaswin Mujoo
2026-04-15  6:14   ` Christoph Hellwig
2026-04-23 10:12     ` Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-16 12:05   ` Jan Kara
2026-04-16 12:34   ` Jan Kara
2026-04-17 19:42     ` Ojaswin Mujoo
2026-04-20 11:28       ` Jan Kara
2026-04-21 18:07         ` Ojaswin Mujoo
2026-04-22 10:00           ` Jan Kara [this message]
2026-04-23 10:08             ` Ojaswin Mujoo
2026-04-17  4:13   ` Pankaj Raghav (Samsung)
2026-04-18  7:33     ` Ojaswin Mujoo
2026-04-20 11:56   ` Pankaj Raghav (Samsung)
2026-04-21 18:15     ` Ojaswin Mujoo
2026-04-22  6:19       ` Pankaj Raghav
2026-04-22  6:40         ` Ritesh Harjani
2026-04-08 18:45 ` [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough Ojaswin Mujoo
2026-04-17  3:54 ` [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Pankaj Raghav (Samsung)
2026-04-18  7:26   ` Ojaswin Mujoo
2026-04-23 12:25 ` Pankaj Raghav (Samsung)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=eqbodkjxxg32ryxsdngx35yhv5fy4p44fmvlnlnoyib6bstlmf@4hzxnfwleynb \
    --to=jack@suse.cz \
    --cc=andres@anarazel.de \
    --cc=brauner@kernel.org \
    --cc=dgc@kernel.org \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=john.g.garry@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=p.raghav@samsung.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox