From: Jan Kara <jack@suse.cz>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
hch@lst.de, ritesh.list@gmail.com,
Luis Chamberlain <mcgrof@kernel.org>,
dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com,
andres@anarazel.de, brauner@kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
Date: Mon, 20 Apr 2026 13:28:18 +0200 [thread overview]
Message-ID: <laegs2jt5ifqmfmdlsl7gsrc64z4nrnwbxazewbr7jk27emdej@l6vhj7pjh3dy> (raw)
In-Reply-To: <aeKNHi8GloNgch7X@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
On Sat 18-04-26 01:12:22, Ojaswin Mujoo wrote:
> On Thu, Apr 16, 2026 at 02:34:15PM +0200, Jan Kara wrote:
> > > @@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
> > > +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> > > + struct iomap_iter *iter, struct iov_iter *i,
> > > + const struct iomap_writethrough_ops *wt_ops)
> > > +
> > > +{
> > > + ssize_t total_written = 0;
> > > + int status = 0;
> > > + struct address_space *mapping = iter->inode->i_mapping;
> > > + size_t chunk = mapping_max_folio_size(mapping);
> > > + unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
> > > + unsigned int bs = i_blocksize(iter->inode);
> > > +
> > > + /* copied over based on DIO handles these flags */
> > > + if (iter->iomap.type == IOMAP_UNWRITTEN)
> > > + wt_ctx->flags |= IOMAP_DIO_UNWRITTEN;
> > > + if (iter->iomap.flags & IOMAP_F_SHARED)
> > > + wt_ctx->flags |= IOMAP_DIO_COW;
> > > +
> > > + if (!(iter->flags & IOMAP_WRITETHROUGH))
> > > + return -EINVAL;
> > > +
> > > + do {
> > > + struct folio *folio;
> > > + size_t offset; /* Offset into folio */
> > > + u64 bytes; /* Bytes to write to folio */
> > > + size_t copied; /* Bytes copied from user */
> > > + u64 written; /* Bytes have been written */
> > > + loff_t pos;
> > > + size_t off_aligned, len_aligned;
> > > +
> > > + bytes = iov_iter_count(i);
> > > +retry:
> > > + offset = iter->pos & (chunk - 1);
> > > + bytes = min(chunk - offset, bytes);
> > > + status = balance_dirty_pages_ratelimited_flags(mapping,
> > > + bdp_flags);
> > > + if (unlikely(status))
> > > + break;
> > > +
> > > + /*
> > > + * If completions already occurred and reported errors, give up
> > > + * now and don't bother submitting more bios.
> > > + */
> > > + if (unlikely(data_race(wt_ctx->error))) {
> > > + wt_ctx->nr_bvecs = 0;
> > > + break;
> > > + }
> > > +
> > > + if (bytes > iomap_length(iter))
> > > + bytes = iomap_length(iter);
> > > +
> > > + /*
> > > + * Bring in the user page that we'll copy from _first_.
> > > + * Otherwise there's a nasty deadlock on copying from the
> > > + * same page as we're writing to, without it being marked
> > > + * up-to-date.
> > > + *
> > > + * For async buffered writes the assumption is that the user
> > > + * page has already been faulted in. This can be optimized by
> > > + * faulting the user page.
> > > + */
> > > + if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> > > + status = -EFAULT;
> > > + break;
> > > + }
> > > +
> > > + status = iomap_write_begin(iter, wt_ops->write_ops, &folio,
> > > + &offset, &bytes);
> > > + if (unlikely(status)) {
> > > + iomap_write_failed(iter->inode, iter->pos, bytes);
> > > + break;
> > > + }
> > > + if (iter->iomap.flags & IOMAP_F_STALE)
> > > + break;
> > > +
> > > + pos = iter->pos;
> > > +
> > > + if (mapping_writably_mapped(mapping))
> > > + flush_dcache_folio(folio);
> > > +
> > > + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> > > + written = iomap_write_end(iter, bytes, copied, folio) ?
> > > + copied : 0;
> > > +
> > > + if (!written)
> > > + goto put_folio;
> > > +
> > > + off_aligned = round_down(offset, bs);
> > > + len_aligned = round_up(offset + written, bs) - off_aligned;
> > > +
> > > + iomap_folio_prepare_writethrough(folio, off_aligned,
> > > + len_aligned);
> > > +
> > > + if (!wt_ctx->nr_bvecs)
> > > + wt_ctx->bio_pos = round_down(pos, bs);
> > > +
> > > + bvec_set_folio(&wt_ctx->bvec[wt_ctx->nr_bvecs], folio,
> > > + len_aligned, off_aligned);
> >
> > Shouldn't we zero out the tail of the folio if we are submitting partial
> > folio for write?
>
> Hmm, so for the folio range we zeroout if needed in
> __iomap_write_begin(). I think that should take care of this right?
Yeah, right, that seems to do it.
> > > + wt_ctx->nr_bvecs++;
> > > + wt_ctx->written += written;
> > > +
> > > + if (pos + written > wt_ctx->new_i_size)
> > > + wt_ctx->new_i_size = pos + written;
> >
> > I'm probably missing something here but where is i_size update handled? I
> > don't see new_i_size used anywhere?
>
> So the i_size update happens in endio(), similar to dio. We initially
> had the update in iomap_writethrough_iter in v1 however based on Dave's
> feedback [1], moved it to the endio. The idea is for writethrough
> semantics to be closer to dio hence we either update isize when we
> succeffuly write, or return an error to user without update isize.
>
> [1] https://lore.kernel.org/linux-fsdevel/aa--rBKQG7ck5nuM@dread/
>
> > Also why is it OK to not call pagecache_isize_extended() but that goes
> > with the i_size update...
>
> As for pagecache_isize_extended(), (this might be a bit tangential from
> your comment but) after this email, I started diggin a bit more into why
> it is needed. As per my understanding, it tackles 2 things:
>
> Problem 1. mkclean's the old EOF folio so that the FS can fault again. This
> allows us to allocate new blocks which previously might not be allocated
> if bs < ps.
>
> Problem 2. Since mmap writes can dirty data beyond EOF, we zero the range from
> old EOF to end of that folio so that readers dont read junk data after
> isize extension.
Correct.
> Another thing I noticed is that most users of
> iomap_file_buffered_write() do their own eof zeroing in the FS layer
> (eg, xfs_file_write_zero_eof(), ext4's new changes,
> ntfs_extend_initialized_size() etc).
> I think this FS level zerooing should take care of mkcleaning the eof
> folio (problem 1), as they call iomap_zero_range() which would flush the
> eof range anyways. So am I right in assuming that for FSes that do their
> own zeroing, 1. is already taken care of?
Well, I don't see anything that would writeprotect the old tail page in
iomap_zero_range(). I think iomap_zero_range() calls are there mostly to
address 2. Not only due to mmap but also possibly to clear whatever junk
there can be in the blocks after EOF.
> As for 2, I think after the EOF zeroing of the FS, there might be a
> window before iomap_write_iter() where an mmap writer can still dirty
> EOF blocks, hence the pagecache_isize_extended() would be needed here.
> But doesn't that then make the eof zeroing in the FS layer redundant? Am
> I missing something here?
Hmm, I agree the zeroing looks duplicit (for some users of
pagecache_isize_extended()). And yes, doing the zeroing from
xfs_file_write_zero_eof() is somewhat racy (mmap writer can still come and
write non-zeros before we update i_size) but I'd have hard time to argue it
really practically matters - you are racing mmap writes with buffered
writes so any kind of write atomicity guarantees are not there.
> Regardless, for our case I think we will also need to do the
> pagecache_isize_extended(), mainly to take care of problem 2, but where
> exactly should we do it now? We currently change the isize in endio()
> but for aio, it can run outside inode or folio lock. I think this
> function needs to be called under inode lock(). Hmm.. its a bit late here so
> I'll revisit this tomorrow with a fresh mind :)
I think mainly to take care of problem 1... You are correct about
inode_lock but since we are updating i_size, we should be better holding
it, shouldn't we?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2026-04-20 14:37 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-08 18:45 [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io() Ojaswin Mujoo
2026-04-15 6:14 ` Christoph Hellwig
2026-04-08 18:45 ` [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-16 12:05 ` Jan Kara
2026-04-16 12:34 ` Jan Kara
2026-04-17 19:42 ` Ojaswin Mujoo
2026-04-20 11:28 ` Jan Kara [this message]
2026-04-17 4:13 ` Pankaj Raghav (Samsung)
2026-04-18 7:33 ` Ojaswin Mujoo
2026-04-20 11:56 ` Pankaj Raghav (Samsung)
2026-04-08 18:45 ` [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH Ojaswin Mujoo
2026-04-08 18:45 ` [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough Ojaswin Mujoo
2026-04-17 3:54 ` [RFC PATCH v2 0/5] Add buffered write-through support to iomap & xfs Pankaj Raghav (Samsung)
2026-04-18 7:26 ` Ojaswin Mujoo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=laegs2jt5ifqmfmdlsl7gsrc64z4nrnwbxazewbr7jk27emdej@l6vhj7pjh3dy \
--to=jack@suse.cz \
--cc=andres@anarazel.de \
--cc=brauner@kernel.org \
--cc=dgc@kernel.org \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=john.g.garry@oracle.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=p.raghav@samsung.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox