From: Jan Kara <jack@suse.cz>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	 djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	hch@lst.de,  ritesh.list@gmail.com, jack@suse.cz,
	Luis Chamberlain <mcgrof@kernel.org>,
	 dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com,
	andres@anarazel.de,  brauner@kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
Date: Thu, 16 Apr 2026 14:05:34 +0200
Message-ID: <5l2pnrwodwbey7lwmysjxldqpm2kbyi7kqp5tqg7xozvaoecuh@dcglcdr3ipnz>
In-Reply-To: <d44171d5a1ec783722019ad7d1a4ede0478cc914.1775658795.git.ojaswin@linux.ibm.com>

On Thu 09-04-26 00:15:43, Ojaswin Mujoo wrote:
> This adds initial support for performing buffered non-aio
> RWF_WRITETHROUGH write. The rough flow for a writethrough write is as
> follows:
> 
> 1. Acquire inode lock
> 2. initialize writethrough context (wt_ctx) and mark
>    mapping as stable.
> 3. Start the iomap_iter() loop. For each iomap:
>   3.1. Acquire folio and folio_lock.
>   3.2. perform memcpy from user buffer to the folio and mark it
>        dirty
>   3.3. Wait for any current writeback to complete and then call
>        folio_mkclean() to prevent mmap writes from changing it.
>   3.4. Start writeback on the folio
>   3.5. Add the folio range under write to wt_ctx->bvec and folio_unlock()
>   3.6. If bvec is full, submit the current bvecs for IO.
>   3.7. Repeat 3.2 to 3.6 till the whole iomap is processed. Submit
>        the final set of bvecs for IO.
> 4. Repeat step 3 till we have no more data to write.
> 5. Finally, sleep in the syscall thread till all the IOs are
>    completed (refcount == 0). Once that happens, the end io handler will
>    wake us up.
> 6. Upon waking up, call fs ->end_io() callback (which updates inode
>    size), record any errors and return.
> 7. inode_unlock()
> 
> This design gives buffered writethrough the same semantics as dio and
> any error in the IO is directly returned to the caller. The design has
> deliberately open coded the IO submission and completion flow (inspired
> by dio) rather than reusing the dio functions as accommodating buffered
> writethrough logic in dio code was polluting it with too many if else
> conditionals and special cases.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Dave Chinner <dgc@kernel.org>
> Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Overall this looks good to me. Just a few smaller things below:

> +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> +				   struct iomap_iter *iter, struct iov_iter *i,
> +				   const struct iomap_writethrough_ops *wt_ops)
> +
> +{
> +	ssize_t total_written = 0;
> +	int status = 0;
> +	struct address_space *mapping = iter->inode->i_mapping;
> +	size_t chunk = mapping_max_folio_size(mapping);
> +	unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
> +	unsigned int bs = i_blocksize(iter->inode);
> +
> +	/* copied over based on DIO handles these flags */
			       ^ missing 'how' here

> +ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
> +				      const struct iomap_writethrough_ops *wt_ops,
> +				      void *private)
> +{
> +	struct inode *inode = iocb->ki_filp->f_mapping->host;
> +	struct iomap_iter iter = {
> +		.inode		= inode,
> +		.pos		= iocb->ki_pos,
> +		.len		= iov_iter_count(i),
> +		.flags		= IOMAP_WRITE | IOMAP_WRITETHROUGH,
> +		.private	= private,
> +	};
> +	struct iomap_writethrough_ctx *wt_ctx;
> +	unsigned int max_bvecs;
> +	ssize_t ret;
> +
> +
> +	/*
> +	 * For now we don't support any other flag with WRITETHROUGH
> +	 */
> +	if (!(iocb->ki_flags & IOCB_WRITETHROUGH))
> +		return -EINVAL;
> +	if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_DONTCACHE))
> +		return -EINVAL;
> +	if (iocb_is_dsync(iocb))
> +		/* D_SYNC support not implemented yet */
> +		return -EOPNOTSUPP;
> +	if (!is_sync_kiocb(iocb))
> +		/* aio support not implemented yet */
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * +1 to max bvecs to account for unaligned write spanning multiple
> +	 * folios
> +	 */
> +	max_bvecs = DIV_ROUND_UP(
> +		iov_iter_count(i),
> +		PAGE_SIZE << mapping_min_folio_order(inode->i_mapping)) + 1;

Can this overflow? iov_iter_count() returns size_t, which is unsigned long,
while max_bvecs is only unsigned int.
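
To make the concern concrete, here is a minimal userspace sketch (not the
patch's code). It assumes a 64-bit build where size_t is wider than
unsigned int. SKETCH_BIO_MAX_VECS and both function names are made up for
illustration, with 256 standing in for BIO_MAX_VECS. The first helper
mirrors the DIV_ROUND_UP(...) + 1 shape and narrows the result; the second
clamps while the value is still a size_t:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for the kernel's BIO_MAX_VECS. */
#define SKETCH_BIO_MAX_VECS 256U

/*
 * Same shape as the patch: round up, add one, then narrow to unsigned int.
 * The narrowing silently drops the high bits of a large size_t result.
 */
unsigned int sketch_max_bvecs_naive(size_t count, size_t min_folio_bytes)
{
	assert(min_folio_bytes != 0);
	return (unsigned int)((count + min_folio_bytes - 1) / min_folio_bytes + 1);
}

/*
 * Clamp before narrowing. The round-up is done with / and % so that the
 * intermediate value cannot wrap, and the +1 is only applied once we know
 * the result stays below the cap.
 */
unsigned int sketch_max_bvecs_clamped(size_t count, size_t min_folio_bytes)
{
	size_t n;

	assert(min_folio_bytes != 0);
	n = count / min_folio_bytes;
	if (count % min_folio_bytes)
		n++;
	if (n >= SKETCH_BIO_MAX_VECS)
		return SKETCH_BIO_MAX_VECS;
	return (unsigned int)n + 1;
}
```

With count = 0xffffffff * 4096 and 4096-byte folios, the naive helper's
result wraps to 0 on such a build, while the clamped one returns 256.
Whether a count that large can actually reach this function is a separate
question.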

> +
> +	if (max_bvecs > BIO_MAX_VECS)
> +		max_bvecs = BIO_MAX_VECS;
> +	if (!max_bvecs)
> +		max_bvecs = 1;

I don't think 0 is possible here since we do +1 in the max_bvecs
computation above.

> +
> +	wt_ctx = kzalloc(struct_size(wt_ctx, bvec, max_bvecs), GFP_NOFS);
> +	if (!wt_ctx)
> +		return -ENOMEM;
> +
> +	wt_ctx->iocb = iocb;
> +	wt_ctx->inode = inode;
> +	wt_ctx->dops = wt_ops->dops;
> +	wt_ctx->pos = iocb->ki_pos;
> +	wt_ctx->new_i_size = i_size_read(inode);
> +	wt_ctx->max_bvecs = max_bvecs;
> +	atomic_set(&wt_ctx->ref, 1);
> +	wt_ctx->waiter = current;
> +
> +	mapping_set_stable_writes(inode->i_mapping);
> +

We should check whether the mapping is already marked as requiring stable
pages, and avoid messing with (in particular clearing) the flag in that case.
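
Roughly what I have in mind, as a userspace model rather than kernel code:
struct sketch_mapping and both helpers are hypothetical stand-ins, and the
single bool models the per-mapping stable-writes flag. In the kernel I would
expect this to be built on the existing mapping_stable_writes() test next to
the set/clear helpers, but treat that as an assumption. The sketch also
ignores concurrent users of the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space; models only one flag. */
struct sketch_mapping {
	bool stable_writes;
};

/*
 * Set the flag only if nobody set it before us.
 * Returns true when this call changed the flag, so the caller knows
 * whether it is responsible for clearing it again.
 */
bool sketch_stable_writes_acquire(struct sketch_mapping *m)
{
	assert(m != NULL);
	if (m->stable_writes)
		return false;
	m->stable_writes = true;
	return true;
}

/* Clear the flag only if the matching acquire call was the one that set it. */
void sketch_stable_writes_release(struct sketch_mapping *m, bool we_set_it)
{
	assert(m != NULL);
	if (we_set_it)
		m->stable_writes = false;
}
```

A mapping that already required stable pages before the write then still
has the flag set after release, which is the case I am worried about.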

> +	while ((ret = iomap_iter(&iter, wt_ops->ops)) > 0) {
> +		WARN_ON(iter.iomap.type != IOMAP_UNWRITTEN &&
> +			iter.iomap.type != IOMAP_MAPPED);
> +		iter.status = iomap_writethrough_iter(wt_ctx, &iter, i, wt_ops);
> +	}
> +	if (ret < 0)
> +		cmpxchg(&wt_ctx->error, 0, ret);
> +
> +	if (!atomic_dec_and_test(&wt_ctx->ref)) {
> +		for (;;) {
> +			set_current_state(TASK_UNINTERRUPTIBLE);
> +			if (!READ_ONCE(wt_ctx->waiter))
> +				break;
> +			blk_io_schedule();
> +		}
> +		__set_current_state(TASK_RUNNING);
> +	}
> +
> +	return iomap_writethrough_complete(wt_ctx);
> +}

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

