From: Ojaswin Mujoo <ojaswin@linux.ibm.com>
To: Pankaj Raghav <pankaj.raghav@linux.dev>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	Andres Freund <andres@anarazel.de>,
	djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	hch@lst.de, ritesh.list@gmail.com, jack@suse.cz,
	Luis Chamberlain <mcgrof@kernel.org>,
	dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
	gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
	vi.shah@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Tue, 17 Feb 2026 22:50:17 +0530
Message-ID: <aZSjUWBkUFeJNEL6@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
In-Reply-To: <7cf3f249-453d-423a-91d1-dfb45c474b78@linux.dev>

On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote:
> On 2/13/26 14:32, Ojaswin Mujoo wrote:
> > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> >> Hi all,
> >>
> >> Atomic (untorn) writes for Direct I/O have successfully landed in the
> >> kernel for ext4 and XFS[1][2]. However, extending this support to
> >> Buffered I/O remains a contentious topic, with previous discussions
> >> often stalling due to concerns about complexity versus utility.
> >>
> >> I would like to propose a session to discuss the concrete use cases for
> >> buffered atomic writes and, if possible, talk about the outstanding
> >> architectural blockers holding up the current RFCs[3][4].
> > 
> > Hi Pankaj,
> > 
> > Thanks for the proposal, and glad to hear there is wider interest in
> > this topic. We have also been actively working on this, and I am in the
> > middle of testing and ironing out bugs in my RFC v2 for buffered atomic
> > writes, which is largely based on Dave's suggestion to maintain atomic
> > write mappings in the FS layer (i.e. the XFS COW fork). In fact, I was
> > going to propose a discussion on this myself :)
> > 
> 
> Perfect.
> 
> >>
> >> ## Use Case:
> >>
> >> A recurring objection to buffered atomics is the lack of a convincing use
> >> case, with the argument that databases should simply migrate to direct I/O.
> >> We have been working with PostgreSQL developer Andres Freund, who has
> >> highlighted a specific architectural requirement where buffered I/O remains
> >> preferable in certain scenarios.
> > 
> > Looks like you have some nice insights to cover from the postgres side,
> > which the filesystem community has been asking for. As I've also been
> > working on the kernel implementation side of it, do you think we could
> > do a joint session on this topic?
> >
> As one of the main pushbacks against this feature has been the lack of a
> valid use case, the main outcome I would like to get out of this session
> is a community consensus on the use case for this feature.
> 
> It looks like you have already made quite a bit of progress with the CoW
> implementation, so it would be great if it can be a joint session.

Awesome! 

> 
> 
> >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
> >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
> >> Based on the conversations and blockers we have had before, the
> >> discussion at LSFMM should focus on the following blocking issues:
> >>
> >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
> >>   write might span page boundaries. If memory pressure causes a page
> >>   fault or reclaim mid-copy, the write could be torn inside the page
> >>   cache before it even reaches the filesystem.
> >>     - The current RFC uses a "pinning" approach: pinning user pages and
> >>       creating a BVEC to ensure the full copy can proceed atomically.
> >>       This adds complexity to the write path.
> >>     - Discussion: Is this acceptable? Should we consider alternatives,
> >>       such as requiring userspace to mlock the I/O buffers before
> >>       issuing the write to guarantee an atomic copy in the page cache?
> > 
> > Right, I chose this approach because we only get to know about the short
> > copy after it has actually happened in copy_folio_from_iter_atomic(),
> > and it seemed simpler to just not let the short copy happen. This is
> > inspired by how dio pins the pages for DMA, just that we do it
> > for a shorter time.
> > 
> > It does add slight complexity to the path, but I'm not sure if it's
> > complex enough to justify a hard requirement that pages be mlock'd.
> > 
> 
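
(To recap, the mechanism in the current RFC is conceptually along these
lines; a simplified sketch from my side with a made-up helper name, where
error handling, sub-page offsets and the eventual unpin are all omitted:)

    /*
     * Pin the user buffer up front and switch to a BVEC iterator so
     * that copy_folio_from_iter_atomic() can no longer hit a page
     * fault (and hence a short copy) mid-way through the copy.
     */
    static int atomic_write_pin_iter(struct iov_iter *from, size_t len,
                                     struct page **pages,
                                     struct bio_vec *bv,
                                     struct iov_iter *bi)
    {
            int i, npages = DIV_ROUND_UP(len, PAGE_SIZE);

            /* FOLL_PIN semantics, like dio, but only for the copy window */
            if (pin_user_pages_fast((unsigned long)iter_iov_addr(from),
                                    npages, 0, pages) != npages)
                    return -EFAULT;

            for (i = 0; i < npages; i++)
                    bvec_set_page(&bv[i], pages[i], PAGE_SIZE, 0);

            /* copies from this iterator cannot fault anymore */
            iov_iter_bvec(bi, ITER_SOURCE, bv, npages, len);
            return 0;
    }
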
> As databases like postgres have a buffer cache that they manage in
> userspace, which is eventually used to do IO, I am wondering if they
> already mlock it or otherwise guarantee that the buffer cache does not
> get reclaimed. That is why I was wondering if we could make it a
> requirement. Of course, that also requires checking that the range is
> mlocked in the iomap_write_iter path.

Hmm, got it. I still feel it might be overkill for something we already
have a mechanism for and can achieve easily, but I'm open to discussing
this :)
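
That said, if we did go that route, the userspace contract would
presumably look something like the below (a rough sketch on my end; it
assumes RWF_ATOMIC is accepted on a regular, non-O_DIRECT fd, which is
exactly what is being proposed, and the fallback define is only needed
if the libc headers predate RWF_ATOMIC):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/uio.h>

    #ifndef RWF_ATOMIC
    #define RWF_ATOMIC 0x00000040   /* value from the kernel uapi headers */
    #endif

    static ssize_t atomic_buffered_write(int fd, void *buf, size_t len,
                                         off_t off)
    {
            struct iovec iov = { .iov_base = buf, .iov_len = len };

            /* pin the DB buffer so reclaim can't fault the copy mid-way */
            if (mlock(buf, len) != 0)
                    return -1;

            /* untorn, all-or-nothing write into the page cache */
            return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
    }

In practice postgres would presumably mlock its shared buffers once at
startup rather than per write, but the kernel-side check would be the
same.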

> 
> >>
> >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
> >>   PG_atomic page flag to track dirty pages requiring atomic writeback.
> >>   This faced pushback due to page flags being a scarce resource[7].
> >>   Furthermore, it was argued that the atomic model does not fit the buffered
> >>   I/O model because data sitting in the page cache is vulnerable to
> >>   modification before writeback occurs, and writeback does not preserve
> >>   application ordering[8].
> >>     -  Dave Chinner has proposed leveraging the filesystem's CoW path
> >>        where we always allocate new blocks for the atomic write (forced
> >>        CoW). If the hardware supports it (e.g., NVMe atomic limits), the
> >>        filesystem can optimize the writeback to use REQ_ATOMIC in place,
> >>        avoiding the CoW overhead while maintaining the architectural
> >>        separation.
> > 
> > Right, this is what I'm doing in the new RFC, where we maintain the
> > mappings for atomic writes in the COW fork. This way we are able to
> > utilize a lot of existing infrastructure; however, it does add some
> > complexity to the ->iomap_begin() and ->writeback_range() callbacks of
> > the FS. I believe it is a reasonable tradeoff, since the general
> > consensus was mostly to avoid adding too much complexity to the iomap
> > layer.
> > 
> > Another thing that came up is to consider using write-through semantics
> > for buffered atomic writes, where we transition the page to the
> > writeback state immediately after the write and prevent any other users
> > from modifying the data until writeback completes. This might affect
> > performance, since we won't be able to batch similar atomic IOs, but
> > maybe applications like postgres would not mind this too much. If we go
> > with this approach, we can avoid worrying too much about other users
> > changing atomic data underneath us.
> > 
> 
> Hmm, IIUC, postgres writes out its dirty buffer cache by combining
> multiple DB pages based on `io_combine_limit` (typically 128KB). So
> immediately writing them might be OK, as long as we don't remove those
> pages from the page cache like we do for RWF_UNCACHED.

Yep, and I've not looked at the code path much, but I think if we really
care about the user not changing the data between write and writeback,
then we will probably need to start the writeback while holding the
folio lock, which is currently not done for RWF_UNCACHED.
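
Thinking out loud, the write-through path could look roughly like the
below. This is just a sketch of the idea, not code against any actual
tree, and it glosses over the iomap plumbing and all the error paths:

    /*
     * After the atomic copy, push the folio straight to the writeback
     * state while still holding the folio lock, so no other writer can
     * redirty it between the copy and the I/O submission.
     */
    static void atomic_write_through(struct folio *folio)
    {
            /* folio is locked and fully uptodate at this point */
            folio_mark_dirty(folio);

            /* mirror writeback, but immediately and under the folio lock */
            if (folio_clear_dirty_for_io(folio))
                    folio_start_writeback(folio);
            folio_unlock(folio);

            /*
             * ... submit the REQ_ATOMIC bio covering this folio here;
             * folio_end_writeback() runs on I/O completion ...
             */
    }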

> 
> 
> > An argument against this, however, is that it is the user's
> > responsibility not to do non-atomic IO over an atomic range, and doing
> > so should be considered a userspace usage error. This is similar to how
> > users can tear a dio if they perform overlapping writes [1].
> > 
> > That being said, I think these points are worth discussing, and it
> > would be helpful to have people from postgres around while discussing
> > these semantics with the FS community members.
> > 
> > As for the ordering of writes, I'm not sure that is something we should
> > guarantee via the RWF_ATOMIC API. Ensuring ordering has mostly been the
> > task of userspace, via fsync() and friends.
> > 
> 
> Agreed.
> 
> > 
> > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
> > 
> >>     - Discussion: While the CoW approach fits XFS and other CoW
> >>       filesystems well, it presents challenges for filesystems like
> >>       ext4, which lack CoW capabilities for data. Should this be a
> >>       filesystem-specific feature?
> > 
> > I believe your question is whether we should have a hard dependency on
> > COW mappings for atomic writes. Currently, COW in the atomic write
> > context in XFS is used for these two things:
> > 
> > 1. The COW fork holds atomic write ranges.
> > 
> > This is not strictly a COW feature; we are just repurposing the COW
> > fork to hold our atomic ranges. It is basically a way for the writeback
> > path to know that an atomic write was done here.
> > 
> > The COW fork is one way to do this, but I believe every FS has some
> > version of an in-memory extent tree where such ephemeral atomic write
> > mappings can be held. The extent status cache is ext4's version of
> > this, and it can be used to manage the atomic write ranges.
> > 
> > An alternate suggestion that came up in discussions with Ted and
> > Darrick is that we could instead use a generic side-car structure that
> > holds atomic write ranges. FSes can populate it during atomic writes
> > and query it in their writeback paths.
> > 
> > This means that for any FS operation (think truncate, falloc, mwrite,
> > write ...) we would need to keep this structure in sync, which can
> > become pretty complex pretty fast. I'm yet to implement this, though,
> > so I'm not sure how it would look in practice.
> > 
> > 2. The COW feature as a whole enables software-based atomic writes.
> > 
> > This is something that ext4 won't be able to support (right now), just
> > as we don't support software atomic writes for dio.
> > 
> > I believe Baokun and Yi are working on a feature that can eventually
> > enable COW writes in ext4 [2]. Until we have something like that, we
> > would have to rely on hardware support.
> > 
> > Regardless, the ability to support software atomic writes largely
> > depends on the filesystem, so I'm not sure how we could lift this up
> > to a generic layer anyway.
> > 
> > [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/
> > 
> 
> Thanks for the explanation. I am also planning to take a shot at the CoW
> approach. I would be more than happy to review and test if you send an
> RFC in the meantime.

Thanks Pankaj, I'm testing the current RFC internally. I think I'll have
something in the coming weeks, and then we can go over the design and
how it looks.
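
Also, to make the side-car idea from above a bit more concrete, what I
have in mind is roughly a per-inode interval tree of atomic ranges.
This is purely illustrative, none of these names exist anywhere, and
locking, allocation failure and range removal are all omitted:

    /* needs <linux/interval_tree.h> and <linux/slab.h> */
    struct atomic_range {
            struct interval_tree_node node; /* [start, last] file offsets */
    };

    /* write path: remember that this range was written atomically */
    static void atomic_range_record(struct rb_root_cached *tree,
                                    loff_t start, loff_t last)
    {
            struct atomic_range *ar = kzalloc(sizeof(*ar), GFP_KERNEL);

            ar->node.start = start;
            ar->node.last = last;
            interval_tree_insert(&ar->node, tree);
    }

    /* writeback path: must this range go out as a single untorn bio? */
    static bool atomic_range_is_atomic(struct rb_root_cached *tree,
                                       loff_t start, loff_t last)
    {
            return interval_tree_iter_first(tree, start, last) != NULL;
    }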

Regards,
ojaswin

> 
> --
> Pankaj
> 


