From: Ojaswin Mujoo <ojaswin@linux.ibm.com>
To: Pankaj Raghav <pankaj.raghav@linux.dev>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
Andres Freund <andres@anarazel.de>,
djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
hch@lst.de, ritesh.list@gmail.com, jack@suse.cz,
Luis Chamberlain <mcgrof@kernel.org>,
dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
vi.shah@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Fri, 13 Feb 2026 19:02:39 +0530 [thread overview]
Message-ID: <aY8n97G_hXzA5MMn@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com> (raw)
In-Reply-To: <d0c4d95b-8064-4a7e-996d-7ad40eb4976b@linux.dev>
On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> Hi all,
>
> Atomic (untorn) writes for Direct I/O have successfully landed in kernel
> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
> remains a contentious topic, with previous discussions often stalling due to
> concerns about complexity versus utility.
>
> I would like to propose a session to discuss the concrete use cases for
> buffered atomic writes and, if possible, talk about the outstanding
> architectural issues blocking the current RFCs[3][4].
Hi Pankaj,
Thanks for the proposal, and glad to hear there is wider interest in
this topic. We have also been actively working on this, and I am in the
middle of testing and ironing out bugs in my RFC v2 for buffered atomic
writes, which is largely based on Dave's suggestion to maintain atomic
write mappings in the FS layer (i.e., the XFS COW fork). In fact, I was
going to propose a discussion on this myself :)
>
> ## Use Case:
>
> A recurring objection to buffered atomics is the lack of a convincing use
> case, with the argument that databases should simply migrate to direct I/O.
> We have been working with PostgreSQL developer Andres Freund, who has
> highlighted a specific architectural requirement where buffered I/O remains
> preferable in certain scenarios.
Looks like you have some nice insights to share from the postgres side,
which the filesystem community has been asking for. As I've also been
working on the kernel implementation side of this, do you think we
could do a joint session on the topic?
>
> While Postgres recently started to support direct I/O, optimal performance
> requires a large, statically configured user-space buffer pool. This becomes
> problematic when running many Postgres instances on the same hardware, a
> common deployment scenario. Statically partitioning RAM for direct I/O
> caches across many instances is inefficient compared to allowing the kernel
> page cache to dynamically balance memory pressure between instances.
>
> The other use case is using postgres as part of a larger workload on one
> instance. Using up enough memory for postgres' buffer pool to make DIO use
> viable is often not realistic, because some deployments require a lot of
> memory to cache database IO, while others need a lot of memory for
> non-database caching.
>
> Enabling atomic writes for this buffered workload would allow Postgres to
> disable full-page writes [5]. For direct I/O, this has been shown to reduce
> transaction variability; for buffered I/O, we expect similar gains,
> alongside decreased WAL bandwidth and storage costs for WAL archival. As a
> side note, for most workloads full page writes occupy a significant portion
> of WAL volume.
>
> Andres has agreed to attend LSFMM this year to discuss these requirements.
Glad to hear people from postgres would also be joining!
>
> ## Discussion:
>
> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
> Based on the conversation/blockers we had before, the discussion at LSFMM
> should focus on the following blocking issues:
>
> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
> write might span page boundaries. If memory pressure causes a page
> fault or reclaim mid-copy, the write could be torn inside the page
> cache before it even reaches the filesystem.
> - The current RFC uses a "pinning" approach: pinning user pages and
> creating a BVEC to ensure the full copy can proceed atomically.
> This adds complexity to the write path.
> - Discussion: Is this acceptable? Should we consider alternatives,
> such as requiring userspace to mlock the I/O buffers before
> issuing the write to guarantee atomic copy in the page cache?
Right, I chose this approach because we only learn about the short copy
after it has actually happened in copy_folio_from_iter_atomic(), so it
seemed simpler to not let the short copy happen in the first place.
This is inspired by how dio pins pages for DMA, except that we do it
for a shorter time.
It does add slight complexity to the write path, but I'm not sure it's
complex enough to justify a hard requirement that the pages be mlock'd.
>
> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
> PG_atomic page flag to track dirty pages requiring atomic writeback.
> This faced pushback due to page flags being a scarce resource[7].
> Furthermore, it was argued that the atomic model does not fit the buffered
> I/O model because data sitting in the page cache is vulnerable to
> modification before writeback occurs, and writeback does not preserve
> application ordering[8].
> - Dave Chinner has proposed leveraging the filesystem's CoW path
> where we always allocate new blocks for the atomic write (forced
> CoW). If the hardware supports it (e.g., NVMe atomic limits), the
> filesystem can optimize the writeback to use REQ_ATOMIC in place,
> avoiding the CoW overhead while maintaining the architectural
> separation.
Right, this is what I'm doing in the new RFC, where we maintain the
mappings for atomic writes in the COW fork. This lets us reuse a lot of
existing infrastructure, though it does add some complexity to the FS's
->iomap_begin() and ->writeback_range() callbacks. I believe that is a
reasonable tradeoff, since the general consensus was mostly to avoid
adding too much complexity to the iomap layer.
Another idea that came up is to use write-through semantics for
buffered atomic writes, where we transition the page to the writeback
state immediately after the write, preventing any other user from
modifying the data until writeback completes. This might hurt
performance since we can no longer batch similar atomic IOs, but maybe
applications like postgres would not mind that too much. With this
approach we would not have to worry much about other users changing
atomic data underneath us.
An argument against this, however, is that it is the user's
responsibility not to issue non-atomic IO over an atomic range, and
doing so should be considered a userspace usage error. This is similar
to how users can tear a dio by performing overlapping writes [1].
That being said, I think these points are worth discussing and it would
be helpful to have people from postgres around while discussing these
semantics with the FS community members.
As for ordering of writes, I'm not sure that is something we should
guarantee via the RWF_ATOMIC API. Ensuring ordering has mostly been the
job of userspace, via fsync() and friends.
[1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
> - Discussion: While the CoW approach fits XFS and other CoW
> filesystems well, it presents challenges for filesystems like ext4
> which lack CoW capabilities for data. Should this be a filesystem
> specific feature?
I believe your question is whether we should have a hard dependency on
COW mappings for atomic writes. Currently, COW in the atomic write
context in XFS is used for two things:
1. The COW fork holds atomic write ranges.
This is not strictly a COW feature; we are just repurposing the COW
fork to hold our atomic ranges, basically as a way for the writeback
path to know that an atomic write was done there.
The COW fork is one way to do this, but I believe every FS has some
version of an in-memory extent tree where such ephemeral atomic write
mappings can be held. The extent status cache is ext4's version of
this, and it can be used to manage the atomic write ranges.
An alternate suggestion that came up in discussions with Ted and
Darrick is to instead use a generic side-car structure that holds the
atomic write ranges. FSes would populate it during atomic writes and
query it in their writeback paths.
This means that for any FS operation (think truncate, falloc, mwrite,
write, ...) we would need to keep this structure in sync, which can
become pretty complex pretty fast. I have yet to implement this, so I'm
not sure how it would look in practice.
2. The COW feature as a whole enables software-based atomic writes.
This is something ext4 won't be able to support (right now), just like
we don't support software atomic writes for dio.
I believe Baokun and Yi are working on a feature that could eventually
enable COW writes in ext4 [2]. Until we have something like that, we
would have to rely on hardware support.
Regardless, the ability to support software atomic writes largely
depends on the filesystem, so I'm not sure how we could lift this up to
a generic layer anyway.
[2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/
Thanks,
Ojaswin
>
> Comments or Curses, all are welcome.
>
> --
> Pankaj
>
> [1] https://lwn.net/Articles/1009298/
> [2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html
> [3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/
> [4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
> [5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
> [6]
> https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/
> [7]
> https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/
> [8]
> https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/
>