From: Pankaj Raghav <pankaj.raghav@linux.dev>
To: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org
Cc: Andres Freund <andres@anarazel.de>,
djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
hch@lst.de, ritesh.list@gmail.com, jack@suse.cz,
ojaswin@linux.ibm.com, Luis Chamberlain <mcgrof@kernel.org>,
dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
vi.shah@samsung.com
Subject: [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Fri, 13 Feb 2026 11:20:36 +0100
Message-ID: <d0c4d95b-8064-4a7e-996d-7ad40eb4976b@linux.dev>
Hi all,
Atomic (untorn) writes for direct I/O have landed in the kernel for
ext4 and XFS[1][2]. However, extending this support to buffered I/O
remains a contentious topic, with previous discussions often stalling
over concerns about complexity versus utility.
I would like to propose a session to discuss the concrete use cases for
buffered atomic writes and, if possible, the outstanding architectural
issues blocking the current RFCs[3][4].
## Use Case:
A recurring objection to buffered atomics is the lack of a convincing
use case, with the argument that databases should simply migrate to
direct I/O. We have been working with PostgreSQL developer Andres
Freund, who has highlighted a specific architectural requirement where
buffered I/O remains preferable in certain scenarios.
While Postgres recently started to support direct I/O, optimal
performance requires a large, statically configured user-space buffer
pool. This becomes problematic when running many Postgres instances on
the same hardware, a common deployment scenario. Statically partitioning
RAM for direct I/O caches across many instances is inefficient compared
to allowing the kernel page cache to dynamically balance memory pressure
between instances.
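To make the argument concrete, here is a toy calculation. The host
size, instance count, and working-set sizes are made-up numbers, and
the "shared" model is an idealized page cache that always hands memory
to whoever needs it, which a real kernel only approximates:

```python
GIB = 1 << 30


def cached_bytes_static(working_sets, total_ram):
    """Each instance owns a fixed, equal slice of RAM for its buffer pool."""
    share = total_ram // len(working_sets)
    return sum(min(ws, share) for ws in working_sets)


def cached_bytes_shared(working_sets, total_ram):
    """Idealized shared page cache: memory flows to whichever instance needs it."""
    return min(sum(working_sets), total_ram)


# Hypothetical host: 64 GiB of RAM, one busy instance and three mostly idle ones.
working_sets = [40 * GIB, 4 * GIB, 4 * GIB, 4 * GIB]
static = cached_bytes_static(working_sets, 64 * GIB)
shared = cached_bytes_shared(working_sets, 64 * GIB)
print(static // GIB, shared // GIB)
```

With these numbers the static split caches 28 GiB of the 52 GiB total
working set, because the busy instance is capped at its 16 GiB slice
while the idle instances leave most of theirs unused. The idealized
shared cache holds all 52 GiB.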
The other use case is running Postgres as part of a larger workload on
one instance. Dedicating enough memory to Postgres' buffer pool to make
direct I/O viable is often not realistic, because some deployments
require a lot of memory to cache database I/O, while others need a lot
of memory for non-database caching.
Enabling atomic writes for this buffered workload would allow Postgres
to disable full-page writes [5]. For direct I/O, this has been shown to
reduce transaction variability; for buffered I/O, we expect similar
gains, alongside decreased WAL bandwidth and storage costs for WAL
archival. As a side note, for most workloads full-page writes occupy a
significant portion of WAL volume.
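As a rough sketch of what "untorn" would require of a database page
write: the check below follows the rules pwritev2(2) documents for
RWF_ATOMIC with direct I/O today (power-of-two length, length within
the unit_min/unit_max range reported by statx(2), naturally aligned
offset). That a buffered interface would keep the same rules is an
assumption, and the helper names and the limits used in the example
are hypothetical:

```python
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def fits_untorn_write(offset: int, length: int,
                      unit_min: int, unit_max: int) -> bool:
    """Return True if (offset, length) satisfies the RWF_ATOMIC-style rules.

    unit_min/unit_max stand in for the statx() fields
    stx_atomic_write_unit_min and stx_atomic_write_unit_max.
    """
    return (is_power_of_two(length)
            and unit_min <= length <= unit_max
            and offset % length == 0)


PG_PAGE = 8192  # PostgreSQL's default block size

# Hypothetical limits: 4 KiB minimum, 64 KiB maximum.
print(fits_untorn_write(3 * PG_PAGE, PG_PAGE, 4096, 65536))  # aligned page
print(fits_untorn_write(4096, PG_PAGE, 4096, 65536))         # misaligned
```

An 8 KiB page written at a multiple of 8 KiB passes the check, so each
Postgres page write could be issued as a single untorn write, which is
what makes full-page writes redundant. The same page at offset 4096
fails because the offset is not naturally aligned to the length.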
Andres has agreed to attend LSFMM this year to discuss these requirements.
## Discussion:
We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
Based on those earlier conversations and the objections raised in them,
the discussion at LSFMM should focus on the following blocking issues:
- Handling Short Writes under Memory Pressure[6]: A buffered atomic
  write might span page boundaries. If memory pressure causes a page
  fault or reclaim mid-copy, the write could be torn inside the page
  cache before it even reaches the filesystem.
  - The current RFC uses a "pinning" approach: pinning user pages and
    creating a BVEC to ensure the full copy can proceed atomically.
    This adds complexity to the write path.
  - Discussion: Is this acceptable? Should we consider alternatives,
    such as requiring userspace to mlock the I/O buffers before
    issuing the write to guarantee atomic copy in the page cache?
- Page Cache Model vs. Filesystem CoW: The current RFC introduces a
  PG_atomic page flag to track dirty pages requiring atomic writeback.
  This faced pushback because page flags are a scarce resource[7].
  Furthermore, it was argued that the atomic model does not fit the
  buffered I/O model because data sitting in the page cache is
  vulnerable to modification before writeback occurs, and writeback
  does not preserve application ordering[8].
  - Dave Chinner has proposed leveraging the filesystem's CoW path,
    where we always allocate new blocks for the atomic write (forced
    CoW). If the hardware supports it (e.g., NVMe atomic limits), the
    filesystem can optimize the writeback to use REQ_ATOMIC in place,
    avoiding the CoW overhead while maintaining the architectural
    separation.
  - Discussion: While the CoW approach fits XFS and other CoW
    filesystems well, it presents challenges for filesystems like ext4,
    which lack CoW capabilities for data. Should this be a
    filesystem-specific feature?
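The writeback choice in the forced-CoW proposal above can be sketched
as a small decision function. This is only an illustration of the idea
as summarized here, not code from any RFC: the function name, the enum,
and the single hw_atomic_max parameter are hypothetical, and the sketch
ignores extent layout and everything else a real filesystem would have
to check:

```python
from enum import Enum


class Writeback(Enum):
    REQ_ATOMIC_IN_PLACE = "req_atomic_in_place"
    FORCED_COW = "forced_cow"
    UNSUPPORTED = "unsupported"


def pick_writeback(offset: int, length: int,
                   hw_atomic_max: int, fs_has_data_cow: bool) -> Writeback:
    """Choose how an atomic range could be written back.

    hw_atomic_max is the largest write the device completes untorn
    (0 if it offers no such guarantee).
    """
    hw_ok = (length > 0
             and (length & (length - 1)) == 0   # power-of-two length
             and length <= hw_atomic_max        # within the device limit
             and offset % length == 0)          # naturally aligned
    if hw_ok:
        return Writeback.REQ_ATOMIC_IN_PLACE    # no CoW overhead needed
    if fs_has_data_cow:
        return Writeback.FORCED_COW             # allocate new blocks, then remap
    return Writeback.UNSUPPORTED                # e.g. no data CoW available


# Hypothetical device limit of 16 KiB.
print(pick_writeback(8192, 8192, 16384, fs_has_data_cow=True))
print(pick_writeback(0, 65536, 16384, fs_has_data_cow=True))
print(pick_writeback(0, 65536, 16384, fs_has_data_cow=False))
```

The last case is the one the ext4 question is about: without a data CoW
path, a write larger than the hardware limit has no fallback in this
model.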
Comments or Curses, all are welcome.
--
Pankaj
[1] https://lwn.net/Articles/1009298/
[2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html
[3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/
[4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
[5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
[6] https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/
[7] https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/
[8] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/