From: Pankaj Raghav <pankaj.raghav@linux.dev>
To: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org
Cc: Andres Freund <andres@anarazel.de>,
	djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	hch@lst.de, ritesh.list@gmail.com, jack@suse.cz,
	ojaswin@linux.ibm.com, Luis Chamberlain <mcgrof@kernel.org>,
	dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
	gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
	vi.shah@samsung.com
Subject: [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Fri, 13 Feb 2026 11:20:36 +0100
Message-ID: <d0c4d95b-8064-4a7e-996d-7ad40eb4976b@linux.dev>

Hi all,

Atomic (untorn) writes for Direct I/O have landed in the kernel for ext4
and XFS[1][2]. However, extending this support to Buffered I/O remains a
contentious topic, with previous discussions often stalling due to
concerns about complexity versus utility.
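
For readers less familiar with the direct I/O interface: an untorn write
is requested with pwritev2(2) and RWF_ATOMIC, and the per-file limits are
reported by statx(2) under STATX_WRITE_ATOMIC as
stx_atomic_write_unit_min and stx_atomic_write_unit_max. The sketch below
only illustrates the documented size and alignment rules for such a
write. The helper names are hypothetical and are not a kernel or libc
API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v is a nonzero power of two. */
static bool is_power_of_two(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper: check whether a write of `len` bytes at file
 * offset `off` meets the RWF_ATOMIC rules, given the limits that
 * statx(2) reports for the file (unit_min/unit_max, in bytes).
 */
static bool atomic_write_ok(uint64_t off, uint64_t len,
                            uint32_t unit_min, uint32_t unit_max)
{
    if (unit_min == 0 || unit_max == 0)
        return false;               /* file does not support atomic writes */
    if (!is_power_of_two(len))
        return false;               /* total length must be a power of two */
    if (len < unit_min || len > unit_max)
        return false;               /* length must lie within the limits */
    return (off & (len - 1)) == 0;  /* offset naturally aligned to length */
}
```

With limits of 4 KiB to 64 KiB, a 16 KiB write at offset 16 KiB passes,
while the same write at offset 4 KiB fails the alignment rule.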

I would like to propose a session to discuss the concrete use cases for
buffered atomic writes and, if possible, the outstanding architectural
issues blocking the current RFCs[3][4].

## Use Case:

A recurring objection to buffered atomics is the lack of a convincing 
use case, with the argument that databases should simply migrate to 
direct I/O. We have been working with PostgreSQL developer Andres 
Freund, who has highlighted a specific architectural requirement where 
buffered I/O remains preferable in certain scenarios.

While Postgres recently started to support direct I/O, optimal 
performance requires a large, statically configured user-space buffer 
pool. This becomes problematic when running many Postgres instances on 
the same hardware, a common deployment scenario. Statically partitioning 
RAM for direct I/O caches across many instances is inefficient compared 
to allowing the kernel page cache to dynamically balance memory pressure 
between instances.

The other use case is running Postgres as part of a larger workload on
one instance. Dedicating enough memory to Postgres' buffer pool to make
direct I/O viable is often not realistic, because some deployments need
a lot of memory to cache database I/O, while others need a lot of memory
for non-database caching.

Enabling atomic writes for this buffered workload would allow Postgres
to disable full-page writes [5]. For direct I/O, this has been shown to
reduce transaction variability; for buffered I/O, we expect similar
gains, alongside decreased WAL bandwidth and storage costs for WAL
archival. As a side note, for most workloads full-page writes occupy a
significant portion of WAL volume.

Andres has agreed to attend LSFMM this year to discuss these requirements.

## Discussion:

We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
was a previous LSFMM proposal about untorn buffered writes from Ted Ts'o.
Based on the blockers raised in those earlier conversations, the
discussion at LSFMM should focus on the following issues:

- Handling Short Writes under Memory Pressure[6]: A buffered atomic
   write might span page boundaries. If memory pressure causes a page
   fault or reclaim mid-copy, the write could be torn inside the page
   cache before it even reaches the filesystem.
     - The current RFC uses a "pinning" approach: pinning user pages and
       creating a BVEC to ensure the full copy can proceed atomically.
       This adds complexity to the write path.
     - Discussion: Is this acceptable? Should we consider alternatives,
       such as requiring userspace to mlock the I/O buffers before
       issuing the write to guarantee atomic copy in the page cache?

- Page Cache Model vs. Filesystem CoW: The current RFC introduces a
   PG_atomic page flag to track dirty pages requiring atomic writeback.
   This faced pushback because page flags are a scarce resource[7].
   Furthermore, it was argued that the atomic write model does not fit
   the buffered I/O model, because data sitting in the page cache is
   vulnerable to modification before writeback occurs, and writeback
   does not preserve application ordering[8].
     - Dave Chinner has proposed leveraging the filesystem's CoW path,
       where we always allocate new blocks for the atomic write (forced
       CoW). If the hardware supports it (e.g., NVMe atomic limits), the
       filesystem can optimize the writeback to use REQ_ATOMIC in place,
       avoiding the CoW overhead while maintaining the architectural
       separation.
     - Discussion: While the CoW approach fits XFS and other CoW
       filesystems well, it presents challenges for filesystems like ext4
       which lack CoW capabilities for data. Should this be a
       filesystem-specific feature?
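
To make the mlock alternative from the first item concrete, here is a
rough userspace sketch: allocate a page-aligned buffer, pin it with
mlock(2) so the copy into the page cache cannot fault on it, then issue
the write with RWF_ATOMIC. The helper names are hypothetical. Note that
this shows only what the calling convention might look like: as far as I
know, current mainline kernels accept RWF_ATOMIC only for direct I/O, so
on a buffered file descriptor the write is expected to be rejected.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for pwritev2() */
#endif
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* value from the Linux UAPI <linux/fs.h> */
#endif

/*
 * Hypothetical helper: allocate a zeroed, page-aligned buffer and pin it
 * in RAM. Returns NULL with errno set on failure (for example when
 * RLIMIT_MEMLOCK is too small).
 */
static void *alloc_locked_buf(size_t len)
{
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);
    int err;

    if (page <= 0) {
        errno = EINVAL;
        return NULL;
    }
    err = posix_memalign(&buf, (size_t)page, len);
    if (err != 0) {
        errno = err;
        return NULL;
    }
    memset(buf, 0, len);        /* touch every page before locking */
    if (mlock(buf, len) != 0) {
        int saved = errno;
        free(buf);
        errno = saved;
        return NULL;
    }
    return buf;
}

/* Hypothetical helper: unpin and free a buffer from alloc_locked_buf(). */
static void free_locked_buf(void *buf, size_t len)
{
    munlock(buf, len);
    free(buf);
}

/* Hypothetical helper: one untorn write of `len` bytes at offset `off`. */
static ssize_t write_untorn(int fd, const void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}
```

The open question from the list still applies: this moves the burden to
userspace (and to RLIMIT_MEMLOCK) instead of pinning pages in the write
path.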

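As I read the forced-CoW proposal in the second item, the writeback
decision could be sketched roughly as below. Every name here is
hypothetical and none of it exists in the kernel. The hardware check
assumes the same power-of-two and natural-alignment rules that apply to
RWF_ATOMIC with direct I/O today.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical and exist only for illustration. */
enum wb_strategy {
    WB_REQ_ATOMIC_IN_PLACE, /* hardware guarantees the untorn write */
    WB_FORCED_COW,          /* write to new blocks, then switch over */
    WB_UNSUPPORTED,         /* neither mechanism is available */
};

struct atomic_caps {
    uint32_t hw_unit_min;   /* 0 means no hardware atomic write support */
    uint32_t hw_unit_max;
    bool fs_has_data_cow;   /* e.g. true for XFS, false for ext4 */
};

/* Can the device write this range untorn without filesystem help? */
static bool hw_can_write_untorn(const struct atomic_caps *c,
                                uint64_t off, uint64_t len)
{
    if (c->hw_unit_min == 0 || len < c->hw_unit_min || len > c->hw_unit_max)
        return false;
    if ((len & (len - 1)) != 0)     /* len is nonzero at this point */
        return false;
    return (off & (len - 1)) == 0;  /* naturally aligned */
}

/* Prefer the in-place hardware path and fall back to forced CoW. */
static enum wb_strategy pick_wb_strategy(const struct atomic_caps *c,
                                         uint64_t off, uint64_t len)
{
    if (hw_can_write_untorn(c, off, len))
        return WB_REQ_ATOMIC_IN_PLACE;
    if (c->fs_has_data_cow)
        return WB_FORCED_COW;
    return WB_UNSUPPORTED;
}
```

In this model a filesystem without data CoW can only offer untorn
buffered writes within the hardware limits, which is the ext4 concern
raised above.
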
Comments or Curses, all are welcome.

--
Pankaj

[1] https://lwn.net/Articles/1009298/
[2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html
[3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/
[4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
[5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
[6] https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/
[7] https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/
[8] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/