* [LSF/MM/BPF TOPIC] Buffered atomic writes
@ 2026-02-13 10:20 Pankaj Raghav
2026-02-13 13:32 ` Ojaswin Mujoo
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Pankaj Raghav @ 2026-02-13 10:20 UTC (permalink / raw)
To: linux-xfs, linux-mm, linux-fsdevel, lsf-pc
Cc: Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
Hi all,
Atomic (untorn) writes for Direct I/O have successfully landed in the kernel
for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
remains a contentious topic, with previous discussions often stalling
due to concerns about complexity versus utility.
I would like to propose a session to discuss the concrete use cases for
buffered atomic writes and, if possible, talk about the outstanding
architectural issues blocking the current RFCs[3][4].
## Use Case:
A recurring objection to buffered atomics is the lack of a convincing
use case, with the argument that databases should simply migrate to
direct I/O. We have been working with PostgreSQL developer Andres
Freund, who has highlighted a specific architectural requirement where
buffered I/O remains preferable in certain scenarios.
While Postgres recently started to support direct I/O, optimal
performance requires a large, statically configured user-space buffer
pool. This becomes problematic when running many Postgres instances on
the same hardware, a common deployment scenario. Statically partitioning
RAM for direct I/O caches across many instances is inefficient compared
to allowing the kernel page cache to dynamically balance memory pressure
between instances.
The other use case is using postgres as part of a larger workload on one
instance. Using up enough memory for postgres' buffer pool to make DIO
use viable is often not realistic, because some deployments require a
lot of memory to cache database IO, while others need a lot of memory
for non-database caching.
Enabling atomic writes for this buffered workload would allow Postgres
to disable full-page writes [5]. For direct I/O, this has been shown to
reduce transaction variability; for buffered I/O, we expect similar
gains, alongside decreased WAL bandwidth and storage costs for WAL
archival. As a side note, for most workloads full-page writes occupy a
significant portion of WAL volume.
Andres has agreed to attend LSFMM this year to discuss these requirements.
## Discussion:
We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
Based on the previous conversations and blockers, the discussion at
LSFMM should focus on the following issues:
- Handling Short Writes under Memory Pressure[6]: A buffered atomic
write might span page boundaries. If memory pressure causes a page
fault or reclaim mid-copy, the write could be torn inside the page
cache before it even reaches the filesystem.
- The current RFC uses a "pinning" approach: pinning user pages and
creating a BVEC to ensure the full copy can proceed atomically.
This adds complexity to the write path.
- Discussion: Is this acceptable? Should we consider alternatives,
such as requiring userspace to mlock the I/O buffers before
issuing the write to guarantee an atomic copy into the page cache?
(A minimal userspace sketch of this follows the list below.)
- Page Cache Model vs. Filesystem CoW: The current RFC introduces a
PG_atomic page flag to track dirty pages requiring atomic writeback.
This faced pushback due to page flags being a scarce resource[7].
Furthermore, it was argued that the atomic model does not fit the
buffered I/O model because data sitting in the page cache is vulnerable
to modification before writeback occurs, and writeback does not preserve
application ordering[8].
- Dave Chinner has proposed leveraging the filesystem's CoW path
where we always allocate new blocks for the atomic write (forced
CoW). If the hardware supports it (e.g., NVMe atomic limits), the
filesystem can optimize the writeback to use REQ_ATOMIC in place,
avoiding the CoW overhead while maintaining the architectural
separation.
- Discussion: While the CoW approach fits XFS and other CoW
filesystems well, it presents challenges for filesystems like ext4
which lack CoW capabilities for data. Should this be a filesystem
specific feature?
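As a concrete illustration of the mlock alternative raised in the first
discussion point above, here is a minimal userspace sketch of what a buffered
untorn write could look like. This is an illustration only: RWF_ATOMIC is the
existing pwritev2() flag used by the Direct I/O support (it needs recent
kernel/libc headers), the kernel is expected to reject it on a non-O_DIRECT
file today, and the mlock() step is the hypothetical requirement discussed
above, not an existing rule.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Sketch only, not a posted patch. Assumes the write length and offset
 * fit the limits reported by statx(STATX_WRITE_ATOMIC).
 */
static ssize_t untorn_buffered_write(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	/* Hypothetical requirement: lock the source buffer so the copy
	 * into the page cache cannot be cut short by a page fault. */
	if (mlock(buf, len) < 0)
		return -1;

	/* One naturally aligned, DB-page-sized write; the kernel would
	 * have to write it back untorn (REQ_ATOMIC or forced CoW). */
	ret = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

	munlock(buf, len);
	return ret;
}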
Comments or Curses, all are welcome.
--
Pankaj
[1] https://lwn.net/Articles/1009298/
[2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html
[3]
https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/
[4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
[5]
https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
[6]
https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/
[7]
https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/
[8]
https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav
@ 2026-02-13 13:32 ` Ojaswin Mujoo
2026-02-16 9:52 ` Pankaj Raghav
2026-02-16 11:38 ` Jan Kara
2026-02-15 9:01 ` Amir Goldstein
` (2 subsequent siblings)
3 siblings, 2 replies; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-13 13:32 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund,
djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> Hi all,
>
> Atomic (untorn) writes for Direct I/O have successfully landed in kernel
> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
> remains a contentious topic, with previous discussions often stalling due to
> concerns about complexity versus utility.
>
> I would like to propose a session to discuss the concrete use cases for
> buffered atomic writes and if possible, talk about the outstanding
> architectural blockers blocking the current RFCs[3][4].
Hi Pankaj,
Thanks for the proposal and glad to hear there is wider interest in
this topic. We have also been actively working on this, and I am in the
middle of testing and ironing out bugs in my RFC v2 for buffered atomic
writes, which is largely based on Dave's suggestion to maintain atomic
write mappings in the FS layer (aka the XFS COW fork). In fact, I was
going to propose a discussion on this myself :)
>
> ## Use Case:
>
> A recurring objection to buffered atomics is the lack of a convincing use
> case, with the argument that databases should simply migrate to direct I/O.
> We have been working with PostgreSQL developer Andres Freund, who has
> highlighted a specific architectural requirement where buffered I/O remains
> preferable in certain scenarios.
Looks like you have some nice insights to cover from the postgres side,
which the filesystem community has been asking for. As I have also been
working on the kernel implementation side of it, do you think we could
do a joint session on this topic?
>
> While Postgres recently started to support direct I/O, optimal performance
> requires a large, statically configured user-space buffer pool. This becomes
> problematic when running many Postgres instances on the same hardware, a
> common deployment scenario. Statically partitioning RAM for direct I/O
> caches across many instances is inefficient compared to allowing the kernel
> page cache to dynamically balance memory pressure between instances.
>
> The other use case is using postgres as part of a larger workload on one
> instance. Using up enough memory for postgres' buffer pool to make DIO use
> viable is often not realistic, because some deployments require a lot of
> memory to cache database IO, while others need a lot of memory for
> non-database caching.
>
> Enabling atomic writes for this buffered workload would allow Postgres to
> disable full-page writes [5]. For direct I/O, this has shown to reduce
> transaction variability; for buffered I/O, we expect similar gains,
> alongside decreased WAL bandwidth and storage costs for WAL archival. As a
> side note, for most workloads full page writes occupy a significant portion
> of WAL volume.
>
> Andres has agreed to attend LSFMM this year to discuss these requirements.
Glad to hear people from postgres would also be joining!
>
> ## Discussion:
>
> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
> Based on the conversation/blockers we had before, the discussion at LSFMM
> should focus on the following blocking issues:
>
> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
> write might span page boundaries. If memory pressure causes a page
> fault or reclaim mid-copy, the write could be torn inside the page
> cache before it even reaches the filesystem.
> - The current RFC uses a "pinning" approach: pinning user pages and
> creating a BVEC to ensure the full copy can proceed atomically.
> This adds complexity to the write path.
> - Discussion: Is this acceptable? Should we consider alternatives,
> such as requiring userspace to mlock the I/O buffers before
> issuing the write to guarantee atomic copy in the page cache?
Right, I chose this approach because we only get to know about the short
copy after it has actually happened in copy_folio_from_iter_atomic(),
and it seemed simpler to just not let the short copy happen. This is
inspired by how dio pins the pages for DMA, just that we do it
for a shorter time.
It does add slight complexity to the path, but I'm not sure if it's
complex enough to justify adding a hard requirement of having the pages
mlock'd.
>
> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
> PG_atomic page flag to track dirty pages requiring atomic writeback.
> This faced pushback due to page flags being a scarce resource[7].
> Furthermore, it was argued that atomic model does not fit the buffered
> I/O model because data sitting in the page cache is vulnerable to
> modification before writeback occurs, and writeback does not preserve
> application ordering[8].
> - Dave Chinner has proposed leveraging the filesystem's CoW path
> where we always allocate new blocks for the atomic write (forced
> CoW). If the hardware supports it (e.g., NVMe atomic limits), the
> filesystem can optimize the writeback to use REQ_ATOMIC in place,
> avoiding the CoW overhead while maintaining the architectural
> separation.
Right, this is what I'm doing in the new RFC, where we maintain the
mappings for atomic writes in the COW fork. This way we are able to
utilize a lot of existing infrastructure; however, it does add some
complexity to the ->iomap_begin() and ->writeback_range() callbacks of
the FS. I believe it is a tradeoff, since the general consensus was
mostly to avoid adding too much complexity to the iomap layer.
Another thing that came up is to consider using write-through semantics
for buffered atomic writes, where we transition the page to the
writeback state immediately after the write and prevent any other users
from modifying the data till writeback completes. This might affect
performance since we won't be able to batch similar atomic IOs, but
maybe applications like postgres would not mind this too much. If we go
with this approach, we will be able to avoid worrying too much about
other users changing atomic data underneath us.
An argument against this, however, is that it is the user's
responsibility not to do non-atomic IO over an atomic range, and doing
so shall be considered a userspace usage error. This is similar to how
users can tear a dio if they perform overlapping writes [1].
That being said, I think these points are worth discussing and it would
be helpful to have people from postgres around while discussing these
semantics with the FS community members.
As for ordering of writes, I'm not sure that is something we should
guarantee via the RWF_ATOMIC API. Ensuring ordering has mostly been the
task of userspace via fsync() and friends.
[1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
> - Discussion: While the CoW approach fits XFS and other CoW
> filesystems well, it presents challenges for filesystems like ext4
> which lack CoW capabilities for data. Should this be a filesystem
> specific feature?
I believe your question is whether we should have a hard dependency on
COW mappings for atomic writes. Currently, COW in the atomic write
context in XFS is used for these 2 things:
1. COW fork holds atomic write ranges.
This is not strictly a COW feature; we are just repurposing the COW
fork to hold our atomic ranges. Basically, it is a way for the writeback
path to know that an atomic write was done here.
The COW fork is one way to do this, but I believe every FS has a version
of in-memory extent trees where such ephemeral atomic write mappings can
be held. The extent status cache is ext4's version of this, and it can
be used to manage the atomic write ranges.
An alternate suggestion that came up from discussions with Ted and
Darrick is that we could instead use a generic side-car structure which
holds atomic write ranges. FSes can populate it during atomic writes and
query it in their writeback paths (a rough sketch of such a structure
appears further below).
This means that for any FS operation (think truncate, falloc, mwrite,
write ...) we would need to keep this structure in sync, which can
become pretty complex pretty fast. I have yet to implement this, so I'm
not sure how it would look in practice.
2. The COW feature as a whole enables software-based atomic writes.
This is something that ext4 won't be able to support (right now), just
like how we don't support software atomic writes for dio.
I believe Baokun and Yi are working on a feature that can eventually
enable COW writes in ext4 [2]. Till we have something like that, we
would have to rely on hardware support.
Regardless, the ability to support software atomic writes largely
depends on the filesystem, so I'm not sure how we could lift this up to
a generic layer anyway.
[2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/
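Purely as an illustration of the side-car idea mentioned above (hypothetical,
not from any posted RFC), the structure itself could be as simple as the
following; the real complexity is the locking and keeping it in sync with
truncate/falloc/mwrite/etc., which this sketch ignores.

#include <stdbool.h>
#include <stdlib.h>

/* One recorded atomic write over [start, end) in a file. */
struct atomic_range {
	long long start;
	long long end;
	struct atomic_range *next;
};

/* Per-inode side-car: a list of ranges written with RWF_ATOMIC. */
struct atomic_sidecar {
	struct atomic_range *head;
};

/* Called from the (buffered) atomic write path to record a range. */
static int sidecar_add(struct atomic_sidecar *sc, long long start,
		       long long end)
{
	struct atomic_range *r = malloc(sizeof(*r));

	if (!r)
		return -1;
	r->start = start;
	r->end = end;
	r->next = sc->head;
	sc->head = r;
	return 0;
}

/* Called from writeback to ask whether [start, end) must go out untorn. */
static bool sidecar_overlaps(const struct atomic_sidecar *sc,
			     long long start, long long end)
{
	for (const struct atomic_range *r = sc->head; r; r = r->next)
		if (start < r->end && r->start < end)
			return true;
	return false;
}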
Thanks,
Ojaswin
>
> Comments or Curses, all are welcome.
>
> --
> Pankaj
>
> [1] https://lwn.net/Articles/1009298/
> [2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html
> [3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/
> [4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
> [5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
> [6]
> https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/
> [7]
> https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/
> [8]
> https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/
>
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav
2026-02-13 13:32 ` Ojaswin Mujoo
@ 2026-02-15 9:01 ` Amir Goldstein
2026-02-17 5:51 ` Christoph Hellwig
2026-02-20 10:08 ` Pankaj Raghav (Samsung)
3 siblings, 0 replies; 38+ messages in thread
From: Amir Goldstein @ 2026-02-15 9:01 UTC (permalink / raw)
To: Pankaj Raghav, Andres Freund
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry,
willy, hch, ritesh.list, jack, ojaswin, Luis Chamberlain,
dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> Hi all,
>
> Atomic (untorn) writes for Direct I/O have successfully landed in kernel
> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
> remains a contentious topic, with previous discussions often stalling due to
> concerns about complexity versus utility.
>
> I would like to propose a session to discuss the concrete use cases for
> buffered atomic writes and if possible, talk about the outstanding
> architectural blockers blocking the current RFCs[3][4].
>
> ## Use Case:
>
> A recurring objection to buffered atomics is the lack of a convincing use
> case, with the argument that databases should simply migrate to direct I/O.
> We have been working with PostgreSQL developer Andres Freund, who has
> highlighted a specific architectural requirement where buffered I/O remains
> preferable in certain scenarios.
>
> While Postgres recently started to support direct I/O, optimal performance
> requires a large, statically configured user-space buffer pool. This becomes
> problematic when running many Postgres instances on the same hardware, a
> common deployment scenario. Statically partitioning RAM for direct I/O
> caches across many instances is inefficient compared to allowing the kernel
> page cache to dynamically balance memory pressure between instances.
>
> The other use case is using postgres as part of a larger workload on one
> instance. Using up enough memory for postgres' buffer pool to make DIO use
> viable is often not realistic, because some deployments require a lot of
> memory to cache database IO, while others need a lot of memory for
> non-database caching.
>
> Enabling atomic writes for this buffered workload would allow Postgres to
> disable full-page writes [5]. For direct I/O, this has shown to reduce
> transaction variability; for buffered I/O, we expect similar gains,
> alongside decreased WAL bandwidth and storage costs for WAL archival. As a
> side note, for most workloads full page writes occupy a significant portion
> of WAL volume.
>
> Andres has agreed to attend LSFMM this year to discuss these requirements.
>
Andres,
If you wish to attend LSFMM, please request an invite via the Google
form:
https://forms.gle/hUgiEksr8CA1migCA
Thanks,
Amir.
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-13 13:32 ` Ojaswin Mujoo
@ 2026-02-16 9:52 ` Pankaj Raghav
2026-02-16 15:45 ` Andres Freund
2026-02-17 17:20 ` Ojaswin Mujoo
2026-02-16 11:38 ` Jan Kara
1 sibling, 2 replies; 38+ messages in thread
From: Pankaj Raghav @ 2026-02-16 9:52 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund,
djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On 2/13/26 14:32, Ojaswin Mujoo wrote:
> On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
>> Hi all,
>>
>> Atomic (untorn) writes for Direct I/O have successfully landed in kernel
>> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
>> remains a contentious topic, with previous discussions often stalling due to
>> concerns about complexity versus utility.
>>
>> I would like to propose a session to discuss the concrete use cases for
>> buffered atomic writes and if possible, talk about the outstanding
>> architectural blockers blocking the current RFCs[3][4].
>
> Hi Pankaj,
>
> Thanks for the proposal and glad to hear there is a wider interest in
> this topic. We have also been actively working on this and I in middle
> of testing and ironing out bugs in my RFC v2 for buffered atomic
> writes, which is largely based on Dave's suggestions to maintain atomic
> write mappings in FS layer (aka XFS COW fork). Infact I was going to
> propose a discussion on this myself :)
>
Perfect.
>>
>> ## Use Case:
>>
>> A recurring objection to buffered atomics is the lack of a convincing use
>> case, with the argument that databases should simply migrate to direct I/O.
>> We have been working with PostgreSQL developer Andres Freund, who has
>> highlighted a specific architectural requirement where buffered I/O remains
>> preferable in certain scenarios.
>
> Looks like you have some nice insights to cover from postgres side which
> filesystem community has been asking for. As I've also been working on
> the kernel implementation side of it, do you think we could do a joint
> session on this topic?
>
As one of the main pushbacks against this feature has been the lack of a valid
use case, the main outcome I would like to get out of this session is community
consensus on the use case for this feature.
It looks like you have already made quite a bit of progress with the CoW
implementation, so it would be great if it could be a joint session.
>> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
>> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
>> Based on the conversation/blockers we had before, the discussion at LSFMM
>> should focus on the following blocking issues:
>>
>> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
>> write might span page boundaries. If memory pressure causes a page
>> fault or reclaim mid-copy, the write could be torn inside the page
>> cache before it even reaches the filesystem.
>> - The current RFC uses a "pinning" approach: pinning user pages and
>> creating a BVEC to ensure the full copy can proceed atomically.
>> This adds complexity to the write path.
>> - Discussion: Is this acceptable? Should we consider alternatives,
>> such as requiring userspace to mlock the I/O buffers before
>> issuing the write to guarantee atomic copy in the page cache?
>
> Right, I chose this approach because we only get to know about the short
> copy after it has actually happened in copy_folio_from_iter_atomic()
> and it seemed simpler to just not let the short copy happen. This is
> inspired from how dio pins the pages for DMA, just that we do it
> for a shorter time.
>
> It does add slight complexity to the path but I'm not sure if it's complex
> enough to justify adding a hard requirement of having pages mlock'd.
>
As databases like postgres have a buffer cache that they manage in userspace,
which is eventually used to do IO, I am wondering if they already do an mlock
or use some other way to guarantee the buffer cache does not get reclaimed. That
is why I was wondering whether we could make it a requirement. Of course, that
also requires checking whether the range is mlocked in the iomap_write_iter path.
>>
>> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
>> PG_atomic page flag to track dirty pages requiring atomic writeback.
>> This faced pushback due to page flags being a scarce resource[7].
>> Furthermore, it was argued that atomic model does not fit the buffered
>> I/O model because data sitting in the page cache is vulnerable to
>> modification before writeback occurs, and writeback does not preserve
>> application ordering[8].
>> - Dave Chinner has proposed leveraging the filesystem's CoW path
>> where we always allocate new blocks for the atomic write (forced
>> CoW). If the hardware supports it (e.g., NVMe atomic limits), the
>> filesystem can optimize the writeback to use REQ_ATOMIC in place,
>> avoiding the CoW overhead while maintaining the architectural
>> separation.
>
> Right, this is what I'm doing in the new RFC where we maintain the
> mappings for atomic write in COW fork. This way we are able to utilize a
> lot of existing infrastructure, however it does add some complexity to
> ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe
> it is a tradeoff since the general consesus was mostly to avoid adding
> too much complexity to iomap layer.
>
> Another thing that came up is to consider using write through semantics
> for buffered atomic writes, where we are able to transition page to
> writeback state immediately after the write and avoid any other users to
> modify the data till writeback completes. This might affect performance
> since we won't be able to batch similar atomic IOs but maybe
> applications like postgres would not mind this too much. If we go with
> this approach, we will be able to avoid worrying too much about other
> users changing atomic data underneath us.
>
Hmm, IIUC, postgres will write out its dirty buffer cache by combining multiple DB
pages based on `io_combine_limit` (typically 128kb). So immediately writing them
might be OK as long as we don't remove those pages from the page cache like we do
with RWF_UNCACHED.
> An argument against this however is that it is user's responsibility to
> not do non atomic IO over an atomic range and this shall be considered a
> userspace usage error. This is similar to how there are ways users can
> tear a dio if they perform overlapping writes. [1].
>
> That being said, I think these points are worth discussing and it would
> be helpful to have people from postgres around while discussing these
> semantics with the FS community members.
>
> As for ordering of writes, I'm not sure if that is something that
> we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly
> been the task of userspace via fsync() and friends.
>
Agreed.
>
> [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
>
>> - Discussion: While the CoW approach fits XFS and other CoW
>> filesystems well, it presents challenges for filesystems like ext4
>> which lack CoW capabilities for data. Should this be a filesystem
>> specific feature?
>
> I believe your question is if we should have a hard dependency on COW
> mappings for atomic writes. Currently, COW in atomic write context in
> XFS, is used for these 2 things:
>
> 1. COW fork holds atomic write ranges.
>
> This is not strictly a COW feature, just that we are repurposing the COW
> fork to hold our atomic ranges. Basically a way for writeback path to
> know that atomic write was done here.
>
> COW fork is one way to do this but I believe every FS has a version of
> in memory extent trees where such ephemeral atomic write mappings can be
> held. The extent status cache is ext4's version of this, and can be used
> to manage the atomic write ranges.
>
> There is an alternate suggestion that came up from discussions with Ted
> and Darrick that we can instead use a generic side-car structure which
> holds atomic write ranges. FSes can populate these during atomic writes
> and query these in their writeback paths.
>
> This means for any FS operation (think truncate, falloc, mwrite, write
> ...) we would need to keep this structure in sync, which can become pretty
> complex pretty fast. I'm yet to implement this so not sure how it would
> look in practice though.
>
> 2. COW feature as a whole enables software based atomic writes.
>
> This is something that ext4 won't be able to support (right now), just
> like how we don't support software writes for dio.
>
> I believe Baokun and Yi and working on a feature that can eventually
> enable COW writes in ext4 [2]. Till we have something like that, we
> would have to rely on hardware support.
>
> Regardless, I don't think the ability to support or not support
> software atomic writes largely depends on the filesystem so I'm not
> sure how we can lift this up to a generic layer anyways.
>
> [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/
>
Thanks for the explanation. I am also planning to take a shot at the CoW approach. I would
be more than happy to review and test if you send an RFC in the meantime.
--
Pankaj
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-13 13:32 ` Ojaswin Mujoo
2026-02-16 9:52 ` Pankaj Raghav
@ 2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
` (2 more replies)
1 sibling, 3 replies; 38+ messages in thread
From: Jan Kara @ 2026-02-16 11:38 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev,
tytso, p.raghav, vi.shah
Hi!
On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> Another thing that came up is to consider using write through semantics
> for buffered atomic writes, where we are able to transition page to
> writeback state immediately after the write and avoid any other users to
> modify the data till writeback completes. This might affect performance
> since we won't be able to batch similar atomic IOs but maybe
> applications like postgres would not mind this too much. If we go with
> this approach, we will be able to avoid worrying too much about other
> users changing atomic data underneath us.
>
> An argument against this however is that it is user's responsibility to
> not do non atomic IO over an atomic range and this shall be considered a
> userspace usage error. This is similar to how there are ways users can
> tear a dio if they perform overlapping writes. [1].
Yes, I was wondering whether the write-through semantics would make sense
as well. Intuitively it should make things simpler because you could
practically reuse the atomic DIO write path, only that you'd first copy
data into the page cache and issue the dio write from those folios. No
need for special tracking of which folios actually belong together in an
atomic write, no need to clutter the standard folio writeback path, and
in case the atomic write cannot happen (e.g. because you cannot allocate
appropriately aligned blocks) you get the error back right away, ...
Of course this all depends on whether such semantics would actually be
useful for users such as PostgreSQL.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 11:38 ` Jan Kara
@ 2026-02-16 13:18 ` Pankaj Raghav
2026-02-17 18:36 ` Ojaswin Mujoo
2026-02-16 15:57 ` Andres Freund
2026-02-17 18:39 ` Ojaswin Mujoo
2 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav @ 2026-02-16 13:18 UTC (permalink / raw)
To: Jan Kara, Ojaswin Mujoo
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund,
djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain,
dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On 2/16/2026 12:38 PM, Jan Kara wrote:
> Hi!
>
> On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
>> Another thing that came up is to consider using write through semantics
>> for buffered atomic writes, where we are able to transition page to
>> writeback state immediately after the write and avoid any other users to
>> modify the data till writeback completes. This might affect performance
>> since we won't be able to batch similar atomic IOs but maybe
>> applications like postgres would not mind this too much. If we go with
>> this approach, we will be able to avoid worrying too much about other
>> users changing atomic data underneath us.
>>
>> An argument against this however is that it is user's responsibility to
>> not do non atomic IO over an atomic range and this shall be considered a
>> userspace usage error. This is similar to how there are ways users can
>> tear a dio if they perform overlapping writes. [1].
>
> Yes, I was wondering whether the write-through semantics would make sense
> as well. Intuitively it should make things simpler because you could
> practially reuse the atomic DIO write path. Only that you'd first copy
> data into the page cache and issue dio write from those folios. No need for
> special tracking of which folios actually belong together in atomic write,
> no need for cluttering standard folio writeback path, in case atomic write
> cannot happen (e.g. because you cannot allocate appropriately aligned
> blocks) you get the error back rightaway, ...
>
> Of course this all depends on whether such semantics would be actually
> useful for users such as PostgreSQL.
One issue might be the performance, especially if the max atomic unit is on the
smaller end, such as 16k or 32k (which is fairly common). But it will avoid the
overlapping writes issue and can easily leverage the direct IO path.
One thing that postgres really cares about, though, is the integrity of a
database block. So if there is an IO that is a multiple of the atomic write unit
(one atomic unit encapsulates a whole DB page), it is not a problem if tearing
happens on atomic unit boundaries. This fits very well with what NVMe calls
Multiple Atomicity Mode (MAM) [1].
We don't have any semantics for MAM at the moment, but it could increase
performance, as we can do larger IOs while still getting the atomic guarantees
certain applications care about.
[1]
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf
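As a worked example of the above (no kernel interface for MAM exists today,
so this is purely illustrative): with a 16k atomic write unit, a 128k combined
write covers 8 DB pages and may tear, but only at 16k boundaries, so each
individual DB page stays untorn. The guarantee holds whenever the I/O is
unit-aligned:

#include <stdbool.h>

/* Illustrative only: a combined write gets the MAM-style per-unit
 * guarantee when its offset and length are whole multiples of the
 * atomic write unit, so any tear falls on a unit boundary. */
static bool tears_only_on_unit_boundaries(long long offset, long long len,
					  long long atomic_unit)
{
	return atomic_unit > 0 &&
	       offset % atomic_unit == 0 &&
	       len % atomic_unit == 0;
}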
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 9:52 ` Pankaj Raghav
@ 2026-02-16 15:45 ` Andres Freund
2026-02-17 12:06 ` Jan Kara
2026-02-17 18:33 ` Ojaswin Mujoo
2026-02-17 17:20 ` Ojaswin Mujoo
1 sibling, 2 replies; 38+ messages in thread
From: Andres Freund @ 2026-02-16 15:45 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Hi,
On 2026-02-16 10:52:35 +0100, Pankaj Raghav wrote:
> On 2/13/26 14:32, Ojaswin Mujoo wrote:
> > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
> >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
> >> Based on the conversation/blockers we had before, the discussion at LSFMM
> >> should focus on the following blocking issues:
> >>
> >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
> >> write might span page boundaries. If memory pressure causes a page
> >> fault or reclaim mid-copy, the write could be torn inside the page
> >> cache before it even reaches the filesystem.
> >> - The current RFC uses a "pinning" approach: pinning user pages and
> >> creating a BVEC to ensure the full copy can proceed atomically.
> >> This adds complexity to the write path.
> >> - Discussion: Is this acceptable? Should we consider alternatives,
> >> such as requiring userspace to mlock the I/O buffers before
> >> issuing the write to guarantee atomic copy in the page cache?
> >
> > Right, I chose this approach because we only get to know about the short
> > copy after it has actually happened in copy_folio_from_iter_atomic()
> > and it seemed simpler to just not let the short copy happen. This is
> > inspired from how dio pins the pages for DMA, just that we do it
> > for a shorter time.
> >
> > It does add slight complexity to the path but I'm not sure if it's complex
> > enough to justify adding a hard requirement of having pages mlock'd.
> >
>
> As databases like postgres have a buffer cache that they manage in userspace,
> which is eventually used to do IO, I am wondering if they already do a mlock
> or some other way to guarantee the buffer cache does not get reclaimed. That is
> why I was thinking if we could make it a requirement. Of course, that also requires
> checking if the range is mlocked in the iomap_write_iter path.
We don't generally mlock our buffer pool - but we strongly recommend using
explicit huge pages (due to TLB pressure, faster fork() and less memory wasted
on page tables), which afaict has basically the same effect. However, that
doesn't make the page cache pages locked...
> >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
> >> PG_atomic page flag to track dirty pages requiring atomic writeback.
> >> This faced pushback due to page flags being a scarce resource[7].
> >> Furthermore, it was argued that atomic model does not fit the buffered
> >> I/O model because data sitting in the page cache is vulnerable to
> >> modification before writeback occurs, and writeback does not preserve
> >> application ordering[8].
> >> - Dave Chinner has proposed leveraging the filesystem's CoW path
> >> where we always allocate new blocks for the atomic write (forced
> >> CoW). If the hardware supports it (e.g., NVMe atomic limits), the
> >> filesystem can optimize the writeback to use REQ_ATOMIC in place,
> >> avoiding the CoW overhead while maintaining the architectural
> >> separation.
> >
> > Right, this is what I'm doing in the new RFC where we maintain the
> > mappings for atomic write in COW fork. This way we are able to utilize a
> > lot of existing infrastructure, however it does add some complexity to
> > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe
> > it is a tradeoff since the general consesus was mostly to avoid adding
> > too much complexity to iomap layer.
> >
> > Another thing that came up is to consider using write through semantics
> > for buffered atomic writes, where we are able to transition page to
> > writeback state immediately after the write and avoid any other users to
> > modify the data till writeback completes. This might affect performance
> > since we won't be able to batch similar atomic IOs but maybe
> > applications like postgres would not mind this too much. If we go with
> > this approach, we will be able to avoid worrying too much about other
> > users changing atomic data underneath us.
> >
>
> Hmm, IIUC, postgres will write their dirty buffer cache by combining
> multiple DB pages based on `io_combine_limit` (typically 128kb).
We will try to do that, but it's obviously far from always possible; in some
workloads [parts of] the data in the buffer pool will rarely be dirtied in
consecutive blocks.
FWIW, postgres already tries to force some just-written pages into
writeback. For sources of writes that can be plentiful and are done in the
background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE),
after 256kB-512kB of writes, as otherwise foreground latency can be
significantly impacted by the kernel deciding to suddenly write back (due to
dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise
the fsyncs at the end of a checkpoint can be unpredictably slow. For
foreground writes we do not default to that, as there are users that won't
(because they don't know, because they overcommit hardware, ...) size
postgres' buffer pool to be big enough and thus will often re-dirty pages that
have already recently been written out to the operating system. But for many
workloads it's recommended that users turn on
sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*).
So for many workloads it'd be fine to just always start writeback for atomic
writes immediately. It's possible, though I am not at all sure, that for most
of the other workloads the gains from atomic writes will outstrip the cost of
writing data back more frequently.
(*) As it turns out, it often seems to improve write throughput as well: if
writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
Linux often seems to trigger a lot more small random IO.
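For readers less familiar with this pattern, here is a rough sketch of the
write-and-hint loop described above (not PostgreSQL source; the fixed
threshold and the assumption that the accumulated writes are contiguous are
simplifications):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define WRITEBACK_THRESHOLD (256 * 1024)	/* 256kB, as above */

static off_t pending_off;	/* start of the not-yet-hinted region */
static off_t pending_len;	/* bytes written since the last hint */

/* Write a block and, once enough has accumulated, ask the kernel to start
 * writing it back. SYNC_FILE_RANGE_WRITE only initiates writeback; it does
 * not wait for it and gives no durability guarantee (it is not fsync()). */
static void write_with_writeback_hint(int fd, const void *buf, size_t len,
				      off_t off)
{
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return;				/* error handling elided */

	pending_len += len;
	if (pending_len >= WRITEBACK_THRESHOLD) {
		sync_file_range(fd, pending_off, pending_len,
				SYNC_FILE_RANGE_WRITE);
		pending_off += pending_len;
		pending_len = 0;
	}
}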
> So immediately writing them might be ok as long as we don't remove those
> pages from the page cache like we do in RWF_UNCACHED.
Yes, it might. I have actually often wished for something like an
RWF_WRITEBACK flag...
> > An argument against this however is that it is user's responsibility to
> > not do non atomic IO over an atomic range and this shall be considered a
> > userspace usage error. This is similar to how there are ways users can
> > tear a dio if they perform overlapping writes. [1].
Hm, the scope of the prohibition here is not clear to me. Would it just
be forbidden to do:
P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC)
P2: pwrite(fd, [any block in 1-10]), non-atomically
P1: complete pwritev(fd, ...)
or is it also forbidden to do:
P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
Kernel: starts writeback but doesn't complete it
P1: pwrite(fd, [any block in 1-10]), non-atomically
Kernel: completes writeback
The former is not at all an issue for postgres' use case: the pages in our
buffer pool that are undergoing IO are locked, preventing additional IO (be it
reads or writes) to those blocks.
The latter would be a problem, since userspace wouldn't even know that there is
still "atomic writeback" going on; afaict the only way we could avoid it would
be to issue an f[data]sync(), which likely would be prohibitively expensive.
> > That being said, I think these points are worth discussing and it would
> > be helpful to have people from postgres around while discussing these
> > semantics with the FS community members.
> >
> > As for ordering of writes, I'm not sure if that is something that
> > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly
> > been the task of userspace via fsync() and friends.
> >
>
> Agreed.
From postgres' side that's fine. In the cases where we care about ordering, we
use fsync() already.
> > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
> >
> >> - Discussion: While the CoW approach fits XFS and other CoW
> >> filesystems well, it presents challenges for filesystems like ext4
> >> which lack CoW capabilities for data. Should this be a filesystem
> >> specific feature?
> >
> > I believe your question is if we should have a hard dependency on COW
> > mappings for atomic writes. Currently, COW in atomic write context in
> > XFS, is used for these 2 things:
> >
> > 1. COW fork holds atomic write ranges.
> >
> > This is not strictly a COW feature, just that we are repurposing the COW
> > fork to hold our atomic ranges. Basically a way for writeback path to
> > know that atomic write was done here.
Does that mean buffered atomic writes would cause fragmentation? Some common
database workloads, e.g. anything running on cheaper cloud storage, are pretty
sensitive to that due to the increased use of metered IOPS.
Greetings,
Andres Freund
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
@ 2026-02-16 15:57 ` Andres Freund
2026-02-17 18:39 ` Ojaswin Mujoo
2 siblings, 0 replies; 38+ messages in thread
From: Andres Freund @ 2026-02-16 15:57 UTC (permalink / raw)
To: Jan Kara
Cc: Ojaswin Mujoo, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Hi,
On 2026-02-16 12:38:59 +0100, Jan Kara wrote:
> On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > Another thing that came up is to consider using write through semantics
> > for buffered atomic writes, where we are able to transition page to
> > writeback state immediately after the write and avoid any other users to
> > modify the data till writeback completes. This might affect performance
> > since we won't be able to batch similar atomic IOs but maybe
> > applications like postgres would not mind this too much. If we go with
> > this approach, we will be able to avoid worrying too much about other
> > users changing atomic data underneath us.
> >
> > An argument against this however is that it is user's responsibility to
> > not do non atomic IO over an atomic range and this shall be considered a
> > userspace usage error. This is similar to how there are ways users can
> > tear a dio if they perform overlapping writes. [1].
>
> Yes, I was wondering whether the write-through semantics would make sense
> as well.
As outlined in
https://lore.kernel.org/all/zzvybbfy6bcxnkt4cfzruhdyy6jsvnuvtjkebdeqwkm6nfpgij@dlps7ucza22s/
that is something that would be useful for postgres even orthogonally to
atomic writes.
If this were the path to go with, I'd suggest adding an RWF_WRITETHROUGH and
requiring it to be set when using RWF_ATOMIC on a buffered write. That way,
if the kernel were to eventually support buffered atomic writes without
immediate writeback, the semantics to userspace wouldn't suddenly change.
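In API terms the suggestion might look like the sketch below. To be clear,
RWF_WRITETHROUGH is hypothetical (the flag value is a placeholder and the
kernel would reject the unknown flag today); RWF_ATOMIC exists, but so far
only for direct I/O, and it needs recent headers.

#define _GNU_SOURCE
#include <sys/uio.h>

#ifndef RWF_WRITETHROUGH
#define RWF_WRITETHROUGH 0x00000100	/* hypothetical placeholder value */
#endif

/* Proposed usage: a buffered untorn write that explicitly opts in to
 * immediate writeback, so a later relaxation of the write-through
 * behaviour would not silently change its semantics. */
static ssize_t atomic_writethrough_write(int fd, const struct iovec *iov,
					 int cnt, off_t off)
{
	return pwritev2(fd, iov, cnt, off, RWF_ATOMIC | RWF_WRITETHROUGH);
}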
> Intuitively it should make things simpler because you could
> practially reuse the atomic DIO write path. Only that you'd first copy
> data into the page cache and issue dio write from those folios. No need for
> special tracking of which folios actually belong together in atomic write,
> no need for cluttering standard folio writeback path, in case atomic write
> cannot happen (e.g. because you cannot allocate appropriately aligned
> blocks) you get the error back rightaway, ...
>
> Of course this all depends on whether such semantics would be actually
> useful for users such as PostgreSQL.
I think it would be useful for many workloads.
As noted in the linked message, there are some workloads where I am not sure
how the gains/costs would balance out (with a small PG buffer pool in a
write-heavy workload, we'd lose the ability to have the kernel avoid redundant
writes). It's possible that we could develop some heuristics to fall back to
doing our own torn-page avoidance in such cases, although it's not immediately
obvious to me what that heuristic would be. It's also not that common a
workload; it's *much* more common to have a read-heavy workload that has to
overflow into the kernel page cache, due to not being able to dedicate
sufficient memory to postgres.
Greetings,
Andres Freund
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav
2026-02-13 13:32 ` Ojaswin Mujoo
2026-02-15 9:01 ` Amir Goldstein
@ 2026-02-17 5:51 ` Christoph Hellwig
2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein
2026-02-20 10:08 ` Pankaj Raghav (Samsung)
3 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2026-02-17 5:51 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund,
djwong, john.g.garry, willy, hch, ritesh.list, jack, ojaswin,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
I think a better session would be how we can help postgres to move
off buffered I/O instead of adding more special cases for them.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 5:51 ` Christoph Hellwig
@ 2026-02-17 9:23 ` Amir Goldstein
2026-02-17 15:47 ` Andres Freund
2026-02-18 6:51 ` Christoph Hellwig
0 siblings, 2 replies; 38+ messages in thread
From: Amir Goldstein @ 2026-02-17 9:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, ritesh.list, jack,
ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev,
tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
>
> I think a better session would be how we can help postgres to move
> off buffered I/O instead of adding more special cases for them.
Respectfully, I disagree that DIO is the only possible solution.
Direct I/O is a legit solution for databases, and so is buffered I/O,
each with its own caveats.
Specifically, when two subsystems (kernel vfs and db) each require a huge
amount of cache memory for best performance, setting them up to play nicely
together to utilize system memory in an optimal way is a huge pain.
Thanks,
Amir.
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 15:45 ` Andres Freund
@ 2026-02-17 12:06 ` Jan Kara
2026-02-17 12:42 ` Pankaj Raghav
2026-02-17 16:13 ` Andres Freund
2026-02-17 18:33 ` Ojaswin Mujoo
1 sibling, 2 replies; 38+ messages in thread
From: Jan Kara @ 2026-02-17 12:06 UTC (permalink / raw)
To: Andres Freund
Cc: Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Mon 16-02-26 10:45:40, Andres Freund wrote:
> > Hmm, IIUC, postgres will write their dirty buffer cache by combining
> > multiple DB pages based on `io_combine_limit` (typically 128kb).
>
> We will try to do that, but it's obviously far from always possible, in some
> workloads [parts of ]the data in the buffer pool rarely will be dirtied in
> consecutive blocks.
>
> FWIW, postgres already tries to force some just-written pages into
> writeback. For sources of writes that can be plentiful and are done in the
> background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE),
> after 256kB-512kB of writes, as otherwise foreground latency can be
> significantly impacted by the kernel deciding to suddenly write back (due to
> dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise
> the fsyncs at the end of a checkpoint can be unpredictably slow. For
> foreground writes we do not default to that, as there are users that won't
> (because they don't know, because they overcommit hardware, ...) size
> postgres' buffer pool to be big enough and thus will often re-dirty pages that
> have already recently been written out to the operating systems. But for many
> workloads it's recommened that users turn on
> sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*).
>
> So for many workloads it'd be fine to just always start writeback for atomic
> writes immediately. It's possible, but I am not at all sure, that for most of
> the other workloads, the gains from atomic writes will outstrip the cost of
> more frequently writing data back.
OK, good. Then I think it's worth a try.
> (*) As it turns out, it often seems to improves write throughput as well, if
> writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> linux seems to often trigger a lot more small random IO.
>
> > So immediately writing them might be ok as long as we don't remove those
> > pages from the page cache like we do in RWF_UNCACHED.
>
> Yes, it might. I actually often have wished for something like a
> RWF_WRITEBACK flag...
I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
> > > An argument against this however is that it is user's responsibility to
> > > not do non atomic IO over an atomic range and this shall be considered a
> > > userspace usage error. This is similar to how there are ways users can
> > > tear a dio if they perform overlapping writes. [1].
>
> Hm, the scope of the prohibition here is not clear to me. Would it just
> be forbidden to do:
>
> P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC)
> P2: pwrite(fd, [any block in 1-10]), non-atomically
> P1: complete pwritev(fd, ...)
>
> or is it also forbidden to do:
>
> P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> Kernel: starts writeback but doesn't complete it
> P1: pwrite(fd, [any block in 1-10]), non-atomically
> Kernel: completes writeback
>
> The former is not at all an issue for postgres' use case, the pages in
> our buffer pool that are undergoing IO are locked, preventing additional
> IO (be it reads or writes) to those blocks.
>
> The latter would be a problem, since userspace wouldn't even know that
> here is still "atomic writeback" going on, afaict the only way we could
> avoid it would be to issue an f[data]sync(), which likely would be
> prohibitively expensive.
It somewhat depends on what outcome you expect in terms of crash safety :)
Unless we are careful, the RWF_ATOMIC write in your latter example can end
up writing some bits of the data from the second write because the second
write may be copying data to the pages as we issue DMA from them to the
device. I expect this isn't really acceptable because if you crash before
the second write fully makes it to the disk, you will have inconsistent
data. So what we can offer is to enable "stable pages" feature for the
filesystem (support for buffered atomic writes would be conditioned by
that) - that will block the second write until the IO is done so torn
writes cannot happen. If quick overwrites are rare, this should be a fine
option. If they are frequent, we'd need to come up with some bounce
buffering but things get ugly quickly there.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 12:06 ` Jan Kara
@ 2026-02-17 12:42 ` Pankaj Raghav
2026-02-17 16:21 ` Andres Freund
2026-02-17 16:13 ` Andres Freund
1 sibling, 1 reply; 38+ messages in thread
From: Pankaj Raghav @ 2026-02-17 12:42 UTC (permalink / raw)
To: Jan Kara, Andres Freund
Cc: Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain,
dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On 2/17/2026 1:06 PM, Jan Kara wrote:
> On Mon 16-02-26 10:45:40, Andres Freund wrote:
>>> Hmm, IIUC, postgres will write their dirty buffer cache by combining
>>> multiple DB pages based on `io_combine_limit` (typically 128kb).
>>
>> We will try to do that, but it's obviously far from always possible, in some
>> workloads [parts of ]the data in the buffer pool rarely will be dirtied in
>> consecutive blocks.
>>
>> FWIW, postgres already tries to force some just-written pages into
>> writeback. For sources of writes that can be plentiful and are done in the
>> background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE),
>> after 256kB-512kB of writes, as otherwise foreground latency can be
>> significantly impacted by the kernel deciding to suddenly write back (due to
>> dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise
>> the fsyncs at the end of a checkpoint can be unpredictably slow. For
>> foreground writes we do not default to that, as there are users that won't
>> (because they don't know, because they overcommit hardware, ...) size
>> postgres' buffer pool to be big enough and thus will often re-dirty pages that
>> have already recently been written out to the operating systems. But for many
>> workloads it's recommened that users turn on
>> sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*).
>>
>> So for many workloads it'd be fine to just always start writeback for atomic
>> writes immediately. It's possible, but I am not at all sure, that for most of
>> the other workloads, the gains from atomic writes will outstrip the cost of
>> more frequently writing data back.
>
> OK, good. Then I think it's worth a try.
>
>> (*) As it turns out, it often seems to improves write throughput as well, if
>> writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
>> linux seems to often trigger a lot more small random IO.
>>
>>> So immediately writing them might be ok as long as we don't remove those
>>> pages from the page cache like we do in RWF_UNCACHED.
>>
>> Yes, it might. I actually often have wished for something like a
>> RWF_WRITEBACK flag...
>
> I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
>
One naive question: semantically, what will be the difference between
RWF_DSYNC and RWF_WRITETHROUGH? So RWF_DSYNC will be the sync version and
RWF_WRITETHROUGH will be an async version where we kick off writeback
immediately in the background and return?
--
Pankaj
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein
@ 2026-02-17 15:47 ` Andres Freund
2026-02-17 22:45 ` Dave Chinner
2026-02-18 6:53 ` Christoph Hellwig
2026-02-18 6:51 ` Christoph Hellwig
1 sibling, 2 replies; 38+ messages in thread
From: Andres Freund @ 2026-02-17 15:47 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list,
jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
Hi,
On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > I think a better session would be how we can help postgres to move
> > off buffered I/O instead of adding more special cases for them.
FWIW, we are adding support for DIO (it's been added, but performance isn't
competitive for most workloads in the released versions yet; work to address
those issues is in progress).
But it's only really viable for larger setups, not for e.g.:
- smaller, unattended setups
- uses of postgres as part of a larger application on one server with hard to
predict memory usage of different components
- intentionally overcommitted shared hosting type scenarios
Even once a well-configured postgres using DIO beats postgres not using DIO,
I'll bet that well over 50% of users won't be able to use DIO.
There are some kernel issues that make it harder than necessary to use DIO,
btw:
Most prominently: With DIO concurrently extending multiple files leads to
quite terrible fragmentation, at least with XFS. Forcing us to
over-aggressively use fallocate(), truncating later if it turns out we need
less space. The fallocate in turn triggers slowness in the write paths, as
writing to uninitialized extents is a metadata operation. It'd be great if
the allocation behaviour with concurrent file extension could be improved and
if we could have a fallocate mode that forces extents to be initialized.
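To illustrate, the workaround looks very roughly like this (the chunk size is
made up and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Rough sketch of the workaround described above: preallocate well ahead of
 * what is currently needed, so that concurrently extending many files doesn't
 * interleave tiny extents, then give the excess back once the final size is
 * known. 64MB is an arbitrary example value. */
#define PREALLOC_CHUNK	(64 * 1024 * 1024)

static int extend_with_prealloc(int fd, off_t current_size)
{
	/* mode 0: allocate unwritten extents, extending the file size */
	return fallocate(fd, 0, current_size, PREALLOC_CHUNK);
}

static int trim_to_final_size(int fd, off_t final_size)
{
	return ftruncate(fd, final_size);
}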
A secondary issue is that with the buffer pool sizes necessary for DIO use on
bigger systems, creating the anonymous memory mapping becomes painfully slow
if we use MAP_POPULATE - which we kinda need to do, as otherwise performance
is very inconsistent initially (often iomap -> gup -> handle_mm_fault ->
folio_zero_user uses the majority of the CPU). We've been experimenting with
not using MAP_POPULATE and using multiple threads to populate the mapping in
parallel, but that doesn't feel like something that userspace ought to have to
do. It's easier to work around for us than the uninitialized extent
conversion issue, but it still is something we IMO shouldn't have to do.
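For reference, the mapping in question is essentially the following
(simplified; the real code uses explicit huge pages and a more involved shared
memory setup):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Simplified illustration of the buffer pool mapping. With MAP_POPULATE the
 * kernel pre-faults (and zeroes) the whole mapping up front, which is
 * painfully slow for very large pools; without it, the zeroing cost shows up
 * as inconsistent latency on first touch of each page. */
static void *alloc_buffer_pool(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}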
> Respectfully, I disagree that DIO is the only possible solution.
> Direct I/O is a legit solution for databases and so is buffered I/O
> each with their own caveats.
> Specifically, when two subsystems (kernel vfs and db) each require a huge
> amount of cache memory for best performance, setting them up to play nicely
> together to utilize system memory in an optimal way is a huge pain.
Yep.
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 12:06 ` Jan Kara
2026-02-17 12:42 ` Pankaj Raghav
@ 2026-02-17 16:13 ` Andres Freund
2026-02-17 18:27 ` Ojaswin Mujoo
2026-02-18 17:37 ` Jan Kara
1 sibling, 2 replies; 38+ messages in thread
From: Andres Freund @ 2026-02-17 16:13 UTC (permalink / raw)
To: Jan Kara
Cc: Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Hi,
On 2026-02-17 13:06:04 +0100, Jan Kara wrote:
> On Mon 16-02-26 10:45:40, Andres Freund wrote:
> > (*) As it turns out, it often seems to improve write throughput as well, if
> > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> > linux seems to often trigger a lot more small random IO.
> >
> > > So immediately writing them might be ok as long as we don't remove those
> > > pages from the page cache like we do in RWF_UNCACHED.
> >
> > Yes, it might. I actually often have wished for something like a
> > RWF_WRITEBACK flag...
>
> I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
Heh, that makes sense. I think that's what I actually was thinking of.
> > > > An argument against this however is that it is user's responsibility to
> > > > not do non atomic IO over an atomic range and this shall be considered a
> > > > userspace usage error. This is similar to how there are ways users can
> > > > tear a dio if they perform overlapping writes. [1].
> >
> > Hm, the scope of the prohibition here is not clear to me. Would it just
> > be forbidden to do:
> >
> > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC)
> > P2: pwrite(fd, [any block in 1-10]), non-atomically
> > P1: complete pwritev(fd, ...)
> >
> > or is it also forbidden to do:
> >
> > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> > Kernel: starts writeback but doesn't complete it
> > P1: pwrite(fd, [any block in 1-10]), non-atomically
> > Kernel: completes writeback
> >
> > The former is not at all an issue for postgres' use case, the pages in
> > our buffer pool that are undergoing IO are locked, preventing additional
> > IO (be it reads or writes) to those blocks.
> >
> > The latter would be a problem, since userspace wouldn't even know that
> > there is still "atomic writeback" going on, afaict the only way we could
> > avoid it would be to issue an f[data]sync(), which likely would be
> > prohibitively expensive.
>
> It somewhat depends on what outcome you expect in terms of crash safety :)
> Unless we are careful, the RWF_ATOMIC write in your latter example can end
> up writing some bits of the data from the second write because the second
> write may be copying data to the pages as we issue DMA from them to the
> device.
Hm. It's somewhat painful to not know when we can write in what mode again -
with DIO that's not an issue. I guess we could use
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
Although the semantics of the SFR flags aren't particularly clear, so maybe
not?
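I.e. something like this before reusing the range in a different mode (just a
sketch; whether this would actually be guaranteed to cover an in-flight atomic
writeback is exactly the question):

#define _GNU_SOURCE
#include <fcntl.h>

/* Wait for writeback that has already been submitted for this range, without
 * starting new writeback and without any data integrity guarantee. */
static int wait_for_prior_writeback(int fd, off_t off, off_t len)
{
	return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE);
}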
> I expect this isn't really acceptable because if you crash before
> the second write fully makes it to the disk, you will have inconsistent
> data.
The scenarios that I can think of that would lead us to do something like
this are when we are overwriting data without regard for the prior contents,
e.g.:
An already partially filled page is filled with more rows, we write that page
out, then all the rows are deleted, and we re-fill the page with new content
from scratch. Write it out again. With our existing logic we treat the second
write differently, because the entire contents of the page will be in the
journal, as there is no prior content that we care about.
A second scenario in which we might not use RWF_ATOMIC, if we carry today's
logic forward, is if a newly created relation is bulk loaded in the same
transaction that created the relation. If a crash were to happen while that
bulk load is ongoing, we don't care about the contents of the file(s), as it
will never be visible to anyone after crash recovery. In this case we won't
have prior RWF_ATOMIC writes - but we could have the opposite, i.e. an
RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page
cache. Would that be an issue?
It's possible we should just always use RWF_ATOMIC, even in the cases where
it's not needed from our side, to avoid potential performance penalties and
"undefined behaviour". I guess that will really depend on the performance
penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will
eventually be supported (as doing small writes during bulk loading is quite
expensive).
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 12:42 ` Pankaj Raghav
@ 2026-02-17 16:21 ` Andres Freund
2026-02-18 1:04 ` Dave Chinner
0 siblings, 1 reply; 38+ messages in thread
From: Andres Freund @ 2026-02-17 16:21 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Jan Kara, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Hi,
On 2026-02-17 13:42:35 +0100, Pankaj Raghav wrote:
> On 2/17/2026 1:06 PM, Jan Kara wrote:
> > On Mon 16-02-26 10:45:40, Andres Freund wrote:
> > > (*) As it turns out, it often seems to improve write throughput as well, if
> > > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> > > linux seems to often trigger a lot more small random IO.
> > >
> > > > So immediately writing them might be ok as long as we don't remove those
> > > > pages from the page cache like we do in RWF_UNCACHED.
> > >
> > > Yes, it might. I actually often have wished for something like a
> > > RWF_WRITEBACK flag...
> >
> > I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
> >
>
> One naive question: semantically what will be the difference between
> RWF_DSYNC and RWF_WRITETHROUGH? So RWF_DSYNC will be the sync version and
> RWF_WRITETHROUGH will be an async version where we kick off writeback
> immediately in the background and return?
Besides sync vs async:
If the device has a volatile write cache, RWF_DSYNC will trigger flushes for
the entire write cache or do FUA writes for just the RWF_DSYNC write. Which
wouldn't be needed for RWF_WRITETHROUGH, right?
I don't know if there will be devices that have a volatile write cache with
atomicity support for > 4kB, so maybe that's a distinction that's irrelevant
in practice for Postgres. But for 4kB writes, the difference in throughput
and individual IO latency you get from many SSDs between using FUA writes /
cache flushes and not doing so is enormous.
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 9:52 ` Pankaj Raghav
2026-02-16 15:45 ` Andres Freund
@ 2026-02-17 17:20 ` Ojaswin Mujoo
2026-02-18 17:42 ` [Lsf-pc] " Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-17 17:20 UTC (permalink / raw)
To: Pankaj Raghav
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund,
djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote:
> On 2/13/26 14:32, Ojaswin Mujoo wrote:
> > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> >> Hi all,
> >>
> >> Atomic (untorn) writes for Direct I/O have successfully landed in kernel
> >> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
> >> remains a contentious topic, with previous discussions often stalling due to
> >> concerns about complexity versus utility.
> >>
> >> I would like to propose a session to discuss the concrete use cases for
> >> buffered atomic writes and if possible, talk about the outstanding
> >> architectural blockers blocking the current RFCs[3][4].
> >
> > Hi Pankaj,
> >
> > Thanks for the proposal and glad to hear there is a wider interest in
> > this topic. We have also been actively working on this and I in middle
> > of testing and ironing out bugs in my RFC v2 for buffered atomic
> > writes, which is largely based on Dave's suggestions to maintain atomic
> > write mappings in FS layer (aka XFS COW fork). Infact I was going to
> > propose a discussion on this myself :)
> >
>
> Perfect.
>
> >>
> >> ## Use Case:
> >>
> >> A recurring objection to buffered atomics is the lack of a convincing use
> >> case, with the argument that databases should simply migrate to direct I/O.
> >> We have been working with PostgreSQL developer Andres Freund, who has
> >> highlighted a specific architectural requirement where buffered I/O remains
> >> preferable in certain scenarios.
> >
> > Looks like you have some nice insights to cover from postgres side which
> > filesystem community has been asking for. As I've also been working on
> > the kernel implementation side of it, do you think we could do a joint
> > session on this topic?
> >
> As one of the main pushback for this feature has been a valid usecase, the main
> outcome I would like to get out of this session is a community consensus on the use case
> for this feature.
>
> It looks like you already made quite a bit of progress with the CoW impl, so it
> would be great to if it can be a joint session.
Awesome!
>
>
> >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
> >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
> >> Based on the conversation/blockers we had before, the discussion at LSFMM
> >> should focus on the following blocking issues:
> >>
> >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
> >> write might span page boundaries. If memory pressure causes a page
> >> fault or reclaim mid-copy, the write could be torn inside the page
> >> cache before it even reaches the filesystem.
> >> - The current RFC uses a "pinning" approach: pinning user pages and
> >> creating a BVEC to ensure the full copy can proceed atomically.
> >> This adds complexity to the write path.
> >> - Discussion: Is this acceptable? Should we consider alternatives,
> >> such as requiring userspace to mlock the I/O buffers before
> >> issuing the write to guarantee atomic copy in the page cache?
> >
> > Right, I chose this approach because we only get to know about the short
> > copy after it has actually happened in copy_folio_from_iter_atomic()
> > and it seemed simpler to just not let the short copy happen. This is
> > inspired from how dio pins the pages for DMA, just that we do it
> > for a shorter time.
> >
> > It does add slight complexity to the path but I'm not sure if it's complex
> > enough to justify adding a hard requirement of having pages mlock'd.
> >
>
> As databases like postgres have a buffer cache that they manage in userspace,
> which is eventually used to do IO, I am wondering if they already do a mlock
> or some other way to guarantee the buffer cache does not get reclaimed. That is
> why I was thinking if we could make it a requirement. Of course, that also requires
> checking if the range is mlocked in the iomap_write_iter path.
Hmm, got it. I still feel it might be overkill for something we
already have a mechanism for and can achieve easily, but I'm open to
discussion on this :)
>
> >>
> >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
> >> PG_atomic page flag to track dirty pages requiring atomic writeback.
> >> This faced pushback due to page flags being a scarce resource[7].
> >> Furthermore, it was argued that atomic model does not fit the buffered
> >> I/O model because data sitting in the page cache is vulnerable to
> >> modification before writeback occurs, and writeback does not preserve
> >> application ordering[8].
> >> - Dave Chinner has proposed leveraging the filesystem's CoW path
> >> where we always allocate new blocks for the atomic write (forced
> >> CoW). If the hardware supports it (e.g., NVMe atomic limits), the
> >> filesystem can optimize the writeback to use REQ_ATOMIC in place,
> >> avoiding the CoW overhead while maintaining the architectural
> >> separation.
> >
> > Right, this is what I'm doing in the new RFC where we maintain the
> > mappings for atomic write in COW fork. This way we are able to utilize a
> > lot of existing infrastructure, however it does add some complexity to
> > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe
> > it is a tradeoff since the general consensus was mostly to avoid adding
> > too much complexity to iomap layer.
> >
> > Another thing that came up is to consider using write through semantics
> > for buffered atomic writes, where we are able to transition page to
> > writeback state immediately after the write and avoid any other users to
> > modify the data till writeback completes. This might affect performance
> > since we won't be able to batch similar atomic IOs but maybe
> > applications like postgres would not mind this too much. If we go with
> > this approach, we will be able to avoid worrying too much about other
> > users changing atomic data underneath us.
> >
>
> Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB
> pages based on `io_combine_limit` (typically 128kb). So immediately writing them
> might be ok as long as we don't remove those pages from the page cache like we do in
> RWF_UNCACHED.
Yep, and I've not looked at the code path much, but I think if we really
care about the user not changing the data between write and writeback then
we will probably need to start the writeback while holding the folio
lock, which is currently not done in RWF_UNCACHED.
>
>
> > An argument against this however is that it is user's responsibility to
> > not do non atomic IO over an atomic range and this shall be considered a
> > userspace usage error. This is similar to how there are ways users can
> > tear a dio if they perform overlapping writes. [1].
> >
> > That being said, I think these points are worth discussing and it would
> > be helpful to have people from postgres around while discussing these
> > semantics with the FS community members.
> >
> > As for ordering of writes, I'm not sure if that is something that
> > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly
> > been the task of userspace via fsync() and friends.
> >
>
> Agreed.
>
> >
> > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
> >
> >> - Discussion: While the CoW approach fits XFS and other CoW
> >> filesystems well, it presents challenges for filesystems like ext4
> >> which lack CoW capabilities for data. Should this be a filesystem
> >> specific feature?
> >
> > I believe your question is if we should have a hard dependency on COW
> > mappings for atomic writes. Currently, COW in atomic write context in
> > XFS, is used for these 2 things:
> >
> > 1. COW fork holds atomic write ranges.
> >
> > This is not strictly a COW feature, just that we are repurposing the COW
> > fork to hold our atomic ranges. Basically a way for writeback path to
> > know that atomic write was done here.
> >
> > COW fork is one way to do this but I believe every FS has a version of
> > in memory extent trees where such ephemeral atomic write mappings can be
> > held. The extent status cache is ext4's version of this, and can be used
> > to manage the atomic write ranges.
> >
> > There is an alternate suggestion that came up from discussions with Ted
> > and Darrick that we can instead use a generic side-car structure which
> > holds atomic write ranges. FSes can populate these during atomic writes
> > and query these in their writeback paths.
> >
> > This means for any FS operation (think truncate, falloc, mwrite, write
> > ...) we would need to keep this structure in sync, which can become pretty
> > complex pretty fast. I'm yet to implement this so not sure how it would
> > look in practice though.
> >
> > 2. COW feature as a whole enables software based atomic writes.
> >
> > This is something that ext4 won't be able to support (right now), just
> > like how we don't support software writes for dio.
> >
> > I believe Baokun and Yi are working on a feature that can eventually
> > enable COW writes in ext4 [2]. Till we have something like that, we
> > would have to rely on hardware support.
> >
> > Regardless, I don't think the ability to support or not support
> > software atomic writes largely depends on the filesystem so I'm not
> > sure how we can lift this up to a generic layer anyways.
> >
> > [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/
> >
>
> Thanks for the explanation. I am also planning to take a shot at the CoW approach. I would
> be more than happy to review and test if you send a RFC in the meantime.
Thanks Pankaj, I'm testing the current RFC internally. I think I'll have
something in the coming weeks and we can go over the design and how it looks
etc.
Regards,
ojaswin
>
> --
> Pankaj
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 16:13 ` Andres Freund
@ 2026-02-17 18:27 ` Ojaswin Mujoo
2026-02-17 18:42 ` Andres Freund
2026-02-18 17:37 ` Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-17 18:27 UTC (permalink / raw)
To: Andres Freund
Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Tue, Feb 17, 2026 at 11:13:07AM -0500, Andres Freund wrote:
> Hi,
>
> On 2026-02-17 13:06:04 +0100, Jan Kara wrote:
> > On Mon 16-02-26 10:45:40, Andres Freund wrote:
> > > (*) As it turns out, it often seems to improve write throughput as well, if
> > > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> > > linux seems to often trigger a lot more small random IO.
> > >
> > > > So immediately writing them might be ok as long as we don't remove those
> > > > pages from the page cache like we do in RWF_UNCACHED.
> > >
> > > Yes, it might. I actually often have wished for something like a
> > > RWF_WRITEBACK flag...
> >
> > I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
>
> Heh, that makes sense. I think that's what I actually was thinking of.
>
>
> > > > > An argument against this however is that it is user's responsibility to
> > > > > not do non atomic IO over an atomic range and this shall be considered a
> > > > > userspace usage error. This is similar to how there are ways users can
> > > > > tear a dio if they perform overlapping writes. [1].
> > >
> > > Hm, the scope of the prohibition here is not clear to me. Would it just
> > > be forbidden to do:
> > >
> > > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC)
> > > P2: pwrite(fd, [any block in 1-10]), non-atomically
> > > P1: complete pwritev(fd, ...)
> > >
> > > or is it also forbidden to do:
> > >
> > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> > > Kernel: starts writeback but doesn't complete it
> > > P1: pwrite(fd, [any block in 1-10]), non-atomically
> > > Kernel: completes writeback
> > >
> > > The former is not at all an issue for postgres' use case, the pages in
> > > our buffer pool that are undergoing IO are locked, preventing additional
> > > IO (be it reads or writes) to those blocks.
> > >
> > > The latter would be a problem, since userspace wouldn't even know that
> > > there is still "atomic writeback" going on, afaict the only way we could
> > > avoid it would be to issue an f[data]sync(), which likely would be
> > > prohibitively expensive.
> >
> > It somewhat depends on what outcome you expect in terms of crash safety :)
> > Unless we are careful, the RWF_ATOMIC write in your latter example can end
> > up writing some bits of the data from the second write because the second
> > write may be copying data to the pages as we issue DMA from them to the
> > device.
>
> Hm. It's somewhat painful to not know when we can write in what mode again -
> with DIO that's not an issue. I guess we could use
> sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
> Although the semantics of the SFR flags aren't particularly clear, so maybe
> not?
>
>
> > I expect this isn't really acceptable because if you crash before
> > the second write fully makes it to the disk, you will have inconsistent
> > data.
>
> The scenarios that I can think of that would lead us to do something like
> this are when we are overwriting data without regard for the prior contents,
> e.g.:
>
> An already partially filled page is filled with more rows, we write that page
> out, then all the rows are deleted, and we re-fill the page with new content
> from scratch. Write it out again. With our existing logic we treat the second
> write differently, because the entire contents of the page will be in the
> journal, as there is no prior content that we care about.
Hi Andres,
From my mental model and very high level understanding of Postgres' WAL
model [1] I am under the impression that for moving from full page
writes to RWF_ATOMIC, we would need to ensure that the **disk** write IO
of any data buffer goes out untorn.
Now, coming to your example, IIUC here we can actually tolerate doing
the 2nd write above non-atomically because it is already a sort of full
page write in the journal.
So let's say we do something like:
0. Buffer has some initial value on disk
1. Write new rows into buffer
2. Write the buffer as RWF_ATOMIC
3. Overwrite the complete buffer which will journal all the contents
4. Write the buffer as non RWF_ATOMIC
5. Crash
I think it is still possible to satisfy my assumption of the **disk** IO
being untorn. For example, we could have an RWF_ATOMIC implementation
where the data on disk after a crash is either the initial state from 0.
or the new value from 4. This is not strictly the old-or-new
semantic, but it still ensures the data is consistent.
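In syscall terms the sequence above would look roughly like this (sketch
only; block size and offset are made up, error handling is omitted, and it
assumes headers new enough to define RWF_ATOMIC):

#define _GNU_SOURCE
#include <sys/uio.h>

#define BLKSZ	8192	/* made-up DB block size */

static void write_sequence(int fd, void *buf, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = BLKSZ };

	/* step 2: untorn write of the buffer */
	pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

	/* step 3 happens in the application: the buffer is completely
	 * rewritten and its full contents end up in the WAL */

	/* step 4: plain, non-atomic write of the same block */
	pwritev2(fd, &iov, 1, off, 0);

	/* step 5: crash. The expectation discussed above is that the on-disk
	 * block is either the state from 0. or from 4., never a mix. */
}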
My naive understanding says that as long as the disk has consistent/untorn
data, like above, we can recover via the journal. In this case the
kernel implementation should be able to tolerate mixing of atomic and
non-atomic writes, but again I might be wrong here.
However, if the above guarantees are not enough and we actually care about
a true old-or-new semantic, we would need something like RWF_WRITETHROUGH
to ensure we truly get either the old or the new data.
[1] https://www.interdb.jp/pg/pgsql09/01.html
>
> A second scenario in which we might not use RWF_ATOMIC, if we carry today's
> logic forward, is if a newly created relation is bulk loaded in the same
> transaction that created the relation. If a crash were to happen while that
> bulk load is ongoing, we don't care about the contents of the file(s), as it
> will never be visible to anyone after crash recovery. In this case we won't
> have prior RWF_ATOMIC writes - but we could have the opposite, i.e. an
> RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page
> cache. Would that be an issue?
I think this is the same discussion as above.
>
>
> It's possible we should just always use RWF_ATOMIC, even in the cases where
> it's not needed from our side, to avoid potential performance penalties and
> "undefined behaviour". I guess that will really depend on the performance
> penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will
> eventually be supported (as doing small writes during bulk loading is quite
> expensive).
>
>
> Greetings,
>
> Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 15:45 ` Andres Freund
2026-02-17 12:06 ` Jan Kara
@ 2026-02-17 18:33 ` Ojaswin Mujoo
1 sibling, 0 replies; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-17 18:33 UTC (permalink / raw)
To: Andres Freund
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
djwong, john.g.garry, willy, hch, ritesh.list, jack,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Mon, Feb 16, 2026 at 10:45:40AM -0500, Andres Freund wrote:
> Hi,
>
> On 2026-02-16 10:52:35 +0100, Pankaj Raghav wrote:
> > On 2/13/26 14:32, Ojaswin Mujoo wrote:
> > > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> > >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
> > >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
> > >> Based on the conversation/blockers we had before, the discussion at LSFMM
> > >> should focus on the following blocking issues:
> > >>
> > >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
> > >> write might span page boundaries. If memory pressure causes a page
> > >> fault or reclaim mid-copy, the write could be torn inside the page
> > >> cache before it even reaches the filesystem.
> > >> - The current RFC uses a "pinning" approach: pinning user pages and
> > >> creating a BVEC to ensure the full copy can proceed atomically.
> > >> This adds complexity to the write path.
> > >> - Discussion: Is this acceptable? Should we consider alternatives,
> > >> such as requiring userspace to mlock the I/O buffers before
> > >> issuing the write to guarantee atomic copy in the page cache?
> > >
> > > Right, I chose this approach because we only get to know about the short
> > > copy after it has actually happened in copy_folio_from_iter_atomic()
> > > and it seemed simpler to just not let the short copy happen. This is
> > > inspired from how dio pins the pages for DMA, just that we do it
> > > for a shorter time.
> > >
> > > It does add slight complexity to the path but I'm not sure if it's complex
> > > enough to justify adding a hard requirement of having pages mlock'd.
> > >
> >
> > As databases like postgres have a buffer cache that they manage in userspace,
> > which is eventually used to do IO, I am wondering if they already do a mlock
> > or some other way to guarantee the buffer cache does not get reclaimed. That is
> > why I was thinking if we could make it a requirement. Of course, that also requires
> > checking if the range is mlocked in the iomap_write_iter path.
>
> We don't generally mlock our buffer pool - but we strongly recommend to use
> explicit huge pages (due to TLB pressure, faster fork() and less memory wasted
> on page tables), which afaict has basically the same effect. However, that
> doesn't make the page cache pages locked...
>
>
> > >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
> > >> PG_atomic page flag to track dirty pages requiring atomic writeback.
> > >> This faced pushback due to page flags being a scarce resource[7].
> > >> Furthermore, it was argued that atomic model does not fit the buffered
> > >> I/O model because data sitting in the page cache is vulnerable to
> > >> modification before writeback occurs, and writeback does not preserve
> > >> application ordering[8].
> > >> - Dave Chinner has proposed leveraging the filesystem's CoW path
> > >> where we always allocate new blocks for the atomic write (forced
> > >> CoW). If the hardware supports it (e.g., NVMe atomic limits), the
> > >> filesystem can optimize the writeback to use REQ_ATOMIC in place,
> > >> avoiding the CoW overhead while maintaining the architectural
> > >> separation.
> > >
> > > Right, this is what I'm doing in the new RFC where we maintain the
> > > mappings for atomic write in COW fork. This way we are able to utilize a
> > > lot of existing infrastructure, however it does add some complexity to
> > > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe
> > > it is a tradeoff since the general consensus was mostly to avoid adding
> > > too much complexity to iomap layer.
> > >
> > > Another thing that came up is to consider using write through semantics
> > > for buffered atomic writes, where we are able to transition page to
> > > writeback state immediately after the write and avoid any other users to
> > > modify the data till writeback completes. This might affect performance
> > > since we won't be able to batch similar atomic IOs but maybe
> > > applications like postgres would not mind this too much. If we go with
> > > this approach, we will be able to avoid worrying too much about other
> > > users changing atomic data underneath us.
> > >
> >
> > Hmm, IIUC, postgres will write their dirty buffer cache by combining
> > multiple DB pages based on `io_combine_limit` (typically 128kb).
>
> We will try to do that, but it's obviously far from always possible, in some
> workloads [parts of ]the data in the buffer pool rarely will be dirtied in
> consecutive blocks.
>
> FWIW, postgres already tries to force some just-written pages into
> writeback. For sources of writes that can be plentiful and are done in the
> background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE),
> after 256kB-512kB of writes, as otherwise foreground latency can be
> significantly impacted by the kernel deciding to suddenly write back (due to
> dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise
> the fsyncs at the end of a checkpoint can be unpredictably slow. For
> foreground writes we do not default to that, as there are users that won't
> (because they don't know, because they overcommit hardware, ...) size
> postgres' buffer pool to be big enough and thus will often re-dirty pages that
> have already recently been written out to the operating systems. But for many
> workloads it's recommended that users turn on
> sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*).
>
> So for many workloads it'd be fine to just always start writeback for atomic
> writes immediately. It's possible, but I am not at all sure, that for most of
> the other workloads, the gains from atomic writes will outstrip the cost of
> more frequently writing data back.
>
>
> (*) As it turns out, it often seems to improve write throughput as well, if
> writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> linux seems to often trigger a lot more small random IO.
>
>
> > So immediately writing them might be ok as long as we don't remove those
> > pages from the page cache like we do in RWF_UNCACHED.
>
> Yes, it might. I actually often have wished for something like a
> RWF_WRITEBACK flag...
>
>
> > > An argument against this however is that it is user's responsibility to
> > > not do non atomic IO over an atomic range and this shall be considered a
> > > userspace usage error. This is similar to how there are ways users can
> > > tear a dio if they perform overlapping writes. [1].
>
> Hm, the scope of the prohibition here is not clear to me. Would it just
> be forbidden to do:
>
> P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC)
> P2: pwrite(fd, [any block in 1-10]), non-atomically
> P1: complete pwritev(fd, ...)
>
> or is it also forbidden to do:
>
> P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> Kernel: starts writeback but doesn't complete it
> P1: pwrite(fd, [any block in 1-10]), non-atomically
> Kernel: completes writeback
>
> The former is not at all an issue for postgres' use case, the pages in our
> buffer pool that are undergoing IO are locked, preventing additional IO (be it
> reads or writes) to those blocks.
>
> The latter would be a problem, since userspace wouldn't even know that there is
> still "atomic writeback" going on, afaict the only way we could avoid it would
> be to issue an f[data]sync(), which likely would be prohibitively expensive.
>
>
>
> > > That being said, I think these points are worth discussing and it would
> > > be helpful to have people from postgres around while discussing these
> > > semantics with the FS community members.
> > >
> > > As for ordering of writes, I'm not sure if that is something that
> > > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly
> > > been the task of userspace via fsync() and friends.
> > >
> >
> > Agreed.
>
> From postgres' side that's fine. In the cases we care about ordering we use
> fsync() already.
>
>
> > > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
> > >
> > >> - Discussion: While the CoW approach fits XFS and other CoW
> > >> filesystems well, it presents challenges for filesystems like ext4
> > >> which lack CoW capabilities for data. Should this be a filesystem
> > >> specific feature?
> > >
> > > I believe your question is if we should have a hard dependency on COW
> > > mappings for atomic writes. Currently, COW in atomic write context in
> > > XFS, is used for these 2 things:
> > >
> > > 1. COW fork holds atomic write ranges.
> > >
> > > This is not strictly a COW feature, just that we are repurposing the COW
> > > fork to hold our atomic ranges. Basically a way for writeback path to
> > > know that atomic write was done here.
>
> Does that mean buffered atomic writes would cause fragmentation? Some common
> database workloads, e.g. anything running on cheaper cloud storage, are pretty
> sensitive to that due to the increase in use of the metered IOPS.
>
Hi Andres,
So we have tricks like allocating more blocks than needed, which helps
with fragmentation even when using the COW fork. I think we are able to tune
how aggressively we want to preallocate more blocks. Further, if we have, say,
fallocated a range in the file which satisfies our requirements, then we can
also upgrade to HW (non-COW) atomic writes and use the falloc'd extents,
which will also help with fragmentation.
My point being, I don't think COW usage will strictly mean more
fragmentation; however, we will eventually need to run benchmarks and see.
Hopefully once I have the implementation, we can work on these things.
Regards,
ojaswin
> Greetings,
>
> Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 13:18 ` Pankaj Raghav
@ 2026-02-17 18:36 ` Ojaswin Mujoo
0 siblings, 0 replies; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-17 18:36 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Jan Kara, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Mon, Feb 16, 2026 at 02:18:10PM +0100, Pankaj Raghav wrote:
>
>
> On 2/16/2026 12:38 PM, Jan Kara wrote:
> > Hi!
> >
> > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > > Another thing that came up is to consider using write through semantics
> > > for buffered atomic writes, where we are able to transition page to
> > > writeback state immediately after the write and avoid any other users to
> > > modify the data till writeback completes. This might affect performance
> > > since we won't be able to batch similar atomic IOs but maybe
> > > applications like postgres would not mind this too much. If we go with
> > > this approach, we will be able to avoid worrying too much about other
> > > users changing atomic data underneath us.
> > >
> > > An argument against this however is that it is user's responsibility to
> > > not do non atomic IO over an atomic range and this shall be considered a
> > > userspace usage error. This is similar to how there are ways users can
> > > tear a dio if they perform overlapping writes. [1].
> >
> > Yes, I was wondering whether the write-through semantics would make sense
> > as well. Intuitively it should make things simpler because you could
> > practically reuse the atomic DIO write path. Only that you'd first copy
> > data into the page cache and issue dio write from those folios. No need for
> > special tracking of which folios actually belong together in atomic write,
> > no need for cluttering standard folio writeback path, in case atomic write
> > cannot happen (e.g. because you cannot allocate appropriately aligned
> > blocks) you get the error back rightaway, ...
> >
> > Of course this all depends on whether such semantics would be actually
> > useful for users such as PostgreSQL.
>
> One issue might be the performance, especially if the atomic max unit is in
> the smaller end such as 16k or 32k (which is fairly common). But it will
> avoid the overlapping writes issue and can easily leverage the direct IO
> path.
>
> But one thing that postgres really cares about is the integrity of a
> database block. So if there is an IO that is a multiple of an atomic write
> unit (one atomic unit encapsulates the whole DB page), it is not a problem
> if tearing happens on the atomic boundaries. This fits very well with what
> NVMe calls Multiple Atomicity Mode (MAM) [1].
>
> We don't have any semantics for MaM at the moment but that could increase
> the performance as we can do larger IOs but still get the atomic guarantees
> certain applications care about.
Interesting. I think very early dio implementations did use
something of this sort, where (with awu_max = 4k) an atomic write of 16k would
result in 4 x 4k atomic writes.
I don't remember why it was shot down though :D
Regards,
ojaswin
>
>
> [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
2026-02-16 15:57 ` Andres Freund
@ 2026-02-17 18:39 ` Ojaswin Mujoo
2026-02-18 0:26 ` Dave Chinner
2 siblings, 1 reply; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-17 18:39 UTC (permalink / raw)
To: Jan Kara
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote:
> Hi!
>
> On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > Another thing that came up is to consider using write through semantics
> > for buffered atomic writes, where we are able to transition page to
> > writeback state immediately after the write and avoid any other users to
> > modify the data till writeback completes. This might affect performance
> > since we won't be able to batch similar atomic IOs but maybe
> > applications like postgres would not mind this too much. If we go with
> > this approach, we will be able to avoid worrying too much about other
> > users changing atomic data underneath us.
> >
> > An argument against this however is that it is user's responsibility to
> > not do non atomic IO over an atomic range and this shall be considered a
> > userspace usage error. This is similar to how there are ways users can
> > tear a dio if they perform overlapping writes. [1].
>
> Yes, I was wondering whether the write-through semantics would make sense
> as well. Intuitively it should make things simpler because you could
> practically reuse the atomic DIO write path. Only that you'd first copy
> data into the page cache and issue dio write from those folios. No need for
> special tracking of which folios actually belong together in atomic write,
> no need for cluttering standard folio writeback path, in case atomic write
> cannot happen (e.g. because you cannot allocate appropriately aligned
> blocks) you get the error back rightaway, ...
This is an interesting idea, Jan, and it also saves a lot of tracking of
atomic extents etc.
I'm unsure how much of a performance impact it'd have, but I'll
look into this.
Regards,
ojaswin
>
> Of course this all depends on whether such semantics would be actually
> useful for users such as PostgreSQL.
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 18:27 ` Ojaswin Mujoo
@ 2026-02-17 18:42 ` Andres Freund
0 siblings, 0 replies; 38+ messages in thread
From: Andres Freund @ 2026-02-17 18:42 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Hi,
On 2026-02-17 23:57:50 +0530, Ojaswin Mujoo wrote:
> From my mental model and very high level understanding of Postgres' WAL
> model [1] I am under the impression that for moving from full page
> writes to RWF_ATOMIC, we would need to ensure that the **disk** write IO
> of any data buffer goes out untorn.
Right.
> Now, coming to your example, IIUC here we can actually tolerate doing
> the 2nd write above non-atomically because it is already a sort of full
> page write in the journal.
>
> So let's say we do something like:
>
> 0. Buffer has some initial value on disk
> 1. Write new rows into buffer
> 2. Write the buffer as RWF_ATOMIC
> 3. Overwrite the complete buffer which will journal all the contents
> 4. Write the buffer as non RWF_ATOMIC
> 5. Crash
>
> I think it is still possible to satisfy my assumption of the **disk** IO
> being untorn. For example, we could have an RWF_ATOMIC implementation
> where the data on disk after a crash is either the initial state from 0.
> or the new value from 4. This is not strictly the old-or-new
> semantic, but it still ensures the data is consistent.
The way I understand Jan is that, unless we are careful with the write in 4),
the writeback of the RWF_ATOMIC write from 2) could still be in progress, with
the copy from userspace to the page cache in 4) happening in the middle of the
DMA for the write from 2), leading to a torn page on-disk, even though the
disk actually behaved correctly.
> My naive understanding says that as long as the disk has consistent/untorn
> data, like above, we can recover via the journal.
Yes, if that were true, we could recover. But if my understanding of Jan's
concern is right, that'd not necessarily be guaranteed.
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 15:47 ` Andres Freund
@ 2026-02-17 22:45 ` Dave Chinner
2026-02-18 4:10 ` Andres Freund
2026-02-18 6:53 ` Christoph Hellwig
1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2026-02-17 22:45 UTC (permalink / raw)
To: Andres Freund
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> Hi,
>
> On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > I think a better session would be how we can help postgres to move
> > > off buffered I/O instead of adding more special cases for them.
>
> FWIW, we are adding support for DIO (it's been added, but performance isn't
> competitive for most workloads in the released versions yet, work to address
> those issues is in progress).
>
> But it's only really viable for larger setups, not for e.g.:
> - smaller, unattended setups
> - uses of postgres as part of a larger application on one server with hard to
> predict memory usage of different components
> - intentionally overcommitted shared hosting type scenarios
>
> Even once a well configured postgres using DIO beats postgres not using DIO,
> I'll bet that well over 50% of users won't be able to use DIO.
>
>
> There are some kernel issues that make it harder than necessary to use DIO,
> btw:
>
> Most prominently: With DIO concurrently extending multiple files leads to
> quite terrible fragmentation, at least with XFS. Forcing us to
> over-aggressively use fallocate(), truncating later if it turns out we need
> less space.
<ahem>
seriously, fallocate() is considered harmful for exactly these sorts
of reasons. XFS has vastly better mechanisms built into it that
mitigate worst case fragmentation without needing to change
applications or increase runtime overhead.
So, let's go way back - 32 years ago to 1994:
commit 32766d4d387bc6779e0c432fb56a0cc4e6b96398
Author: Doug Doucette <doucette@engr.sgi.com>
Date: Thu Mar 3 22:17:15 1994 +0000
Add fcntl implementation (F_FSGETXATTR, F_FSSETXATTR, and F_DIOINFO).
Fix xfs_setattr new xfs fields' implementation to split out error checking
to the front of the routine, like the other attributes. Don't set new
fields in xfs_getattr unless one of the fields is requested.
.....
+ case F_FSSETXATTR: {
+ struct fsxattr fa;
+ vattr_t va;
+
+ if (copyin(arg, &fa, sizeof(fa))) {
+ error = EFAULT;
+ break;
+ }
+ va.va_xflags = fa.fsx_xflags;
+ va.va_extsize = fa.fsx_extsize;
^^^^^^^^^^^^^^^
+ error = xfs_setattr(vp, &va, AT_XFLAGS|AT_EXTSIZE, credp);
+ break;
+ }
This was the commit that added user controlled extent size hints to
XFS. These already existed in EFS, so applications using this
functionality go back even earlier in the 1990s.
So, let's set the extent size hint on a file to 1MB. Now whenever a
data extent allocation on that file is attempted, the extent size
that is allocated will be rounded up to the nearest 1MB. i.e. XFS
will try to allocate unwritten extents in aligned multiples of the
extent size hint regardless of the actual IO size being performed.
Hence if you are doing concurrent extending 8kB writes, instead of
allocating 8kB at a time, the extent size hint will force a 1MB
unwritten extent to be allocated out beyond EOF. The subsequent
extending 8kB writes to that file now hit that unwritten extent, and
only need to convert it to written. The same will happen for all
other concurrent extending writes - they will allocate in 1MB
chunks, not 8KB.
The result will be that the files will interleave 1MB sized extents
across files instead of 8kB sized extents. i.e. we've just reduced
the worst case fragmentation behaviour by a factor of 128. We've
also reduced allocation overhead by a factor of 128, so the use of
extent size hints results in the filesystem behaving in a far more
efficient way and hence this results in higher performance.
IOWs, the extent size hint effectively sets a minimum extent size
that the filesystem will create for a given file, thereby mitigating
the worst case fragmentation that can occur. However, the use of
fallocate() in the application explicitly prevents the filesystem
from doing this smart, transparent IO path thing to mitigate
fragmentation.
One of the most important properties of extent size hints is that
they can be dynamically tuned *without changing the application.*
The extent size hint is a property of the inode, and it can be set
by the admin through various XFS tools (e.g. mkfs.xfs for a
filesystem wide default, xfs_io to set it on a directory so all new
files/dirs created in that directory inherit the value, set it on
individual files, etc). It can be changed even whilst the file is in
active use by the application.
Hence the extent size hint can be changed at any time, and you
can apply it immediately to existing installations as an active
mitigation. Doing this won't fix existing fragmentation (that's what
xfs_fsr is for), but it will instantly mitigate/prevent new
fragmentation from occurring. It's much more difficult to do this
with applications that use fallocate()...
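For illustration, setting a hint is a one-liner from the command line, e.g.
xfs_io -c "extsize 1m" <file-or-dir>, or a couple of ioctls from the
application via the same fsxattr interface as the 1994 commit above (rough
sketch, error handling and the directory-inherit variant omitted):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Rough sketch: set a 1MB extent size hint on an already open file. */
static int set_extsize_hint(int fd, unsigned int hint_bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;	/* FS_XFLAG_EXTSZINHERIT for dirs */
	fsx.fsx_extsize = hint_bytes;		/* e.g. 1024 * 1024 */

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}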
Indeed, the case for using fallocate() instead of extent size hints
gets worse the more you look at how extent size hints work.
Extent size hints don't impact IO concurrency at all. Extent size
hints are only applied during extent allocation, so the optimisation
is applied naturally as part of the existing concurrent IO path.
Hence using extent size hints won't block/stall/prevent concurrent
async IO in any way.
fallocate(), OTOH, causes a full IO pipeline stall (blocks submission
of both reads and writes, then waits for all IO in flight to drain)
on that file for the duration of the syscall. You can't do any sort
of IO (async or otherwise) and run fallocate() at the same time, so
fallocate() really sucks from the POV of a high performance IO app.
fallocate() also marks the files as having persistent preallocation,
which means that when you close the file the filesystem does not
remove excessive extents allocated beyond EOF. Hence the reported
problems with excessive space usage and needing to truncate files
manually (which also causes a complete IO stall on that file) are
brought on specifically because fallocate() is being used by the
application to manage worst case fragmentation.
This problem does not exist with extent size hints - unused blocks
beyond EOF will be trimmed on last close or when the inode is cycled
out of cache, just like we do for excess speculative prealloc beyond
EOF for buffered writes (the buffered IO fragmentation mitigation
mechanism for interleaving concurrent extending writes).
The administrator can easily optimise extent size hints to match the
optimal characteristics of the underlying storage (e.g. set them to
be RAID stripe aligned), etc. Fallocate() requires the application
to provide tunables to modify it's behaviour for optimal storage
layout, and depending on how the application uses fallocate(), this
level of flexibility may not even be possible.
And let's not forget that an fallocate() based mitigation that helps
one filesystem type can actively hurt another type (e.g. ext4) by
introducing an application level extent allocation boundary vector
where there was none before.
Hence, IMO, micromanaging filesystem extent allocation with
fallocate() is -almost always- the wrong thing for applications to
be doing. There is no one "right way" to use fallocate() - what is
optimal for one filesystem will be pessimal for another, and it is
impossible to code optimal behaviour in the application for all
filesystem types the app might run on.
> The fallocate in turn triggers slowness in the write paths, as
> writing to uninitialized extents is a metadata operation.
That is not the problem you think it is. XFS is using unwritten
extents for all buffered IO writes that use delayed allocation, too,
and I don't see you complaining about that....
Yes, the overhead of unwritten extent conversion is more visible
with direct IO, but that's only because DIO has much lower overhead
and much, much higher performance ceiling than buffered IO. That
doesn't mean unwritten extents are a performance limiting factor...
> It'd be great if
> the allocation behaviour with concurrent file extension could be improved and
> if we could have a fallocate mode that forces extents to be initialized.
<sigh>
You mean like FALLOC_FL_WRITE_ZEROES?
That won't fix your fragmentation problem, and it has all the same
pipeline stall problems as allocating unwritten extents in
fallocate().
Only much worse now, because the IO pipeline is stalled for the
entire time it takes to write the zeroes to persistent storage. i.e.
long tail file access latencies will increase massively if you do
this regularly to extend files.
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 18:39 ` Ojaswin Mujoo
@ 2026-02-18 0:26 ` Dave Chinner
2026-02-18 6:49 ` Christoph Hellwig
2026-02-18 12:54 ` Ojaswin Mujoo
0 siblings, 2 replies; 38+ messages in thread
From: Dave Chinner @ 2026-02-18 0:26 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote:
> On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote:
> > Hi!
> >
> > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > > Another thing that came up is to consider using write through semantics
> > > for buffered atomic writes, where we are able to transition page to
> > > writeback state immediately after the write and avoid any other users to
> > > modify the data till writeback completes. This might affect performance
> > > since we won't be able to batch similar atomic IOs but maybe
> > > applications like postgres would not mind this too much. If we go with
> > > this approach, we will be able to avoid worrying too much about other
> > > users changing atomic data underneath us.
> > >
> > > An argument against this however is that it is user's responsibility to
> > > not do non atomic IO over an atomic range and this shall be considered a
> > > userspace usage error. This is similar to how there are ways users can
> > > tear a dio if they perform overlapping writes. [1].
> >
> > Yes, I was wondering whether the write-through semantics would make sense
> > as well. Intuitively it should make things simpler because you could
> > practically reuse the atomic DIO write path. Only that you'd first copy
> > data into the page cache and issue dio write from those folios. No need for
> > special tracking of which folios actually belong together in atomic write,
> > no need for cluttering standard folio writeback path, in case atomic write
> > cannot happen (e.g. because you cannot allocate appropriately aligned
> > blocks) you get the error back rightaway, ...
>
> This is an interesting idea, Jan, and it also saves a lot of tracking of
> atomic extents etc.
ISTR mentioning that we should be doing exactly this (grab page
cache pages, fill them and submit them through the DIO path) for
O_DSYNC buffered writethrough IO a long time ago. The context was
optimising buffered O_DSYNC to use the FUA optimisations in the
iomap DIO write path.
I suggested it again when discussing how RWF_DONTCACHE should be
implemented, because the async DIO write completion path invalidates
the page cache over the IO range. i.e. it would avoid the need to
use folio flags to track pages that needed invalidation at IO
completion...
I have a vague recollection of mentioning this early in the buffered
RWF_ATOMIC discussions, too, though that may have just been the
voices in my head.
Regardless, we are here again with proposals for RWF_ATOMIC and
RWF_WRITETHROUGH and a suggestion that maybe we should vector
buffered writethrough via the DIO path.....
Perhaps it's time to do this?
FWIW, the other thing that write-through via the DIO path enables is
true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes
block waiting on IO completion through generic_write_sync() ->
vfs_fsync_range(), even when issued through AIO paths. Vectoring it
through the DIO path avoids the blocking fsync path in IO submission
as it runs in the async DIO completion path if it is needed....
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 16:21 ` Andres Freund
@ 2026-02-18 1:04 ` Dave Chinner
2026-02-18 6:47 ` Christoph Hellwig
0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2026-02-18 1:04 UTC (permalink / raw)
To: Andres Freund
Cc: Pankaj Raghav, Jan Kara, Ojaswin Mujoo, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 11:21:20AM -0500, Andres Freund wrote:
> Hi,
>
> On 2026-02-17 13:42:35 +0100, Pankaj Raghav wrote:
> > On 2/17/2026 1:06 PM, Jan Kara wrote:
> > > On Mon 16-02-26 10:45:40, Andres Freund wrote:
> > > > (*) As it turns out, it often seems to improve write throughput as well, if
> > > > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> > > > linux seems to often trigger a lot more small random IO.
> > > >
> > > > > So immediately writing them might be ok as long as we don't remove those
> > > > > pages from the page cache like we do in RWF_UNCACHED.
> > > >
> > > > Yes, it might. I actually often have wished for something like a
> > > > RWF_WRITEBACK flag...
> > >
> > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
> > >
> >
> > One naive question: semantically what will be the difference between
> > RWF_DSYNC and RWF_WRITETHROUGH?
None, except that RWF_DSYNC provides data integrity guarantees.
> > So RWF_DSYNC will be the sync version and
> > RWF_WRITETHROUGH will be an async version where we kick off writeback
> > immediately in the background and return?
No.
Write-through implies synchronous IO. i.e. that IO errors are
reported immediately to the caller, not reported on the next
operation on the file.
O_DSYNC integrity writes are, by definition, write-through
(synchronous) because they have to report physical IO completion
status to the caller. This is kinda how "synchronous" got associated
with data integrity in the first place.
DIO writes are also write-through - there is nowhere to store an IO
error for later reporting, so they must be executed synchronously to
be able to report IO errors to the caller.
Hence write-through generally implies synchronous IO, but it does
not imply any data integrity guarantees are provided for the IO.
If you want async RWF_WRITETHROUGH semantics, then the IO needs to
be issued through an async IO submission interface (i.e. AIO or
io_uring). In that case, the error status will be reported through
the AIO completion, just like for DIO writes.
IOWs, RWF_WRITETHROUGH should result in buffered writes displaying
identical IO semantics to DIO writes. In doing this, we then only
need one IO path implementation per filesystem for all writethrough
IO (buffered or direct) and the only thing that differs is the folios
we attach to the bios.
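As a rough sketch of how that would look from userspace - RWF_WRITETHROUGH
is hypothetical here, the placeholder define below exists only so the
example compiles - the flag travels in the same per-IO slot that RWF_DSYNC
and RWF_ATOMIC use today, and the IO result comes back via the completion:

/* sketch only: RWF_WRITETHROUGH is a proposed flag with no kernel value yet */
#include <errno.h>
#include <liburing.h>
#include <sys/uio.h>

#ifndef RWF_WRITETHROUGH
#define RWF_WRITETHROUGH 0x100    /* placeholder value, illustration only */
#endif

static int writethrough_write(struct io_uring *ring, int fd,
                              const struct iovec *iov, unsigned nr, off_t off)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        if (!sqe)
                return -EAGAIN;
        io_uring_prep_writev(sqe, fd, iov, nr, off);
        sqe->rw_flags = RWF_WRITETHROUGH;    /* same slot RWF_DSYNC uses */

        io_uring_submit(ring);
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
                return ret;
        ret = cqe->res;      /* IO errors surface here, just as for DIO */
        io_uring_cqe_seen(ring, cqe);
        return ret;
}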
> Besides sync vs async:
>
> If the device has a volatile write cache, RWF_DSYNC will trigger flushes for
> the entire write cache or do FUA writes for just the RWF_DSYNC write.
Yes, that is exactly how the iomap DIO write path optimises
RWF_DSYNC writes. It's much harder to do this for buffered IO using
the generic buffered writeback paths and buffered writes never use
FUA writes.
i.e., using the iomap DIO path for RWF_WRITETHROUGH | RWF_DSYNC
would bring these significant performance optimisations to buffered
writes as well...
> Which
> wouldn't be needed for RWF_WRITETHROUGH, right?
Correct, there shouldn't be any data integrity guarantees associated
with plain RWF_WRITETHROUGH.
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 22:45 ` Dave Chinner
@ 2026-02-18 4:10 ` Andres Freund
0 siblings, 0 replies; 38+ messages in thread
From: Andres Freund @ 2026-02-18 4:10 UTC (permalink / raw)
To: Dave Chinner
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
Hi,
On 2026-02-18 09:45:46 +1100, Dave Chinner wrote:
> On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> > There are some kernel issues that make it harder than necessary to use DIO,
> > btw:
> >
> > Most prominently: With DIO concurrently extending multiple files leads to
> > quite terrible fragmentation, at least with XFS. Forcing us to
> > over-aggressively use fallocate(), truncating later if it turns out we need
> > less space.
>
> <ahem>
>
> seriously, fallocate() is considered harmful for exactly these sorts
> of reasons. XFS has vastly better mechanisms built into it that
> mitigate worst case fragmentation without needing to change
> applications or increase runtime overhead.
There's probably a misunderstanding here: We don't do fallocate to avoid
fragmentation.
We want to guarantee that there's space for data that is in our buffer pool,
as otherwise it's very easy to get into a pickle:
If there is dirty data in the buffer pool that can't be written out due to
ENOSPC, the subsequent checkpoint can't complete. So the system may be stuck
because you're not able to create more space for WAL / journaling, you
can't free up old WAL due to the checkpoint not being able to complete, and if
you react to that with a crash-recovery cycle you're likely to be unable to
complete crash recovery because you'll just hit ENOSPC again.
And yes, CoW filesystems make that less reliable, but it turns out to still
save people often enough that I doubt we can get rid of it.
To ensure there's space for the write out of our buffer pool we have two
choices:
1) write out zeroes
2) use fallocate
Writing out zeroes that we will just overwrite later is obviously not a
particularly good use of IO bandwidth, particularly on metered cloud
"storage". But using fallocate() has fragmentation and unwritten-extent
issues. Our compromise is that we use fallocate iff we enlarge the relation
by a decent number of pages at once and write zeroes otherwise.
Is that perfect? Hell no. But it's also not obvious what a better answer is
with today's interfaces.
If there were a "guarantee that N additional blocks are reserved, but not
concretely allocated" interface, we'd gladly use it.
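For reference, a minimal sketch of the compromise described above - the
512kB threshold, chunk size and helper name are invented for illustration;
fallocate(2) and pwrite(2) are the only real interfaces involved:

/* Sketch: reserve space with fallocate() only for larger extensions, write
 * real zeroes otherwise. Threshold and helper name are invented. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define EXTEND_FALLOCATE_THRESHOLD      (512 * 1024)

static int extend_relation(int fd, off_t old_size, off_t new_size)
{
        off_t len = new_size - old_size;

        if (len >= EXTEND_FALLOCATE_THRESHOLD) {
                /* mode 0: allocate (unwritten) extents now, so ENOSPC is hit
                 * here rather than at checkpoint time */
                if (fallocate(fd, 0, old_size, len) == 0)
                        return 0;
                if (errno != EOPNOTSUPP)
                        return -errno;
                /* fall back to zero-filling */
        }

        char zeroes[8192];
        memset(zeroes, 0, sizeof(zeroes));
        for (off_t off = old_size; off < new_size; off += sizeof(zeroes)) {
                size_t left = (size_t)(new_size - off);
                size_t chunk = left < sizeof(zeroes) ? left : sizeof(zeroes);

                if (pwrite(fd, zeroes, chunk, off) < 0)
                        return -errno;
        }
        return 0;
}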
> So, let's set the extent size hint on a file to 1MB. Now whenever a
> data extent allocation on that file is attempted, the extent size
> that is allocated will be rounded up to the nearest 1MB. i.e. XFS
> will try to allocate unwritten extents in aligned multiples of the
> extent size hint regardless of the actual IO size being performed.
>
> Hence if you are doing concurrent extending 8kB writes, instead of
> allocating 8kB at a time, the extent size hint will force a 1MB
> unwritten extent to be allocated out beyond EOF. The subsequent
> extending 8kB writes to that file now hit that unwritten extent, and
> only need to convert it to written. The same will happen for all
> other concurrent extending writes - they will allocate in 1MB
> chunks, not 8KB.
We could probably benefit from that.
> One of the most important properties of extent size hints is that
> they can be dynamically tuned *without changing the application.*
> The extent size hint is a property of the inode, and it can be set
> by the admin through various XFS tools (e.g. mkfs.xfs for a
> filesystem wide default, xfs_io to set it on a directory so all new
> files/dirs created in that directory inherit the value, set it on
> individual files, etc). It can be changed even whilst the file is in
> active use by the application.
IME our users run enough postgres instances, across a lot of differing
workloads, that manual tuning like that will rarely if ever happen :(. I miss
well-educated DBAs :(. A large portion of users don't even have direct
access to the server, only via the postgres protocol...
If we were to use these hints, it'd have to happen automatically from within
postgres. That does seem viable, but it is certainly also not exactly
filesystem-independent...
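For what it's worth, the per-file hint can be driven from the application
through the generic FS_IOC_FS[GS]ETXATTR ioctls (the same knob xfs_io's
extsize command sets). A minimal sketch, assuming the 1MB value from Dave's
example and that the hint is set before the file has data extents:

/* Sketch: set an extent size hint on an fd from the application itself. */
#include <errno.h>
#include <linux/fs.h>
#include <sys/ioctl.h>

static int set_extsize_hint(int fd, unsigned int bytes)
{
        struct fsxattr fsx;

        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return -errno;
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = bytes;        /* e.g. 1024 * 1024 */
        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
                return -errno;          /* e.g. EINVAL/EOPNOTSUPP elsewhere */
        return 0;
}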
> > The fallocate in turn triggers slowness in the write paths, as
> > writing to uninitialized extents is a metadata operation.
>
> That is not the problem you think it is. XFS is using unwritten
> extents for all buffered IO writes that use delayed allocation, too,
> and I don't see you complaining about that....
It's a problem for buffered IO as well, just a bit harder to hit on many
drives, because buffered O_DSYNC writes don't use FUA.
If you need any durable writes into a file with unwritten extents, things get
painful very fast.
See a few paragraphs below for the most crucial case where we need to make
sure writes are durable.
testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do
    echo buffered: $buffered overwrite: $overwrite
    rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=psync \
        --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 \
        --rw=write --size=64MB --sync=dsync --name pg-extend \
        --overwrite=$overwrite | grep IOPS
done; done
buffered: 0 overwrite: 0
write: IOPS=1427, BW=5709KiB/s (5846kB/s)(64.0MiB/11479msec); 0 zone resets
buffered: 0 overwrite: 1
write: IOPS=4025, BW=15.7MiB/s (16.5MB/s)(64.0MiB/4070msec); 0 zone resets
buffered: 1 overwrite: 0
write: IOPS=1638, BW=6554KiB/s (6712kB/s)(64.0MiB/9999msec); 0 zone resets
buffered: 1 overwrite: 1
write: IOPS=3663, BW=14.3MiB/s (15.0MB/s)(64.0MiB/4472msec); 0 zone resets
That's a > 2x throughput difference. And the results would be similar with
--fdatasync=1.
If you add AIO to the mix, the difference gets way bigger, particularly on
drives with FUA support and DIO:
testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do
    echo buffered: $buffered overwrite: $overwrite
    rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=io_uring \
        --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 \
        --rw=write --size=64MB --sync=dsync --name pg-extend \
        --overwrite=$overwrite --iodepth 32 | grep IOPS
done; done
buffered: 0 overwrite: 0
write: IOPS=6143, BW=24.0MiB/s (25.2MB/s)(64.0MiB/2667msec); 0 zone resets
buffered: 0 overwrite: 1
write: IOPS=76.6k, BW=299MiB/s (314MB/s)(64.0MiB/214msec); 0 zone resets
buffered: 1 overwrite: 0
write: IOPS=1835, BW=7341KiB/s (7517kB/s)(64.0MiB/8928msec); 0 zone resets
buffered: 1 overwrite: 1
write: IOPS=4096, BW=16.0MiB/s (16.8MB/s)(64.0MiB/4000msec); 0 zone resets
It's less bad, but still quite a noticeable difference, on drives without
volatile caches. And it's often worse on networked storage, whether it has a
volatile cache or not.
> > It'd be great if
> > the allocation behaviour with concurrent file extension could be improved and
> > if we could have a fallocate mode that forces extents to be initialized.
>
> <sigh>
>
> You mean like FALLOC_FL_WRITE_ZEROES?
I hadn't seen that it was merged, that's great! It doesn't yet seem to be
documented in the fallocate(2) man page, which I had checked...
Hm, also doesn't seem to work on xfs yet :(, EOPNOTSUPP.
> That won't fix your fragmentation problem, and it has all the same pipeline
> stall problems as allocating unwritten extents in fallocate().
The primary case where FALLOC_FL_WRITE_ZEROES would be useful is WAL file
creation; WAL files are always of the same fixed size (therefore no
fragmentation risk).
To avoid metadata operations during our commit path, we today default to
forcing WAL files to be fully allocated by overwriting them with zeros and
fsyncing them. Not ensuring that the extents are already written would have
a very large perf penalty (as in ~2-3x for OLTP workloads, on XFS). That's
true both when using DIO and when not.
To avoid having to do that zeroing over and over, we recycle WAL files once
they're not needed anymore. Unfortunately this means that when those WAL
files are not yet preallocated (or when we release them during low activity),
performance is rather noticeably worsened by the additional IO for
pre-zeroing the WAL files. In theory FALLOC_FL_WRITE_ZEROES should be faster
than issuing writes for the whole range.
> Only much worse now, because the IO pipeline is stalled for the
> entire time it takes to write the zeroes to persistent storage. i.e.
> long tail file access latencies will increase massively if you do
> this regularly to extend files.
In the WAL path we fsync at the point we could use FALLOC_FL_WRITE_ZEROES, as
otherwise the WAL segment might not exist after a crash, which would be
... bad.
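A minimal sketch of what WAL segment preallocation could look like once
FALLOC_FL_WRITE_ZEROES works on the relevant filesystems (it currently
returns EOPNOTSUPP on XFS, as noted above); the segment size, chunk size and
helper name are assumptions, and the fallback mirrors the
zero-write-plus-fdatasync approach described earlier:

/* Sketch: prefer FALLOC_FL_WRITE_ZEROES, fall back to writing zeroes; either
 * way the result is fdatasync'd so the segment survives a crash. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>

#define WAL_SEGMENT_SIZE        (16 * 1024 * 1024)      /* assumed size */

static int preallocate_wal_segment(int fd)
{
#ifdef FALLOC_FL_WRITE_ZEROES
        if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, WAL_SEGMENT_SIZE) == 0)
                return fdatasync(fd) ? -errno : 0;
        if (errno != EOPNOTSUPP)
                return -errno;
#endif
        static const char zeroes[128 * 1024];   /* zero-initialised */

        for (off_t off = 0; off < WAL_SEGMENT_SIZE; off += sizeof(zeroes))
                if (pwrite(fd, zeroes, sizeof(zeroes), off) < 0)
                        return -errno;
        return fdatasync(fd) ? -errno : 0;
}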
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 1:04 ` Dave Chinner
@ 2026-02-18 6:47 ` Christoph Hellwig
2026-02-18 23:42 ` Dave Chinner
0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2026-02-18 6:47 UTC (permalink / raw)
To: Dave Chinner
Cc: Andres Freund, Pankaj Raghav, Jan Kara, Ojaswin Mujoo, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 12:04:43PM +1100, Dave Chinner wrote:
> > > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
> > > >
> > >
> > > One naive question: semantically what will be the difference between
> > > RWF_DSYNC and RWF_WRITETHROUGH?
>
> None, except that RWF_DSYNC provides data integrity guarantees.
Which boils down to RWF_DSYNC still writing out the inode and flushing
the cache.
> > Which
> > wouldn't be needed for RWF_WRITETHROUGH, right?
>
> Correct, there shouldn't be any data integrity guarantees associated
> with plain RWF_WRITETHROUGH.
Which makes me curious if the plain RWF_WRITETHROUGH would be all
that useful.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 0:26 ` Dave Chinner
@ 2026-02-18 6:49 ` Christoph Hellwig
2026-02-18 12:54 ` Ojaswin Mujoo
1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2026-02-18 6:49 UTC (permalink / raw)
To: Dave Chinner
Cc: Ojaswin Mujoo, Jan Kara, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry,
willy, hch, ritesh.list, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote:
> ISTR mentioning that we should be doing exactly this (grab page
> cache pages, fill them and submit them through the DIO path) for
> O_DSYNC buffered writethrough IO a long time ago. The context was
Yes, multiple times. And I did a few more times since then.
> Regardless, we are here again with proposals for RWF_ATOMIC and
> RWF_WRITETHROUGH and a suggestion that maybe we should vector
> buffered writethrough via the DIO path.....
>
> Perhaps it's time to do this?
Yes.
> FWIW, the other thing that write-through via the DIO path enables is
> true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes
> block waiting on IO completion through generic_write_sync() ->
> vfs_fsync_range(), even when issued through AIO paths. Vectoring it
> through the DIO path avoids the blocking fsync path in IO submission
> as it runs in the async DIO completion path if it is needed....
It's only true if we can do the page cache updates non-blocking, but
in many cases that should indeed be possible.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein
2026-02-17 15:47 ` Andres Freund
@ 2026-02-18 6:51 ` Christoph Hellwig
1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2026-02-18 6:51 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry,
willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 10:23:36AM +0100, Amir Goldstein wrote:
> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > I think a better session would be how we can help postgres to move
> > off buffered I/O instead of adding more special cases for them.
>
> Respectfully, I disagree that DIO is the only possible solution.
> Direct I/O is a legit solution for databases and so is buffered I/O
> each with their own caveats.
Maybe. Classic buffered I/O is not a legit solution for doing atomic
I/Os, and if Postgres is desperate to use that, something like direct
I/O (including the proposed write-through semantics) is the only sensible
choice.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 15:47 ` Andres Freund
2026-02-17 22:45 ` Dave Chinner
@ 2026-02-18 6:53 ` Christoph Hellwig
1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2026-02-18 6:53 UTC (permalink / raw)
To: Andres Freund
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> Most prominently: With DIO concurrently extending multiple files leads to
> quite terrible fragmentation, at least with XFS. Forcing us to
> over-aggressively use fallocate(), truncating later if it turns out we need
> less space. The fallocate in turn triggers slowness in the write paths, as
> writing to uninitialized extents is a metadata operation. It'd be great if
> the allocation behaviour with concurrent file extension could be improved and
> if we could have a fallocate mode that forces extents to be initialized.
As Dave already mentioned, if you do concurrent allocations (extension
or hole filling), setting an extent size hint is probably a good idea.
We could try to look into heuristics, but chances are that they would
degrade other use cases. Details would be useful as a report on the
XFS list.
>
> A secondary issue is that with the buffer pool sizes necessary for DIO use on
> bigger systems, creating the anonymous memory mapping becomes painfully slow
> if we use MAP_POPULATE - which we kinda need to do, as otherwise performance
> is very inconsistent initially (often iomap -> gup -> handle_mm_fault ->
> folio_zero_user uses the majority of the CPU). We've been experimenting with
> not using MAP_POPULATE and using multiple threads to populate the mapping in
> parallel, but that feels not like something that userspace ought to have to
> do. It's easier for us to work around than the uninitialized extent
> conversion issue, but it still is something we IMO shouldn't have to do.
Please report this to linux-mm.
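For reference, a rough sketch of the userspace workaround described above -
prefaulting the mapping from several threads instead of relying on
MAP_POPULATE. The thread count, 4kB touch stride and equal chunking are
assumptions, not what postgres actually does:

/* Sketch: fault in a large anonymous mapping from several threads instead of
 * using MAP_POPULATE. Assumes size is a multiple of the thread count. */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define PREFAULT_THREADS        8
#define ASSUMED_PAGE_SIZE       4096

struct prefault_arg { char *base; size_t len; };

static void *prefault(void *p)
{
        struct prefault_arg *a = p;

        for (size_t off = 0; off < a->len; off += ASSUMED_PAGE_SIZE)
                a->base[off] = 0;       /* touch one byte per page */
        return NULL;
}

static void *alloc_buffer_pool(size_t size)
{
        char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_t tids[PREFAULT_THREADS];
        struct prefault_arg args[PREFAULT_THREADS];
        size_t chunk = size / PREFAULT_THREADS;

        if (buf == MAP_FAILED)
                return NULL;
        for (int i = 0; i < PREFAULT_THREADS; i++) {
                args[i] = (struct prefault_arg){ buf + i * chunk, chunk };
                pthread_create(&tids[i], NULL, prefault, &args[i]);
        }
        for (int i = 0; i < PREFAULT_THREADS; i++)
                pthread_join(tids[i], NULL);
        return buf;
}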
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 0:26 ` Dave Chinner
2026-02-18 6:49 ` Christoph Hellwig
@ 2026-02-18 12:54 ` Ojaswin Mujoo
1 sibling, 0 replies; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-18 12:54 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote:
> On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote:
> > On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote:
> > > Hi!
> > >
> > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > > > Another thing that came up is to consider using write through semantics
> > > > for buffered atomic writes, where we are able to transition the page to
> > > > writeback state immediately after the write and prevent any other users
> > > > from modifying the data till writeback completes. This might affect performance
> > > > since we won't be able to batch similar atomic IOs but maybe
> > > > applications like postgres would not mind this too much. If we go with
> > > > this approach, we will be able to avoid worrying too much about other
> > > > users changing atomic data underneath us.
> > > >
> > > > An argument against this, however, is that it is the user's responsibility to
> > > > not do non-atomic IO over an atomic range, and this shall be considered a
> > > > userspace usage error. This is similar to how there are ways users can
> > > > tear a dio if they perform overlapping writes. [1].
> > >
> > > Yes, I was wondering whether the write-through semantics would make sense
> > > as well. Intuitively it should make things simpler because you could
> > > > practically reuse the atomic DIO write path. Only that you'd first copy
> > > data into the page cache and issue dio write from those folios. No need for
> > > special tracking of which folios actually belong together in atomic write,
> > > no need for cluttering standard folio writeback path, in case atomic write
> > > cannot happen (e.g. because you cannot allocate appropriately aligned
> > > > blocks) you get the error back right away, ...
> >
> > This is an interesting idea Jan and also saves a lot of tracking of
> > atomic extents etc.
>
> ISTR mentioning that we should be doing exactly this (grab page
> cache pages, fill them and submit them through the DIO path) for
> O_DSYNC buffered writethrough IO a long time ago. The context was
> optimising buffered O_DSYNC to use the FUA optimisations in the
> iomap DIO write path.
>
> I suggested it again when discussing how RWF_DONTCACHE should be
> implemented, because the async DIO write completion path invalidates
> the page cache over the IO range. i.e. it would avoid the need to
> use folio flags to track pages that needed invalidation at IO
> completion...
>
> I have a vague recollection of mentioning this early in the buffered
> RWF_ATOMIC discussions, too, though that may have just been the
> voices in my head.
Hi Dave,
Yes, we did discuss this [1] :)
We also discussed the alternative of using the COW fork path for atomic
writes [2]. Since at that point I was not completely sure whether the
writethrough approach would become too restrictive, I was working
on a COW fork implementation.
However, from the discussion here as well as Andres' comments, it seems
like write-through might not be too bad for postgres.
>
> Regardless, we are here again with proposals for RWF_ATOMIC and
> RWF_WRITETHROUGH and a suggestion that maybe we should vector
> buffered writethrough via the DIO path.....
>
> Perhaps it's time to do this?
I agree that it makes more sense to do writethrough if we want to have
the strict old-or-new semantics (as opposed to just untorn IO
semantics). I'll work on a POC for this approach of doing atomic writes;
I'll mostly try to base it off your suggestions in [1].
FWIW, I do have a somewhat working (although untested and possibly
broken in some places) POC for performing atomic writes via the XFS COW
fork, based on suggestions from Dave [2]. Even though we want to explore
the writethrough approach, I'll share it here in case anyone is
interested in what the design looks like:
https://github.com/OjaswinM/linux/commits/iomap-buffered-atomic-rfc2.3/
(If anyone prefers that I send this as a patchset on the mailing list, let
me know)
Regards,
ojaswin
[1] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/
[2] https://lore.kernel.org/linux-fsdevel/aRuKz4F3xATf8IUp@dread.disaster.area/
>
> FWIW, the other thing that write-through via the DIO path enables is
> true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes
> block waiting on IO completion through generic_write_sync() ->
> vfs_fsync_range(), even when issued through AIO paths. Vectoring it
> through the DIO path avoids the blocking fsync path in IO submission
> as it runs in the async DIO completion path if it is needed....
>
> -Dave.
> --
> Dave Chinner
> dgc@kernel.org
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 16:13 ` Andres Freund
2026-02-17 18:27 ` Ojaswin Mujoo
@ 2026-02-18 17:37 ` Jan Kara
2026-02-18 21:04 ` Andres Freund
2026-02-19 0:32 ` Dave Chinner
1 sibling, 2 replies; 38+ messages in thread
From: Jan Kara @ 2026-02-18 17:37 UTC (permalink / raw)
To: Andres Freund
Cc: Jan Kara, Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Tue 17-02-26 11:13:07, Andres Freund wrote:
> > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> > > Kernel: starts writeback but doesn't complete it
> > > P1: pwrite(fd, [any block in 1-10]), non-atomically
> > > Kernel: completes writeback
> > >
> > > The former is not at all an issue for postgres' use case, the pages in
> > > our buffer pool that are undergoing IO are locked, preventing additional
> > > IO (be it reads or writes) to those blocks.
> > >
> > > The latter would be a problem, since userspace wouldn't even know that
> > > there is still "atomic writeback" going on, afaict the only way we could
> > > avoid it would be to issue an f[data]sync(), which likely would be
> > > prohibitively expensive.
> >
> > It somewhat depends on what outcome you expect in terms of crash safety :)
> > Unless we are careful, the RWF_ATOMIC write in your latter example can end
> > up writing some bits of the data from the second write because the second
> > write may be copying data to the pages as we issue DMA from them to the
> > device.
>
> Hm. It's somewhat painful to not know when we can write in what mode again -
> with DIO that's not an issue. I guess we could use
> sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
> Although the semantics of the SFR flags aren't particularly clear, so maybe
> not?
If you used RWF_WRITETHROUGH for your writes (so you are sure IO has
already started) then sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) would
indeed be a safe way of waiting for that IO to complete (or just wait for
the write(2) syscall itself to complete if we make RWF_WRITETHROUGH wait
for IO completion as Dave suggests - but I guess writes may happen from
multiple threads, so that may not be very convenient and sync_file_range(2)
might actually be easier).
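To illustrate the pattern (RWF_WRITETHROUGH is still only a proposal, so the
define below is a placeholder used purely for the sketch; sync_file_range(2)
and SYNC_FILE_RANGE_WAIT_BEFORE are existing interfaces):

/* Sketch: issue a write with the proposed RWF_WRITETHROUGH flag, then wait
 * for the IO it started before touching the buffer or range again. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

#ifndef RWF_WRITETHROUGH
#define RWF_WRITETHROUGH 0x100  /* placeholder value, illustration only */
#endif

static int write_then_wait(int fd, const struct iovec *iov, int nr,
                           off_t off, off_t len)
{
        /* under the proposal this kicks off writeback immediately */
        if (pwritev2(fd, iov, nr, off, RWF_WRITETHROUGH) < 0)
                return -1;
        /* wait for that in-flight writeback; no new IO is started and no
         * integrity guarantee is implied */
        return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE);
}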
> > I expect this isn't really acceptable because if you crash before
> > the second write fully makes it to the disk, you will have inconsistent
> > data.
>
> The scenarios that I can think that would lead us to doing something like
> this, are when we are overwriting data without regard for the prior contents,
> e.g:
>
> An already partially filled page is filled with more rows, we write that page
> out, then all the rows are deleted, and we re-fill the page with new content
> from scratch. Write it out again. With our existing logic we treat the second
> write differently, because the entire contents of the page will be in the
> journal, as there is no prior content that we care about.
>
> A second scenario in which we might not use RWF_ATOMIC, if we carry today's
> logic forward, is if a newly created relation is bulk loaded in the same
> transaction that created the relation. If a crash were to happen while that
> bulk load is ongoing, we don't care about the contents of the file(s), as it
> will never be visible to anyone after crash recovery. In this case we won't
> have prior RWF_ATOMIC writes - but we could have the opposite, i.e. an
> RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page
> cache. Would that be an issue?
No, this should be fine. But as I'm thinking about it, what seems the most
natural is that RWF_WRITETHROUGH writes will wait on any pages under
writeback in the target range before proceeding with the write. That will
give the user proper serialization with other RWF_WRITETHROUGH writes to the
overlapping range as well as writeback from previous normal writes. So the
only case that needs handling - either by userspace or by the kernel forcing
stable writes - would be a RWF_WRITETHROUGH write followed by a normal write.
> It's possible we should just always use RWF_ATOMIC, even in the cases where
> it's not needed from our side, to avoid potential performance penalties and
> "undefined behaviour". I guess that will really depend on the performance
> penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will
> eventually be supported (as doing small writes during bulk loading is quite
> expensive).
Sure, that's a possibility as well. I guess it requires some
experimentation and benchmarking to pick a proper tradeoff.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 17:20 ` Ojaswin Mujoo
@ 2026-02-18 17:42 ` Jan Kara
2026-02-18 20:22 ` Ojaswin Mujoo
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2026-02-18 17:42 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev,
tytso, p.raghav, vi.shah
On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote:
> On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote:
> > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB
> > pages based on `io_combine_limit` (typically 128kb). So immediately writing them
> > might be ok as long as we don't remove those pages from the page cache like we do in
> > RWF_UNCACHED.
>
> Yep, and I've not looked at the code path much, but I think if we really
> care about the user not changing the data b/w write and writeback then
> we will probably need to start the writeback while holding the folio
> lock, which is currently not done in RWF_UNCACHED.
That isn't enough. submit_bio() returning isn't enough to guarantee that DMA
to the device has happened. And until it happens, modifying the pagecache
page means modifying the data the disk will get. The best is probably to
transition pages to writeback state and deal with it as with any other
requirement for stable pages.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 17:42 ` [Lsf-pc] " Jan Kara
@ 2026-02-18 20:22 ` Ojaswin Mujoo
0 siblings, 0 replies; 38+ messages in thread
From: Ojaswin Mujoo @ 2026-02-18 20:22 UTC (permalink / raw)
To: Jan Kara
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Wed, Feb 18, 2026 at 06:42:05PM +0100, Jan Kara wrote:
> On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote:
> > On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote:
> > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB
> > > pages based on `io_combine_limit` (typically 128kb). So immediately writing them
> > > might be ok as long as we don't remove those pages from the page cache like we do in
> > > RWF_UNCACHED.
> >
> > Yep, and I've not looked at the code path much, but I think if we really
> > care about the user not changing the data b/w write and writeback then
> > we will probably need to start the writeback while holding the folio
> > lock, which is currently not done in RWF_UNCACHED.
>
> That isn't enough. submit_bio() returning isn't enough to guarantee that DMA
> to the device has happened. And until it happens, modifying the pagecache
> page means modifying the data the disk will get. The best is probably to
> transition pages to writeback state and deal with it as with any other
> requirement for stable pages.
Yes, true. Looking at the code, it does seem like we would also need to
depend on the stable page mechanism to ensure nobody changes the buffers
till the IO has actually finished.
I think the right way to go would be to first start with an
implementation of RWF_WRITETHROUGH and then utilize that and stable pages
to enable RWF_ATOMIC for buffered IO.
Regards,
ojaswin
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 17:37 ` Jan Kara
@ 2026-02-18 21:04 ` Andres Freund
2026-02-19 0:32 ` Dave Chinner
1 sibling, 0 replies; 38+ messages in thread
From: Andres Freund @ 2026-02-18 21:04 UTC (permalink / raw)
To: Jan Kara
Cc: Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Hi,
On 2026-02-18 18:37:45 +0100, Jan Kara wrote:
> On Tue 17-02-26 11:13:07, Andres Freund wrote:
> > Hm. It's somewhat painful to not know when we can write in what mode again -
> > with DIO that's not an issue. I guess we could use
> > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
> > Although the semantics of the SFR flags aren't particularly clear, so maybe
> > not?
>
> If you used RWF_WRITETHROUGH for your writes (so you are sure IO has
> already started) then sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) would
> indeed be a safe way of waiting for that IO to complete (or just wait for
> the write(2) syscall itself to complete if we make RWF_WRITETHROUGH wait
> for IO completion as Dave suggests - but I guess writes may happen from
> multiple threads, so that may not be very convenient and sync_file_range(2)
> might be actually easier).
For us a synchronously blocking RWF_WRITETHROUGH would actually be easier, I
think.
The issue with writes from multiple threads actually goes the other way for us
- without knowing when the IO actually completes, our buffer pool's state
cannot reflect whether there is ongoing IO for a buffer or not. So we would
always have to do sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) before doing
further IO.
Not knowing how many writes are actually outstanding also makes it harder for
us to avoid overwhelming the storage (triggering e.g. poor commit latency).
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 6:47 ` Christoph Hellwig
@ 2026-02-18 23:42 ` Dave Chinner
0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2026-02-18 23:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andres Freund, Pankaj Raghav, Jan Kara, Ojaswin Mujoo, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 07:47:39AM +0100, Christoph Hellwig wrote:
> On Wed, Feb 18, 2026 at 12:04:43PM +1100, Dave Chinner wrote:
> > > > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense.
> > > > >
> > > >
> > > > One naive question: semantically what will be the difference between
> > > > RWF_DSYNC and RWF_WRITETHROUGH?
> >
> > None, except that RWF_DSYNC provides data integrity guarantees.
>
> Which boils down to RWF_DSYNC still writing out the inode and flushing
> the cache.
>
> > > Which
> > > wouldn't be needed for RWF_WRITETHROUGH, right?
> >
> > Correct, there shouldn't be any data integrity guarantees associated
> > with plain RWF_WRITETHROUGH.
>
> Which makes me curious if the plain RWF_WRITETHROUGH would be all
> that useful.
For modern SSDs, I think the answer is yes.
e.g. when you are doing lots of small writes to many files from many
threads, it bottlenecks on single-threaded writeback. All of the IO
is submitted by background writeback, which runs out of CPU fairly
quickly. We end up dirty throttling and topping out at ~100k random
4kB buffered write IOPS regardless of how much submitter
concurrency we have.
If we switch that to RWF_WRITETHROUGH, we now have N submitting
threads that can all work in parallel, we get pretty much zero dirty
folio backlog (so no dirty throttling and more consistent IO
latency) and throughput can scale much higher because we have IO
submitter concurrency to spread the CPU load around.
I did an fsmark test of a write-through hack a couple of years back,
creating and writing 4kB data files concurrently in a directory per
thread. With vanilla writeback, it topped out at about 80k 4kB file
creates/s from 4 threads and only went slower the more I increased
the userspace create concurrency.
Using writethrough submission, it topped out at about 400k 4kB file
creates/s from 32 threads and was largely limited in the fsmark
tasks by the CPU overhead for file creation, user data copying and
data extent space allocation.
I also did a multi-file, multi-process random 4kB write test with
fio, using files much larger than memory and long runtimes. Once the
normal background write path started dirty throttling, it ran at
about 100k 4kB write IOPS, again limited by the single-threaded writeback
flusher using all its CPU time for allocating blocks during
writeback.
Using writethrough, I saw about 900k IOPS being sustained right from
the start, largely limited by a combination of CPU usage and IO
latency in the fio task context. In comparison, the same workload
with DIO ran to the storage capability of 1.6M IOPS because it had
significantly lower CPU usage and IO latency.
I also did some kernel compile tests with writethrough for all
buffered write IO. On fast storage there was negligible
difference in performance between vanilla buffered writes and
submitter-driven blocking write-through. This result made me
question the need for caching on modern SSDs at all :)
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 17:37 ` Jan Kara
2026-02-18 21:04 ` Andres Freund
@ 2026-02-19 0:32 ` Dave Chinner
1 sibling, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2026-02-19 0:32 UTC (permalink / raw)
To: Jan Kara
Cc: Andres Freund, Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 06:37:45PM +0100, Jan Kara wrote:
> On Tue 17-02-26 11:13:07, Andres Freund wrote:
> > > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> > > > Kernel: starts writeback but doesn't complete it
> > > > P1: pwrite(fd, [any block in 1-10]), non-atomically
> > > > Kernel: completes writeback
> > > >
> > > > The former is not at all an issue for postgres' use case, the pages in
> > > > our buffer pool that are undergoing IO are locked, preventing additional
> > > > IO (be it reads or writes) to those blocks.
> > > >
> > > > The latter would be a problem, since userspace wouldn't even know that
> > > > there is still "atomic writeback" going on, afaict the only way we could
> > > > avoid it would be to issue an f[data]sync(), which likely would be
> > > > prohibitively expensive.
> > >
> > > It somewhat depends on what outcome you expect in terms of crash safety :)
> > > Unless we are careful, the RWF_ATOMIC write in your latter example can end
> > > up writing some bits of the data from the second write because the second
> > > write may be copying data to the pages as we issue DMA from them to the
> > > device.
> >
> > Hm. It's somewhat painful to not know when we can write in what mode again -
> > with DIO that's not an issue. I guess we could use
> > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
> > Although the semantics of the SFR flags aren't particularly clear, so maybe
> > not?
>
> If you used RWF_WRITETHROUGH for your writes (so you are sure IO has
> already started) then sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) would
> indeed be a safe way of waiting for that IO to complete (or just wait for
> the write(2) syscall itself to complete if we make RWF_WRITETHROUGH wait
> for IO completion as Dave suggests - but I guess writes may happen from
> multiple threads, so that may not be very convenient and sync_file_range(2)
> might be actually easier).
I would much prefer we don't have to rely on crappy interfaces like
sync_file_range() to handle RWF_WRITETHROUGH IO completion
processing. All it does is add complexity to error
handling/propagation in both the kernel code and the userspace code.
It takes something that is easy to get right (i.e. synchronous
completion) and replaces it with something that is easy to get
wrong. That's not good API design.
As for handling multiple writes to the same range, stable pages do
that for us. RWF_WRITETHROUGH will need to set folios in the
writeback state before submission and clear it after completion so
that stable pages work correctly. Hence we may as well use that
functionality to serialise overlapping RWF_WRITETHROUGH IOs against each
other and against concurrent background and data-integrity-driven writeback.
We should be trying hard to keep this simple and consistent with
existing write-through IO models that people already know how to use
(i.e. DIO).
> > > I expect this isn't really acceptable because if you crash before
> > > the second write fully makes it to the disk, you will have inconsistent
> > > data.
> >
> > The scenarios that I can think that would lead us to doing something like
> > this, are when we are overwriting data without regard for the prior contents,
> > e.g:
> >
> > An already partially filled page is filled with more rows, we write that page
> > out, then all the rows are deleted, and we re-fill the page with new content
> > from scratch. Write it out again. With our existing logic we treat the second
> > write differently, because the entire contents of the page will be in the
> > journal, as there is no prior content that we care about.
> >
> > A second scenario in which we might not use RWF_ATOMIC, if we carry today's
> > logic forward, is if a newly created relation is bulk loaded in the same
> > transaction that created the relation. If a crash were to happen while that
> > bulk load is ongoing, we don't care about the contents of the file(s), as it
> > will never be visible to anyone after crash recovery. In this case we won't
> > have prior RWF_ATOMIC writes - but we could have the opposite, i.e. an
> > RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page
> > cache. Would that be an issue?
>
> No, this should be fine. But as I'm thinking about it what seems the most
> natural is that RWF_WRITETHROUGH writes will wait on any pages under
> writeback in the target range before proceeding with the write.
I think that is required behaviour, not just natural behaviour. IMO,
concurrent overlapping physical IOs from the page cache via
RWF_WRITETHROUGH are a data corruption vector just waiting for
someone to trip over them...
i.e. we need to keep in mind that one of the guarantees that the
page cache provides is that it will never overlap multiple
concurrent physical IOs to the same physical range. Overlapping IOs
are handled and serialised at the folio level; they should never end
up with overlapping physical IO being issued.
> That will
> give user proper serialization with other RWF_WRITETHROUGH writes to the
> overlapping range as well as writeback from previous normal writes. So the
> only case that needs handling - either by userspace or kernel forcing
> stable writes - would be RWF_WRITETHROUGH write followed by a normal write.
*nod*. I think forcing stable writes for RWF_WRITETHROUGH is the
right way to go. We are going to need stable write semantics for
RWF_ATOMIC support, and we probably should have them for RWF_DSYNC
as well because the data integrity guarantees cover the data in that
specific user IO, not any other previous, concurrent or future user
IO.
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav
` (2 preceding siblings ...)
2026-02-17 5:51 ` Christoph Hellwig
@ 2026-02-20 10:08 ` Pankaj Raghav (Samsung)
2026-02-20 15:10 ` Christoph Hellwig
3 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2026-02-20 10:08 UTC (permalink / raw)
To: linux-xfs, linux-mm, linux-fsdevel, lsf-pc
Cc: Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> Hi all,
>
> Atomic (untorn) writes for Direct I/O have successfully landed in kernel
> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
> remains a contentious topic, with previous discussions often stalling due to
> concerns about complexity versus utility.
>
Hi,
Thanks a lot everyone for the input on this topic. I would like to
summarize some of the important points discussed here so that they can
be used as a reference for the talk and RFCs going forward:
- There is a general consensus to add atomic support to the buffered IO
path.
- The first step is to add support for RWF_WRITETHROUGH, as initially
proposed by Dave Chinner.
Semantics of RWF_WRITETHROUGH (based on my understanding):
* Immediate Writeback Initiation: When RWF_WRITETHROUGH is used with a
buffered write, the kernel will immediately initiate the writeback of
the data to storage. We use the page cache to serialize overlapping
writes:
Folio Lock -> Copy data to the page cache -> Initiate and complete writeback -> Unlock folio
* Synchronous I/O Behavior: The I/O operation will behave synchronously
from the application's perspective. This means the system call
will block until the write operation has completed on the device.
Any I/O errors will be reported directly to the caller. (Similar to Direct I/O)
* No Inherent Data Integrity Guarantees: Unlike RWF_DSYNC,
RWF_WRITETHROUGH itself does not inherently guarantee that the data
has reached non-volatile storage.
- Once the writethrough infrastructure is in place, we can layer atomic
support onto the buffered IO path. But atomic writes will require more
guarantees, such as no short copies, stable pages during writeback,
etc. (see the sketch below).
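For illustration, the user-visible shape of the above could be as simple as
the sketch below. Neither RWF_WRITETHROUGH nor buffered RWF_ATOMIC exists in
a released kernel, so the flag define is a placeholder and RWF_ATOMIC needs
headers new enough to define it:

/* Illustration only: RWF_WRITETHROUGH does not exist yet and RWF_ATOMIC is
 * currently only honoured for direct IO. Under the semantics summarised
 * above, the call blocks until the writeback it started has completed and
 * reports IO errors directly. */
#define _GNU_SOURCE
#include <sys/uio.h>

#ifndef RWF_WRITETHROUGH
#define RWF_WRITETHROUGH 0x100  /* placeholder value, illustration only */
#endif

static ssize_t atomic_buffered_write(int fd, const void *buf, size_t len,
                                     off_t off)
{
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

        /* len and off must obey the same alignment rules that statx()
         * reports for atomic (direct IO) writes today */
        return pwritev2(fd, &iov, 1, off, RWF_ATOMIC | RWF_WRITETHROUGH);
}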
Feel free to add/correct the above points.
--
Pankaj
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-20 10:08 ` Pankaj Raghav (Samsung)
@ 2026-02-20 15:10 ` Christoph Hellwig
0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2026-02-20 15:10 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund,
djwong, john.g.garry, willy, hch, ritesh.list, jack, ojaswin,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Fri, Feb 20, 2026 at 10:08:26AM +0000, Pankaj Raghav (Samsung) wrote:
> On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
> > Hi all,
> >
> > Atomic (untorn) writes for Direct I/O have successfully landed in kernel
> > for ext4 and XFS[1][2]. However, extending this support to Buffered I/O
> > remains a contentious topic, with previous discussions often stalling due to
> > concerns about complexity versus utility.
> >
>
> Hi,
>
> Thanks a lot everyone for the input on this topic. I would like to
> summarize some of the important points discussed here so that it could
> be used as a reference for the talk and RFCs going forward:
>
> - There is a general consensus to add atomic support to buffered IO
> path.
I don't think that's quite true.
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2026-02-20 15:10 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav
2026-02-13 13:32 ` Ojaswin Mujoo
2026-02-16 9:52 ` Pankaj Raghav
2026-02-16 15:45 ` Andres Freund
2026-02-17 12:06 ` Jan Kara
2026-02-17 12:42 ` Pankaj Raghav
2026-02-17 16:21 ` Andres Freund
2026-02-18 1:04 ` Dave Chinner
2026-02-18 6:47 ` Christoph Hellwig
2026-02-18 23:42 ` Dave Chinner
2026-02-17 16:13 ` Andres Freund
2026-02-17 18:27 ` Ojaswin Mujoo
2026-02-17 18:42 ` Andres Freund
2026-02-18 17:37 ` Jan Kara
2026-02-18 21:04 ` Andres Freund
2026-02-19 0:32 ` Dave Chinner
2026-02-17 18:33 ` Ojaswin Mujoo
2026-02-17 17:20 ` Ojaswin Mujoo
2026-02-18 17:42 ` [Lsf-pc] " Jan Kara
2026-02-18 20:22 ` Ojaswin Mujoo
2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
2026-02-17 18:36 ` Ojaswin Mujoo
2026-02-16 15:57 ` Andres Freund
2026-02-17 18:39 ` Ojaswin Mujoo
2026-02-18 0:26 ` Dave Chinner
2026-02-18 6:49 ` Christoph Hellwig
2026-02-18 12:54 ` Ojaswin Mujoo
2026-02-15 9:01 ` Amir Goldstein
2026-02-17 5:51 ` Christoph Hellwig
2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein
2026-02-17 15:47 ` Andres Freund
2026-02-17 22:45 ` Dave Chinner
2026-02-18 4:10 ` Andres Freund
2026-02-18 6:53 ` Christoph Hellwig
2026-02-18 6:51 ` Christoph Hellwig
2026-02-20 10:08 ` Pankaj Raghav (Samsung)
2026-02-20 15:10 ` Christoph Hellwig