* [LSF/MM/BPF TOPIC] Buffered atomic writes
From: Pankaj Raghav @ 2026-02-13 10:20 UTC
To: linux-xfs, linux-mm, linux-fsdevel, lsf-pc
Cc: Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
Hi all,
Atomic (untorn) writes for Direct I/O have landed in the kernel for
ext4 and XFS[1][2]. However, extending this support to Buffered I/O
remains a contentious topic, with previous discussions often stalling
due to concerns about complexity versus utility.
I would like to propose a session to discuss the concrete use cases for
buffered atomic writes and, if possible, the outstanding architectural
blockers holding up the current RFCs[3][4].
## Use Case:
A recurring objection to buffered atomics is the lack of a convincing
use case, with the argument that databases should simply migrate to
direct I/O. We have been working with PostgreSQL developer Andres
Freund, who has highlighted a specific architectural requirement where
buffered I/O remains preferable in certain scenarios.
While Postgres recently started to support direct I/O, optimal
performance requires a large, statically configured user-space buffer
pool. This becomes problematic when running many Postgres instances on
the same hardware, a common deployment scenario. Statically partitioning
RAM for direct I/O caches across many instances is inefficient compared
to allowing the kernel page cache to dynamically balance memory pressure
between instances.
The other use case is running Postgres as part of a larger workload on
a single instance. Dedicating enough memory to Postgres' buffer pool to
make DIO viable is often not realistic, because some deployments need a
lot of memory to cache database I/O, while others need a lot of memory
for non-database caching.
Enabling atomic writes for this buffered workload would allow Postgres
to disable full-page writes [5]. For direct I/O, this has been shown to
reduce transaction variability; for buffered I/O, we expect similar
gains, alongside decreased WAL bandwidth and storage costs for WAL
archival. As a side note, for most workloads full-page writes occupy a
significant portion of the WAL volume.
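To make the interface under discussion concrete, here is a minimal,
hedged sketch of what a database-style untorn write looks like with the
existing RWF_ATOMIC API (statx limits plus pwritev2). Today this only
works for direct I/O on ext4/XFS (Linux 6.11+ uapi headers assumed);
the proposal is for the same call to be honoured on a plain buffered
fd, so the open() below deliberately omits O_DIRECT and the pwritev2()
is expected to fail on current mainline kernels.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #define DB_PAGE_SIZE 8192       /* PostgreSQL's default block size */

  int main(int argc, char **argv)
  {
          if (argc != 2) {
                  fprintf(stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
          }

          int fd = open(argv[1], O_RDWR);         /* buffered: no O_DIRECT */
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          /* Discover the untorn write limits the fs/device advertises. */
          struct statx stx;
          if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) < 0) {
                  perror("statx");
                  return 1;
          }
          if (stx.stx_atomic_write_unit_max < DB_PAGE_SIZE) {
                  fprintf(stderr, "no 8K untorn write support here\n");
                  return 1;
          }

          /* One aligned 8K database page, submitted untorn in one call. */
          void *page;
          if (posix_memalign(&page, DB_PAGE_SIZE, DB_PAGE_SIZE))
                  return 1;
          memset(page, 0xab, DB_PAGE_SIZE);

          struct iovec iov = { .iov_base = page, .iov_len = DB_PAGE_SIZE };
          if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != DB_PAGE_SIZE)
                  perror("pwritev2(RWF_ATOMIC)");

          free(page);
          close(fd);
          return 0;
  }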
Andres has agreed to attend LSFMM this year to discuss these requirements.
## Discussion:
We currently have RFCs posted by John Garry and Ojaswin Mujoo, and
there was a previous LSFMM proposal on untorn buffered writes from Ted
Ts'o. Based on the conversations and blockers raised so far, the
discussion at LSFMM should focus on the following issues:
- Handling Short Writes under Memory Pressure[6]: A buffered atomic
write might span page boundaries. If memory pressure causes a page
fault or reclaim mid-copy, the write could be torn inside the page
cache before it even reaches the filesystem.
- The current RFC uses a "pinning" approach: pinning user pages and
creating a BVEC to ensure the full copy can proceed atomically.
This adds complexity to the write path.
- Discussion: Is this acceptable? Should we consider alternatives,
such as requiring userspace to mlock the I/O buffers before
issuing the write to guarantee an atomic copy into the page cache?
(A minimal sketch of that alternative follows this list.)
- Page Cache Model vs. Filesystem CoW: The current RFC introduces a
PG_atomic page flag to track dirty pages requiring atomic writeback.
This faced pushback due to page flags being a scarce resource[7].
Furthermore, it was argued that the atomic model does not fit the
buffered I/O model, because data sitting in the page cache is vulnerable
to modification before writeback occurs, and writeback does not preserve
application ordering[8].
- Dave Chinner has proposed leveraging the filesystem's CoW path
where we always allocate new blocks for the atomic write (forced
CoW). If the hardware supports it (e.g., NVMe atomic limits), the
filesystem can optimize the writeback to use REQ_ATOMIC in place,
avoiding the CoW overhead while maintaining the architectural
separation.
- Discussion: While the CoW approach fits XFS and other CoW
filesystems well, it presents challenges for filesystems like ext4,
which lack CoW capabilities for data. Should this be a
filesystem-specific feature?
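As promised above, a minimal sketch of the mlock alternative, seen from
the application side. This is an assumption-laden illustration, not a
settled interface: it presumes RWF_ATOMIC is accepted for buffered
writes and that the kernel would rely on the source buffer being
resident so the copy into the page cache cannot fault mid-way. The
helper name is made up for illustration.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/uio.h>
  #include <unistd.h>

  /* 'buf' is an aligned, atomic-unit-sized buffer (e.g. one 8K DB page) */
  static ssize_t atomic_buffered_write(int fd, void *buf, size_t len, off_t off)
  {
          struct iovec iov = { .iov_base = buf, .iov_len = len };
          ssize_t ret;

          /* keep the source buffer resident so the kernel's copy into
           * the page cache cannot take a page fault part-way through */
          if (mlock(buf, len))
                  return -1;

          ret = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

          /* a long-lived buffer pool (or one backed by explicit huge
           * pages) would simply stay locked instead */
          munlock(buf, len);
          return ret;
  }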
Comments or Curses, all are welcome.
--
Pankaj
[1] https://lwn.net/Articles/1009298/
[2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html
[3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/
[4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
[5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
[6] https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/
[7] https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/
[8] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav @ 2026-02-13 13:32 ` Ojaswin Mujoo 2026-02-16 9:52 ` Pankaj Raghav 2026-02-16 11:38 ` Jan Kara 2026-02-15 9:01 ` Amir Goldstein ` (2 subsequent siblings) 3 siblings, 2 replies; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-13 13:32 UTC (permalink / raw) To: Pankaj Raghav Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > Hi all, > > Atomic (untorn) writes for Direct I/O have successfully landed in kernel > for ext4 and XFS[1][2]. However, extending this support to Buffered I/O > remains a contentious topic, with previous discussions often stalling due to > concerns about complexity versus utility. > > I would like to propose a session to discuss the concrete use cases for > buffered atomic writes and if possible, talk about the outstanding > architectural blockers blocking the current RFCs[3][4]. Hi Pankaj, Thanks for the proposal and glad to hear there is a wider interest in this topic. We have also been actively working on this and I in middle of testing and ironing out bugs in my RFC v2 for buffered atomic writes, which is largely based on Dave's suggestions to maintain atomic write mappings in FS layer (aka XFS COW fork). Infact I was going to propose a discussion on this myself :) > > ## Use Case: > > A recurring objection to buffered atomics is the lack of a convincing use > case, with the argument that databases should simply migrate to direct I/O. > We have been working with PostgreSQL developer Andres Freund, who has > highlighted a specific architectural requirement where buffered I/O remains > preferable in certain scenarios. Looks like you have some nice insights to cover from postgres side which filesystem community has been asking for. As I've also been working on the kernel implementation side of it, do you think we could do a joint session on this topic? > > While Postgres recently started to support direct I/O, optimal performance > requires a large, statically configured user-space buffer pool. This becomes > problematic when running many Postgres instances on the same hardware, a > common deployment scenario. Statically partitioning RAM for direct I/O > caches across many instances is inefficient compared to allowing the kernel > page cache to dynamically balance memory pressure between instances. > > The other use case is using postgres as part of a larger workload on one > instance. Using up enough memory for postgres' buffer pool to make DIO use > viable is often not realistic, because some deployments require a lot of > memory to cache database IO, while others need a lot of memory for > non-database caching. > > Enabling atomic writes for this buffered workload would allow Postgres to > disable full-page writes [5]. For direct I/O, this has shown to reduce > transaction variability; for buffered I/O, we expect similar gains, > alongside decreased WAL bandwidth and storage costs for WAL archival. As a > side note, for most workloads full page writes occupy a significant portion > of WAL volume. > > Andres has agreed to attend LSFMM this year to discuss these requirements. Glad to hear people from postgres would also be joining! 
> > ## Discussion: > > We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there > was a previous LSFMM proposal about untorn buffered writes from Ted Tso. > Based on the conversation/blockers we had before, the discussion at LSFMM > should focus on the following blocking issues: > > - Handling Short Writes under Memory Pressure[6]: A buffered atomic > write might span page boundaries. If memory pressure causes a page > fault or reclaim mid-copy, the write could be torn inside the page > cache before it even reaches the filesystem. > - The current RFC uses a "pinning" approach: pinning user pages and > creating a BVEC to ensure the full copy can proceed atomically. > This adds complexity to the write path. > - Discussion: Is this acceptable? Should we consider alternatives, > such as requiring userspace to mlock the I/O buffers before > issuing the write to guarantee atomic copy in the page cache? Right, I chose this approach because we only get to know about the short copy after it has actually happened in copy_folio_from_iter_atomic() and it seemed simpler to just not let the short copy happen. This is inspired from how dio pins the pages for DMA, just that we do it for a shorter time. It does add slight complexity to the path but I'm not sure if it's complex enough to justify adding a hard requirement of having pages mlock'd. > > - Page Cache Model vs. Filesystem CoW: The current RFC introduces a > PG_atomic page flag to track dirty pages requiring atomic writeback. > This faced pushback due to page flags being a scarce resource[7]. > Furthermore, it was argued that atomic model does not fit the buffered > I/O model because data sitting in the page cache is vulnerable to > modification before writeback occurs, and writeback does not preserve > application ordering[8]. > - Dave Chinner has proposed leveraging the filesystem's CoW path > where we always allocate new blocks for the atomic write (forced > CoW). If the hardware supports it (e.g., NVMe atomic limits), the > filesystem can optimize the writeback to use REQ_ATOMIC in place, > avoiding the CoW overhead while maintaining the architectural > separation. Right, this is what I'm doing in the new RFC where we maintain the mappings for atomic write in COW fork. This way we are able to utilize a lot of existing infrastructure, however it does add some complexity to ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe it is a tradeoff since the general consesus was mostly to avoid adding too much complexity to iomap layer. Another thing that came up is to consider using write through semantics for buffered atomic writes, where we are able to transition page to writeback state immediately after the write and avoid any other users to modify the data till writeback completes. This might affect performance since we won't be able to batch similar atomic IOs but maybe applications like postgres would not mind this too much. If we go with this approach, we will be able to avoid worrying too much about other users changing atomic data underneath us. An argument against this however is that it is user's responsibility to not do non atomic IO over an atomic range and this shall be considered a userspace usage error. This is similar to how there are ways users can tear a dio if they perform overlapping writes. [1]. That being said, I think these points are worth discussing and it would be helpful to have people from postgres around while discussing these semantics with the FS community members. 
As for ordering of writes, I'm not sure if that is something that we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly been the task of userspace via fsync() and friends. [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > - Discussion: While the CoW approach fits XFS and other CoW > filesystems well, it presents challenges for filesystems like ext4 > which lack CoW capabilities for data. Should this be a filesystem > specific feature? I believe your question is if we should have a hard dependency on COW mappings for atomic writes. Currently, COW in atomic write context in XFS, is used for these 2 things: 1. COW fork holds atomic write ranges. This is not strictly a COW feature, just that we are repurposing the COW fork to hold our atomic ranges. Basically a way for writeback path to know that atomic write was done here. COW fork is one way to do this but I believe every FS has a version of in memory extent trees where such ephemeral atomic write mappings can be held. The extent status cache is ext4's version of this, and can be used to manage the atomic write ranges. There is an alternate suggestion that came up from discussions with Ted and Darrick that we can instead use a generic side-car structure which holds atomic write ranges. FSes can populate these during atomic writes and query these in their writeback paths. This means for any FS operation (think truncate, falloc, mwrite, write ...) we would need to keep this structure in sync, which can become pretty complex pretty fast. I'm yet to implement this so not sure how it would look in practice though. 2. COW feature as a whole enables software based atomic writes. This is something that ext4 won't be able to support (right now), just like how we don't support software writes for dio. I believe Baokun and Yi and working on a feature that can eventually enable COW writes in ext4 [2]. Till we have something like that, we would have to rely on hardware support. Regardless, I don't think the ability to support or not support software atomic writes largely depends on the filesystem so I'm not sure how we can lift this up to a generic layer anyways. [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/ Thanks, Ojaswin > > Comments or Curses, all are welcome. > > -- > Pankaj > > [1] https://lwn.net/Articles/1009298/ > [2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html > [3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/ > [4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com > [5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES > [6] > https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/ > [7] > https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/ > [8] > https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/ > ^ permalink raw reply [flat|nested] 38+ messages in thread
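For readers unfamiliar with the "side-car" idea mentioned above, here
is a purely hypothetical sketch of the kind of per-inode bookkeeping it
implies: record byte ranges at atomic write time and query them at
writeback time. None of these names exist in the kernel; real code
would need locking, range merging, and hooks in the truncate, fallocate
and mmap paths to stay in sync, which is exactly the complexity being
worried about here.

  #include <stdbool.h>
  #include <stdlib.h>

  /* one byte range written via a buffered RWF_ATOMIC write */
  struct atomic_range {
          unsigned long long start;       /* inclusive */
          unsigned long long end;         /* exclusive */
          struct atomic_range *next;
  };

  /* would hang off the inode/address_space; locking omitted */
  struct atomic_range_tracker {
          struct atomic_range *head;
  };

  /* record a range at atomic write time (no merging, for brevity) */
  static bool track_atomic_range(struct atomic_range_tracker *t,
                                 unsigned long long start,
                                 unsigned long long end)
  {
          struct atomic_range *r = malloc(sizeof(*r));

          if (!r)
                  return false;
          r->start = start;
          r->end = end;
          r->next = t->head;
          t->head = r;
          return true;
  }

  /* writeback-side query: must [start, end) be submitted untorn? */
  static bool range_is_atomic(struct atomic_range_tracker *t,
                              unsigned long long start,
                              unsigned long long end)
  {
          for (struct atomic_range *r = t->head; r; r = r->next)
                  if (start < r->end && r->start < end)
                          return true;
          return false;
  }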
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 13:32 ` Ojaswin Mujoo @ 2026-02-16 9:52 ` Pankaj Raghav 2026-02-16 15:45 ` Andres Freund 2026-02-17 17:20 ` Ojaswin Mujoo 2026-02-16 11:38 ` Jan Kara 1 sibling, 2 replies; 38+ messages in thread From: Pankaj Raghav @ 2026-02-16 9:52 UTC (permalink / raw) To: Ojaswin Mujoo Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On 2/13/26 14:32, Ojaswin Mujoo wrote: > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: >> Hi all, >> >> Atomic (untorn) writes for Direct I/O have successfully landed in kernel >> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O >> remains a contentious topic, with previous discussions often stalling due to >> concerns about complexity versus utility. >> >> I would like to propose a session to discuss the concrete use cases for >> buffered atomic writes and if possible, talk about the outstanding >> architectural blockers blocking the current RFCs[3][4]. > > Hi Pankaj, > > Thanks for the proposal and glad to hear there is a wider interest in > this topic. We have also been actively working on this and I in middle > of testing and ironing out bugs in my RFC v2 for buffered atomic > writes, which is largely based on Dave's suggestions to maintain atomic > write mappings in FS layer (aka XFS COW fork). Infact I was going to > propose a discussion on this myself :) > Perfect. >> >> ## Use Case: >> >> A recurring objection to buffered atomics is the lack of a convincing use >> case, with the argument that databases should simply migrate to direct I/O. >> We have been working with PostgreSQL developer Andres Freund, who has >> highlighted a specific architectural requirement where buffered I/O remains >> preferable in certain scenarios. > > Looks like you have some nice insights to cover from postgres side which > filesystem community has been asking for. As I've also been working on > the kernel implementation side of it, do you think we could do a joint > session on this topic? > As one of the main pushback for this feature has been a valid usecase, the main outcome I would like to get out of this session is a community consensus on the use case for this feature. It looks like you already made quite a bit of progress with the CoW impl, so it would be great to if it can be a joint session. >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso. >> Based on the conversation/blockers we had before, the discussion at LSFMM >> should focus on the following blocking issues: >> >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic >> write might span page boundaries. If memory pressure causes a page >> fault or reclaim mid-copy, the write could be torn inside the page >> cache before it even reaches the filesystem. >> - The current RFC uses a "pinning" approach: pinning user pages and >> creating a BVEC to ensure the full copy can proceed atomically. >> This adds complexity to the write path. >> - Discussion: Is this acceptable? Should we consider alternatives, >> such as requiring userspace to mlock the I/O buffers before >> issuing the write to guarantee atomic copy in the page cache? 
> > Right, I chose this approach because we only get to know about the short > copy after it has actually happened in copy_folio_from_iter_atomic() > and it seemed simpler to just not let the short copy happen. This is > inspired from how dio pins the pages for DMA, just that we do it > for a shorter time. > > It does add slight complexity to the path but I'm not sure if it's complex > enough to justify adding a hard requirement of having pages mlock'd. > As databases like postgres have a buffer cache that they manage in userspace, which is eventually used to do IO, I am wondering if they already do a mlock or some other way to guarantee the buffer cache does not get reclaimed. That is why I was thinking if we could make it a requirement. Of course, that also requires checking if the range is mlocked in the iomap_write_iter path. >> >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a >> PG_atomic page flag to track dirty pages requiring atomic writeback. >> This faced pushback due to page flags being a scarce resource[7]. >> Furthermore, it was argued that atomic model does not fit the buffered >> I/O model because data sitting in the page cache is vulnerable to >> modification before writeback occurs, and writeback does not preserve >> application ordering[8]. >> - Dave Chinner has proposed leveraging the filesystem's CoW path >> where we always allocate new blocks for the atomic write (forced >> CoW). If the hardware supports it (e.g., NVMe atomic limits), the >> filesystem can optimize the writeback to use REQ_ATOMIC in place, >> avoiding the CoW overhead while maintaining the architectural >> separation. > > Right, this is what I'm doing in the new RFC where we maintain the > mappings for atomic write in COW fork. This way we are able to utilize a > lot of existing infrastructure, however it does add some complexity to > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe > it is a tradeoff since the general consesus was mostly to avoid adding > too much complexity to iomap layer. > > Another thing that came up is to consider using write through semantics > for buffered atomic writes, where we are able to transition page to > writeback state immediately after the write and avoid any other users to > modify the data till writeback completes. This might affect performance > since we won't be able to batch similar atomic IOs but maybe > applications like postgres would not mind this too much. If we go with > this approach, we will be able to avoid worrying too much about other > users changing atomic data underneath us. > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB pages based on `io_combine_limit` (typically 128kb). So immediately writing them might be ok as long as we don't remove those pages from the page cache like we do in RWF_UNCACHED. > An argument against this however is that it is user's responsibility to > not do non atomic IO over an atomic range and this shall be considered a > userspace usage error. This is similar to how there are ways users can > tear a dio if they perform overlapping writes. [1]. > > That being said, I think these points are worth discussing and it would > be helpful to have people from postgres around while discussing these > semantics with the FS community members. > > As for ordering of writes, I'm not sure if that is something that > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly > been the task of userspace via fsync() and friends. > Agreed. 
> > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > >> - Discussion: While the CoW approach fits XFS and other CoW >> filesystems well, it presents challenges for filesystems like ext4 >> which lack CoW capabilities for data. Should this be a filesystem >> specific feature? > > I believe your question is if we should have a hard dependency on COW > mappings for atomic writes. Currently, COW in atomic write context in > XFS, is used for these 2 things: > > 1. COW fork holds atomic write ranges. > > This is not strictly a COW feature, just that we are repurposing the COW > fork to hold our atomic ranges. Basically a way for writeback path to > know that atomic write was done here. > > COW fork is one way to do this but I believe every FS has a version of > in memory extent trees where such ephemeral atomic write mappings can be > held. The extent status cache is ext4's version of this, and can be used > to manage the atomic write ranges. > > There is an alternate suggestion that came up from discussions with Ted > and Darrick that we can instead use a generic side-car structure which > holds atomic write ranges. FSes can populate these during atomic writes > and query these in their writeback paths. > > This means for any FS operation (think truncate, falloc, mwrite, write > ...) we would need to keep this structure in sync, which can become pretty > complex pretty fast. I'm yet to implement this so not sure how it would > look in practice though. > > 2. COW feature as a whole enables software based atomic writes. > > This is something that ext4 won't be able to support (right now), just > like how we don't support software writes for dio. > > I believe Baokun and Yi and working on a feature that can eventually > enable COW writes in ext4 [2]. Till we have something like that, we > would have to rely on hardware support. > > Regardless, I don't think the ability to support or not support > software atomic writes largely depends on the filesystem so I'm not > sure how we can lift this up to a generic layer anyways. > > [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/ > Thanks for the explanation. I am also planning to take a shot at the CoW approach. I would be more than happy to review and test if you send a RFC in the meantime. -- Pankaj ^ permalink raw reply [flat|nested] 38+ messages in thread
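For context on the io_combine_limit batching mentioned above, this is
roughly what the combining looks like from userspace (illustrative
code, not actual PostgreSQL source): consecutive dirty buffers are
gathered into one vectored write of up to ~128kB. How such combined
writes would interact with a per-page untorn guarantee is one of the
things worth pinning down in the session.

  #include <sys/uio.h>

  #define DB_PAGE_SIZE     8192
  #define IO_COMBINE_LIMIT (128 * 1024)   /* typical default */
  #define MAX_COMBINE      (IO_COMBINE_LIMIT / DB_PAGE_SIZE)

  /* write up to MAX_COMBINE consecutive dirty pages starting at 'blkno' */
  static ssize_t write_combined(int fd, char **pages, int npages, off_t blkno)
  {
          struct iovec iov[MAX_COMBINE];
          int n = npages < MAX_COMBINE ? npages : MAX_COMBINE;

          for (int i = 0; i < n; i++) {
                  iov[i].iov_base = pages[i];
                  iov[i].iov_len  = DB_PAGE_SIZE;
          }
          /* one syscall covering up to io_combine_limit of consecutive blocks */
          return pwritev(fd, iov, n, blkno * DB_PAGE_SIZE);
  }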
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 9:52 ` Pankaj Raghav @ 2026-02-16 15:45 ` Andres Freund 2026-02-17 12:06 ` Jan Kara 2026-02-17 18:33 ` Ojaswin Mujoo 2026-02-17 17:20 ` Ojaswin Mujoo 1 sibling, 2 replies; 38+ messages in thread From: Andres Freund @ 2026-02-16 15:45 UTC (permalink / raw) To: Pankaj Raghav Cc: Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-16 10:52:35 +0100, Pankaj Raghav wrote: > On 2/13/26 14:32, Ojaswin Mujoo wrote: > > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there > >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso. > >> Based on the conversation/blockers we had before, the discussion at LSFMM > >> should focus on the following blocking issues: > >> > >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic > >> write might span page boundaries. If memory pressure causes a page > >> fault or reclaim mid-copy, the write could be torn inside the page > >> cache before it even reaches the filesystem. > >> - The current RFC uses a "pinning" approach: pinning user pages and > >> creating a BVEC to ensure the full copy can proceed atomically. > >> This adds complexity to the write path. > >> - Discussion: Is this acceptable? Should we consider alternatives, > >> such as requiring userspace to mlock the I/O buffers before > >> issuing the write to guarantee atomic copy in the page cache? > > > > Right, I chose this approach because we only get to know about the short > > copy after it has actually happened in copy_folio_from_iter_atomic() > > and it seemed simpler to just not let the short copy happen. This is > > inspired from how dio pins the pages for DMA, just that we do it > > for a shorter time. > > > > It does add slight complexity to the path but I'm not sure if it's complex > > enough to justify adding a hard requirement of having pages mlock'd. > > > > As databases like postgres have a buffer cache that they manage in userspace, > which is eventually used to do IO, I am wondering if they already do a mlock > or some other way to guarantee the buffer cache does not get reclaimed. That is > why I was thinking if we could make it a requirement. Of course, that also requires > checking if the range is mlocked in the iomap_write_iter path. We don't generally mlock our buffer pool - but we strongly recommend to use explicit huge pages (due to TLB pressure, faster fork() and less memory wasted on page tables), which afaict has basically the same effect. However, that doesn't make the page cache pages locked... > >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a > >> PG_atomic page flag to track dirty pages requiring atomic writeback. > >> This faced pushback due to page flags being a scarce resource[7]. > >> Furthermore, it was argued that atomic model does not fit the buffered > >> I/O model because data sitting in the page cache is vulnerable to > >> modification before writeback occurs, and writeback does not preserve > >> application ordering[8]. > >> - Dave Chinner has proposed leveraging the filesystem's CoW path > >> where we always allocate new blocks for the atomic write (forced > >> CoW). 
If the hardware supports it (e.g., NVMe atomic limits), the > >> filesystem can optimize the writeback to use REQ_ATOMIC in place, > >> avoiding the CoW overhead while maintaining the architectural > >> separation. > > > > Right, this is what I'm doing in the new RFC where we maintain the > > mappings for atomic write in COW fork. This way we are able to utilize a > > lot of existing infrastructure, however it does add some complexity to > > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe > > it is a tradeoff since the general consesus was mostly to avoid adding > > too much complexity to iomap layer. > > > > Another thing that came up is to consider using write through semantics > > for buffered atomic writes, where we are able to transition page to > > writeback state immediately after the write and avoid any other users to > > modify the data till writeback completes. This might affect performance > > since we won't be able to batch similar atomic IOs but maybe > > applications like postgres would not mind this too much. If we go with > > this approach, we will be able to avoid worrying too much about other > > users changing atomic data underneath us. > > > > Hmm, IIUC, postgres will write their dirty buffer cache by combining > multiple DB pages based on `io_combine_limit` (typically 128kb). We will try to do that, but it's obviously far from always possible, in some workloads [parts of ]the data in the buffer pool rarely will be dirtied in consecutive blocks. FWIW, postgres already tries to force some just-written pages into writeback. For sources of writes that can be plentiful and are done in the background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE), after 256kB-512kB of writes, as otherwise foreground latency can be significantly impacted by the kernel deciding to suddenly write back (due to dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise the fsyncs at the end of a checkpoint can be unpredictably slow. For foreground writes we do not default to that, as there are users that won't (because they don't know, because they overcommit hardware, ...) size postgres' buffer pool to be big enough and thus will often re-dirty pages that have already recently been written out to the operating systems. But for many workloads it's recommened that users turn on sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*). So for many workloads it'd be fine to just always start writeback for atomic writes immediately. It's possible, but I am not at all sure, that for most of the other workloads, the gains from atomic writes will outstrip the cost of more frequently writing data back. (*) As it turns out, it often seems to improves write throughput as well, if writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, linux seems to often trigger a lot more small random IO. > So immediately writing them might be ok as long as we don't remove those > pages from the page cache like we do in RWF_UNCACHED. Yes, it might. I actually often have wished for something like a RWF_WRITEBACK flag... > > An argument against this however is that it is user's responsibility to > > not do non atomic IO over an atomic range and this shall be considered a > > userspace usage error. This is similar to how there are ways users can > > tear a dio if they perform overlapping writes. [1]. Hm, the scope of the prohibition here is not clear to me. 
Would it just be forbidden to do: P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC) P2: pwrite(fd, [any block in 1-10]), non-atomically P1: complete pwritev(fd, ...) or is it also forbidden to do: P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes Kernel: starts writeback but doesn't complete it P1: pwrite(fd, [any block in 1-10]), non-atomically Kernel: completes writeback The former is not at all an issue for postgres' use case, the pages in our buffer pool that are undergoing IO are locked, preventing additional IO (be it reads or writes) to those blocks. The latter would be a problem, since userspace wouldn't even know that here is still "atomic writeback" going on, afaict the only way we could avoid it would be to issue an f[data]sync(), which likely would be prohibitively expensive. > > That being said, I think these points are worth discussing and it would > > be helpful to have people from postgres around while discussing these > > semantics with the FS community members. > > > > As for ordering of writes, I'm not sure if that is something that > > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly > > been the task of userspace via fsync() and friends. > > > > Agreed. From postgres' side that's fine. In the cases we care about ordering we use fsync() already. > > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > > > >> - Discussion: While the CoW approach fits XFS and other CoW > >> filesystems well, it presents challenges for filesystems like ext4 > >> which lack CoW capabilities for data. Should this be a filesystem > >> specific feature? > > > > I believe your question is if we should have a hard dependency on COW > > mappings for atomic writes. Currently, COW in atomic write context in > > XFS, is used for these 2 things: > > > > 1. COW fork holds atomic write ranges. > > > > This is not strictly a COW feature, just that we are repurposing the COW > > fork to hold our atomic ranges. Basically a way for writeback path to > > know that atomic write was done here. Does that mean buffered atomic writes would cause fragmentation? Some common database workloads, e.g. anything running on cheaper cloud storage, are pretty sensitive to that due to the increase in use of the metered IOPS. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
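For reference, the sync_file_range(SYNC_FILE_RANGE_WRITE) pattern
Andres describes above looks roughly like the following (constants and
names are illustrative, not PostgreSQL's actual backend_flush_after
plumbing): writeback of a dirty range is started asynchronously once a
few hundred kB have accumulated, with no durability guarantee.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/types.h>

  #define FLUSH_AFTER (256 * 1024)        /* start writeback every ~256kB */

  /* single-file, contiguous-write tracking, kept trivial for illustration */
  static off_t pending_start = -1;
  static size_t pending_len;

  static void maybe_start_writeback(int fd, off_t off, size_t len)
  {
          if (pending_start < 0)
                  pending_start = off;
          pending_len += len;

          if (pending_len >= FLUSH_AFTER) {
                  /* kick off async writeback of the accumulated range;
                   * this is not a durability barrier (no cache flush) */
                  sync_file_range(fd, pending_start, pending_len,
                                  SYNC_FILE_RANGE_WRITE);
                  pending_start = -1;
                  pending_len = 0;
          }
  }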
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 15:45 ` Andres Freund @ 2026-02-17 12:06 ` Jan Kara 2026-02-17 12:42 ` Pankaj Raghav 2026-02-17 16:13 ` Andres Freund 2026-02-17 18:33 ` Ojaswin Mujoo 1 sibling, 2 replies; 38+ messages in thread From: Jan Kara @ 2026-02-17 12:06 UTC (permalink / raw) To: Andres Freund Cc: Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon 16-02-26 10:45:40, Andres Freund wrote: > > Hmm, IIUC, postgres will write their dirty buffer cache by combining > > multiple DB pages based on `io_combine_limit` (typically 128kb). > > We will try to do that, but it's obviously far from always possible, in some > workloads [parts of ]the data in the buffer pool rarely will be dirtied in > consecutive blocks. > > FWIW, postgres already tries to force some just-written pages into > writeback. For sources of writes that can be plentiful and are done in the > background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE), > after 256kB-512kB of writes, as otherwise foreground latency can be > significantly impacted by the kernel deciding to suddenly write back (due to > dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise > the fsyncs at the end of a checkpoint can be unpredictably slow. For > foreground writes we do not default to that, as there are users that won't > (because they don't know, because they overcommit hardware, ...) size > postgres' buffer pool to be big enough and thus will often re-dirty pages that > have already recently been written out to the operating systems. But for many > workloads it's recommened that users turn on > sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*). > > So for many workloads it'd be fine to just always start writeback for atomic > writes immediately. It's possible, but I am not at all sure, that for most of > the other workloads, the gains from atomic writes will outstrip the cost of > more frequently writing data back. OK, good. Then I think it's worth a try. > (*) As it turns out, it often seems to improves write throughput as well, if > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, > linux seems to often trigger a lot more small random IO. > > > So immediately writing them might be ok as long as we don't remove those > > pages from the page cache like we do in RWF_UNCACHED. > > Yes, it might. I actually often have wished for something like a > RWF_WRITEBACK flag... I'd call it RWF_WRITETHROUGH but otherwise it makes sense. > > > An argument against this however is that it is user's responsibility to > > > not do non atomic IO over an atomic range and this shall be considered a > > > userspace usage error. This is similar to how there are ways users can > > > tear a dio if they perform overlapping writes. [1]. > > Hm, the scope of the prohibition here is not clear to me. Would it just > be forbidden to do: > > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC) > P2: pwrite(fd, [any block in 1-10]), non-atomically > P1: complete pwritev(fd, ...) 
> > or is it also forbidden to do: > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes > Kernel: starts writeback but doesn't complete it > P1: pwrite(fd, [any block in 1-10]), non-atomically > Kernel: completes writeback > > The former is not at all an issue for postgres' use case, the pages in > our buffer pool that are undergoing IO are locked, preventing additional > IO (be it reads or writes) to those blocks. > > The latter would be a problem, since userspace wouldn't even know that > here is still "atomic writeback" going on, afaict the only way we could > avoid it would be to issue an f[data]sync(), which likely would be > prohibitively expensive. It somewhat depends on what outcome you expect in terms of crash safety :) Unless we are careful, the RWF_ATOMIC write in your latter example can end up writing some bits of the data from the second write because the second write may be copying data to the pages as we issue DMA from them to the device. I expect this isn't really acceptable because if you crash before the second write fully makes it to the disk, you will have inconsistent data. So what we can offer is to enable "stable pages" feature for the filesystem (support for buffered atomic writes would be conditioned by that) - that will block the second write until the IO is done so torn writes cannot happen. If quick overwrites are rare, this should be a fine option. If they are frequent, we'd need to come up with some bounce buffering but things get ugly quickly there. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 12:06 ` Jan Kara @ 2026-02-17 12:42 ` Pankaj Raghav 2026-02-17 16:21 ` Andres Freund 2026-02-17 16:13 ` Andres Freund 1 sibling, 1 reply; 38+ messages in thread From: Pankaj Raghav @ 2026-02-17 12:42 UTC (permalink / raw) To: Jan Kara, Andres Freund Cc: Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On 2/17/2026 1:06 PM, Jan Kara wrote: > On Mon 16-02-26 10:45:40, Andres Freund wrote: >>> Hmm, IIUC, postgres will write their dirty buffer cache by combining >>> multiple DB pages based on `io_combine_limit` (typically 128kb). >> >> We will try to do that, but it's obviously far from always possible, in some >> workloads [parts of ]the data in the buffer pool rarely will be dirtied in >> consecutive blocks. >> >> FWIW, postgres already tries to force some just-written pages into >> writeback. For sources of writes that can be plentiful and are done in the >> background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE), >> after 256kB-512kB of writes, as otherwise foreground latency can be >> significantly impacted by the kernel deciding to suddenly write back (due to >> dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise >> the fsyncs at the end of a checkpoint can be unpredictably slow. For >> foreground writes we do not default to that, as there are users that won't >> (because they don't know, because they overcommit hardware, ...) size >> postgres' buffer pool to be big enough and thus will often re-dirty pages that >> have already recently been written out to the operating systems. But for many >> workloads it's recommened that users turn on >> sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*). >> >> So for many workloads it'd be fine to just always start writeback for atomic >> writes immediately. It's possible, but I am not at all sure, that for most of >> the other workloads, the gains from atomic writes will outstrip the cost of >> more frequently writing data back. > > OK, good. Then I think it's worth a try. > >> (*) As it turns out, it often seems to improves write throughput as well, if >> writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, >> linux seems to often trigger a lot more small random IO. >> >>> So immediately writing them might be ok as long as we don't remove those >>> pages from the page cache like we do in RWF_UNCACHED. >> >> Yes, it might. I actually often have wished for something like a >> RWF_WRITEBACK flag... > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense. > One naive question: semantically what will be the difference between RWF_DSYNC and RWF_WRITETHROUGH? So RWF_DSYNC will be the sync version and RWF_WRITETHOUGH will be an async version where we kick off writeback immediately in the background and return? -- Pankaj ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 12:42 ` Pankaj Raghav @ 2026-02-17 16:21 ` Andres Freund 2026-02-18 1:04 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Andres Freund @ 2026-02-17 16:21 UTC (permalink / raw) To: Pankaj Raghav Cc: Jan Kara, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-17 13:42:35 +0100, Pankaj Raghav wrote: > On 2/17/2026 1:06 PM, Jan Kara wrote: > > On Mon 16-02-26 10:45:40, Andres Freund wrote: > > > (*) As it turns out, it often seems to improves write throughput as well, if > > > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, > > > linux seems to often trigger a lot more small random IO. > > > > > > > So immediately writing them might be ok as long as we don't remove those > > > > pages from the page cache like we do in RWF_UNCACHED. > > > > > > Yes, it might. I actually often have wished for something like a > > > RWF_WRITEBACK flag... > > > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense. > > > > One naive question: semantically what will be the difference between > RWF_DSYNC and RWF_WRITETHROUGH? So RWF_DSYNC will be the sync version and > RWF_WRITETHOUGH will be an async version where we kick off writeback > immediately in the background and return? Besides sync vs async: If the device has a volatile write cache, RWF_DSYNC will trigger flushes for the entire write cache or do FUA writes for just the RWF_DSYNC write. Which wouldn't be needed for RWF_WRITETHROUGH, right? I don't know if there will be devices that have a volatile write cache with atomicity support for > 4kB, so maybe that's a distinction that's irrelevant in practice for Postgres. But for 4kB writes, the difference in throughput and individual IO latency you get from many SSDs between using FUA writes / cache flushes and not doing so are enormous. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 16:21 ` Andres Freund @ 2026-02-18 1:04 ` Dave Chinner 2026-02-18 6:47 ` Christoph Hellwig 0 siblings, 1 reply; 38+ messages in thread From: Dave Chinner @ 2026-02-18 1:04 UTC (permalink / raw) To: Andres Freund Cc: Pankaj Raghav, Jan Kara, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 11:21:20AM -0500, Andres Freund wrote: > Hi, > > On 2026-02-17 13:42:35 +0100, Pankaj Raghav wrote: > > On 2/17/2026 1:06 PM, Jan Kara wrote: > > > On Mon 16-02-26 10:45:40, Andres Freund wrote: > > > > (*) As it turns out, it often seems to improves write throughput as well, if > > > > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, > > > > linux seems to often trigger a lot more small random IO. > > > > > > > > > So immediately writing them might be ok as long as we don't remove those > > > > > pages from the page cache like we do in RWF_UNCACHED. > > > > > > > > Yes, it might. I actually often have wished for something like a > > > > RWF_WRITEBACK flag... > > > > > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense. > > > > > > > One naive question: semantically what will be the difference between > > RWF_DSYNC and RWF_WRITETHROUGH? None, except that RWF_DSYNC provides data integrity guarantees. > > So RWF_DSYNC will be the sync version and > > RWF_WRITETHOUGH will be an async version where we kick off writeback > > immediately in the background and return? No. Write-through implies synchronous IO. i.e. that IO errors are reported immediately to the caller, not reported on the next operation on the file. O_DSYNC integrity writes are, by definition, write-through (synchronous) because they have to report physical IO completion status to the caller. This is kinda how "synchronous" got associated with data integrity in the first place. DIO writes are also write-through - there is nowhere to store an IO error for later reporting, so they must be executed synchronously to be able to report IO errors to the caller. Hence write-through generally implies synchronous IO, but it does not imply any data integrity guarantees are provided for the IO. If you want async RWF_WRITETHROUGH semantics, then the IO needs to be issued through an async IO submission interface (i.e. AIO or io_uring). In that case, the error status will be reported through the AIO completion, just like for DIO writes. IOWs, RWF_WRITETHROUGH should result in buffered writes displaying identical IO semantics to DIO writes. In doing this, we then we only need one IO path implementation per filesystem for all writethrough IO (buffered or direct) and the only thing that differs is the folios we attach to the bios. > Besides sync vs async: > > If the device has a volatile write cache, RWF_DSYNC will trigger flushes for > the entire write cache or do FUA writes for just the RWF_DSYNC write. Yes, that is exactly how the iomap DIO write path optimises RWF_DSYNC writes. It's much harder to do this for buffered IO using the generic buffered writeback paths and buffered writes never use FUA writes. i.e., using the iomap DIO path for RWF_WRITETHROUGH | RWF_DSYNC would bring these significant performance optimisations to buffered writes as well... > Which > wouldn't be needed for RWF_WRITETHROUGH, right? Correct, there shouldn't be any data integrity guarantees associated with plain RWF_WRITETHROUGH. 
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 1:04 ` Dave Chinner @ 2026-02-18 6:47 ` Christoph Hellwig 2026-02-18 23:42 ` Dave Chinner 0 siblings, 1 reply; 38+ messages in thread From: Christoph Hellwig @ 2026-02-18 6:47 UTC (permalink / raw) To: Dave Chinner Cc: Andres Freund, Pankaj Raghav, Jan Kara, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 12:04:43PM +1100, Dave Chinner wrote: > > > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense. > > > > > > > > > > One naive question: semantically what will be the difference between > > > RWF_DSYNC and RWF_WRITETHROUGH? > > None, except that RWF_DSYNC provides data integrity guarantees. Which boils down to RWF_DSYNC still writing out the inode and flushing the cache. > > Which > > wouldn't be needed for RWF_WRITETHROUGH, right? > > Correct, there shouldn't be any data integrity guarantees associated > with plain RWF_WRITETHROUGH. Which makes me curious if the plain RWF_WRITETHROUGH would be all that useful. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 6:47 ` Christoph Hellwig @ 2026-02-18 23:42 ` Dave Chinner 0 siblings, 0 replies; 38+ messages in thread From: Dave Chinner @ 2026-02-18 23:42 UTC (permalink / raw) To: Christoph Hellwig Cc: Andres Freund, Pankaj Raghav, Jan Kara, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 07:47:39AM +0100, Christoph Hellwig wrote: > On Wed, Feb 18, 2026 at 12:04:43PM +1100, Dave Chinner wrote: > > > > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense. > > > > > > > > > > > > > One naive question: semantically what will be the difference between > > > > RWF_DSYNC and RWF_WRITETHROUGH? > > > > None, except that RWF_DSYNC provides data integrity guarantees. > > Which boils down to RWF_DSYNC still writing out the inode and flushing > the cache. > > > > Which > > > wouldn't be needed for RWF_WRITETHROUGH, right? > > > > Correct, there shouldn't be any data integrity guarantees associated > > with plain RWF_WRITETHROUGH. > > Which makes me curious if the plain RWF_WRITETHROUGH would be all > that useful. For modern SSDs, I think the answer is yes. e.g. when you are doing lots of small writes to many files from many threads, it bottlenecks on single threaded writeback. All of the IO is submitted by background writeback which runs out of CPU fairly quickly. We end up dirty throttling and topping out at ~100k random 4kB buffered writes IOPS regardless of how much submitter concurrency we have. If we switch that to RWF_WRITETHROUGH, we now have N submitting threads that can all work in parallel, we get pretty much zero dirty folio backlog (so no dirty throttling and more consistent IO latency) and throughput can scales much higher because we have IO submitter concurrency to spread the CPU load around. I did a fsmark test of a write-though hack a couple of years back, creating and writing 4kB data files concurrently in a directory per thread. With vanilla writeback, it topped out at about 80k 4kB file creates/s from 4 threads and only wnet slower the more I increased the userspace create concurrency. Using writethrough submission, it topped out at about 400k 4kB file creates/s from 32 threads and was largely limited in the fsmark tasks by the CPU overhead for file creation, user data copying and data extent space allocation. I also did a multi-file, multi-process random 4kB write test with fio, using files much larger than memory and long runtimes. Once the normal background write path started dirty throttling, it ran at about 100k 4kB write IOPS, again limited by the single threaded writeback flusher using all it's CPU time for allocating blocks during writeback. Using writethrough, I saw about 900k IOPS being sustained right from the start, largely limited by a combination of CPU usage and IO latency in the fio task context. In comparison, the same workload with DIO ran to the storage capability of 1.6M IOPS because it had significantly lower CPU usage and IO latency. I also did some kernel compile tests with writethrough for all buffered write IO. On fast storage there was neglible difference in performance between vanilla buffered writes and submitter driver blocking write-through. This result made me question the need for caching on modern SSDs at all :) -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 12:06 ` Jan Kara 2026-02-17 12:42 ` Pankaj Raghav @ 2026-02-17 16:13 ` Andres Freund 2026-02-17 18:27 ` Ojaswin Mujoo 2026-02-18 17:37 ` Jan Kara 1 sibling, 2 replies; 38+ messages in thread From: Andres Freund @ 2026-02-17 16:13 UTC (permalink / raw) To: Jan Kara Cc: Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-17 13:06:04 +0100, Jan Kara wrote: > On Mon 16-02-26 10:45:40, Andres Freund wrote: > > (*) As it turns out, it often seems to improves write throughput as well, if > > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, > > linux seems to often trigger a lot more small random IO. > > > > > So immediately writing them might be ok as long as we don't remove those > > > pages from the page cache like we do in RWF_UNCACHED. > > > > Yes, it might. I actually often have wished for something like a > > RWF_WRITEBACK flag... > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense. Heh, that makes sense. I think that's what I actually was thinking of. > > > > An argument against this however is that it is user's responsibility to > > > > not do non atomic IO over an atomic range and this shall be considered a > > > > userspace usage error. This is similar to how there are ways users can > > > > tear a dio if they perform overlapping writes. [1]. > > > > Hm, the scope of the prohibition here is not clear to me. Would it just > > be forbidden to do: > > > > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC) > > P2: pwrite(fd, [any block in 1-10]), non-atomically > > P1: complete pwritev(fd, ...) > > > > or is it also forbidden to do: > > > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes > > Kernel: starts writeback but doesn't complete it > > P1: pwrite(fd, [any block in 1-10]), non-atomically > > Kernel: completes writeback > > > > The former is not at all an issue for postgres' use case, the pages in > > our buffer pool that are undergoing IO are locked, preventing additional > > IO (be it reads or writes) to those blocks. > > > > The latter would be a problem, since userspace wouldn't even know that > > here is still "atomic writeback" going on, afaict the only way we could > > avoid it would be to issue an f[data]sync(), which likely would be > > prohibitively expensive. > > It somewhat depends on what outcome you expect in terms of crash safety :) > Unless we are careful, the RWF_ATOMIC write in your latter example can end > up writing some bits of the data from the second write because the second > write may be copying data to the pages as we issue DMA from them to the > device. Hm. It's somewhat painful to not know when we can write in what mode again - with DIO that's not an issue. I guess we could use sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know? Although the semantics of the SFR flags aren't particularly clear, so maybe not? > I expect this isn't really acceptable because if you crash before > the second write fully makes it to the disk, you will have inconsistent > data. 
The scenarios that I can think that would lead us to doing something like this, are when we are overwriting data without regard for the prior contents, e.g: An already partially filled page is filled with more rows, we write that page out, then all the rows are deleted, and we re-fill the page with new content from scratch. Write it out again. With our existing logic we treat the second write differently, because the entire contents of the page will be in the journal, as there is no prior content that we care about. A second scenario in which we might not use RWF_ATOMIC, if we carry today's logic forward, is if a newly created relation is bulk loaded in the same transaction that created the relation. If a crash were to happen while that bulk load is ongoing, we don't care about the contents of the file(s), as it will never be visible to anyone after crash recovery. In this case we won't have prio RWF_ATOMIC writes - but we could have the opposite, i.e. an RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page cache. Would that be an issue? It's possible we should just always use RWF_ATOMIC, even in the cases where it's not needed from our side, to avoid potential performance penalties and "undefined behaviour". I guess that will really depend on the performance penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will eventually be supported (as doing small writes during bulk loading is quite expensive). Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
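Spelling out the sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) idea
Andres raises earlier in this mail, a sketch of "wait for any in-flight
writeback of this range, then overwrite non-atomically" could look like
the following. Whether WAIT_BEFORE is actually a strong enough barrier
here (it only waits for writeback already in flight and gives no
durability guarantee) is precisely the open question.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/uio.h>

  static ssize_t overwrite_nonatomic(int fd, void *buf, size_t len, off_t off)
  {
          struct iovec iov = { .iov_base = buf, .iov_len = len };

          /* wait for any writeback (possibly REQ_ATOMIC) already issued
           * for this range before modifying the page cache copy */
          sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE);

          return pwritev2(fd, &iov, 1, off, 0 /* plain, non-atomic write */);
  }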
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 16:13 ` Andres Freund @ 2026-02-17 18:27 ` Ojaswin Mujoo 2026-02-17 18:42 ` Andres Freund 2026-02-18 17:37 ` Jan Kara 1 sibling, 1 reply; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 18:27 UTC (permalink / raw) To: Andres Freund Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 11:13:07AM -0500, Andres Freund wrote: > Hi, > > On 2026-02-17 13:06:04 +0100, Jan Kara wrote: > > On Mon 16-02-26 10:45:40, Andres Freund wrote: > > > (*) As it turns out, it often seems to improves write throughput as well, if > > > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, > > > linux seems to often trigger a lot more small random IO. > > > > > > > So immediately writing them might be ok as long as we don't remove those > > > > pages from the page cache like we do in RWF_UNCACHED. > > > > > > Yes, it might. I actually often have wished for something like a > > > RWF_WRITEBACK flag... > > > > I'd call it RWF_WRITETHROUGH but otherwise it makes sense. > > Heh, that makes sense. I think that's what I actually was thinking of. > > > > > > > An argument against this however is that it is user's responsibility to > > > > > not do non atomic IO over an atomic range and this shall be considered a > > > > > userspace usage error. This is similar to how there are ways users can > > > > > tear a dio if they perform overlapping writes. [1]. > > > > > > Hm, the scope of the prohibition here is not clear to me. Would it just > > > be forbidden to do: > > > > > > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC) > > > P2: pwrite(fd, [any block in 1-10]), non-atomically > > > P1: complete pwritev(fd, ...) > > > > > > or is it also forbidden to do: > > > > > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes > > > Kernel: starts writeback but doesn't complete it > > > P1: pwrite(fd, [any block in 1-10]), non-atomically > > > Kernel: completes writeback > > > > > > The former is not at all an issue for postgres' use case, the pages in > > > our buffer pool that are undergoing IO are locked, preventing additional > > > IO (be it reads or writes) to those blocks. > > > > > > The latter would be a problem, since userspace wouldn't even know that > > > here is still "atomic writeback" going on, afaict the only way we could > > > avoid it would be to issue an f[data]sync(), which likely would be > > > prohibitively expensive. > > > > It somewhat depends on what outcome you expect in terms of crash safety :) > > Unless we are careful, the RWF_ATOMIC write in your latter example can end > > up writing some bits of the data from the second write because the second > > write may be copying data to the pages as we issue DMA from them to the > > device. > > Hm. It's somewhat painful to not know when we can write in what mode again - > with DIO that's not an issue. I guess we could use > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know? > Although the semantics of the SFR flags aren't particularly clear, so maybe > not? > > > > I expect this isn't really acceptable because if you crash before > > the second write fully makes it to the disk, you will have inconsistent > > data. 
> > The scenarios that I can think that would lead us to doing something like > this, are when we are overwriting data without regard for the prior contents, > e.g: > > An already partially filled page is filled with more rows, we write that page > out, then all the rows are deleted, and we re-fill the page with new content > from scratch. Write it out again. With our existing logic we treat the second > write differently, because the entire contents of the page will be in the > journal, as there is no prior content that we care about. Hi Andres, From my mental model and very high level understanding of Postgres' WAL model [1] I am under the impression that for moving from full page writes to RWF_ATOMIC, we would need to ensure that the **disk** write IO of any data buffer should go in an untorn fashion. Now, coming to your example, IIUC here we can actually tolerate doing the 2nd write above non-atomically because it is already a sort of full page write in the journal. So let's say we do something like: 0. Buffer has some initial value on disk 1. Write new rows into buffer 2. Write the buffer as RWF_ATOMIC 3. Overwrite the complete buffer which will journal all the contents 4. Write the buffer as non RWF_ATOMIC 5. Crash I think it is still possible to satisfy my assumption of **disk** IO being untorn. For example, here we can have an RWF_ATOMIC implementation where the data on disk after crash could either be in initial state 0. or be the new value after 4. This is not strictly the old or new semantic but still ensures the data is consistent. My naive understanding says that as long as the disk has consistent/untorn data, like above, we can recover via the journal. In this case the kernel implementation should be able to tolerate mixing of atomic and non atomic writes, but again I might be wrong here. However, if the above guarantees are not enough and we actually care about the true old-or-new semantic, we would need something like RWF_WRITETHROUGH to ensure we get truly old or new. [1] https://www.interdb.jp/pg/pgsql09/01.html > > A second scenario in which we might not use RWF_ATOMIC, if we carry today's > logic forward, is if a newly created relation is bulk loaded in the same > transaction that created the relation. If a crash were to happen while that > bulk load is ongoing, we don't care about the contents of the file(s), as it > will never be visible to anyone after crash recovery. In this case we won't > have prio RWF_ATOMIC writes - but we could have the opposite, i.e. an > RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page > cache. Would that be an issue? I think this is the same discussion as above. > > > It's possible we should just always use RWF_ATOMIC, even in the cases where > it's not needed from our side, to avoid potential performance penalties and > "undefined behaviour". I guess that will really depend on the performance > penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will > eventually be supported (as doing small writes during bulk loading is quite > expensive). > > > Greetings, > > Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
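A minimal userspace sketch of the 0.-5. sequence above, assuming the proposed buffered RWF_ATOMIC support (today the kernel honours RWF_ATOMIC only for direct I/O); the file name, page size and offset are illustrative, and RWF_ATOMIC is defined locally in case the libc headers predate it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* same value as include/uapi/linux/fs.h */
#endif

#define DB_PAGE 8192            /* illustrative database page size */

int main(void)
{
    char page[DB_PAGE];
    struct iovec iov = { .iov_base = page, .iov_len = sizeof(page) };
    off_t off = 0;              /* must respect the advertised atomic write unit */
    int fd = open("relation.dat", O_CREAT | O_RDWR, 0644);

    if (fd < 0)
        return 1;

    /* steps 1. + 2.: fill the buffer and write it out untorn */
    memset(page, 'A', sizeof(page));
    if (pwritev2(fd, &iov, 1, off, RWF_ATOMIC) != DB_PAGE)
        perror("atomic write");

    /* steps 3. + 4.: the page is rebuilt from scratch and fully
     * journalled, so this write-out is issued without RWF_ATOMIC */
    memset(page, 'B', sizeof(page));
    if (pwritev2(fd, &iov, 1, off, 0) != DB_PAGE)
        perror("plain write");

    /* step 5.: a crash from here on is the case under discussion; may
     * the on-disk block mix bytes from the two writes? */
    close(fd);
    return 0;
}

Whether the on-disk block can mix bytes from both writes after the crash in step 5 is exactly the question the rest of this sub-thread digs into.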
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 18:27 ` Ojaswin Mujoo @ 2026-02-17 18:42 ` Andres Freund 0 siblings, 0 replies; 38+ messages in thread From: Andres Freund @ 2026-02-17 18:42 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-17 23:57:50 +0530, Ojaswin Mujoo wrote: > From my mental model and very high level understanding of Postgres' WAL > model [1] I am under the impression that for moving from full page > writes to RWF_ATOMIC, we would need to ensure that the **disk** write IO > of any data buffer should go in an untorn fashion. Right. > Now, coming to your example, IIUC here we can actually tolerate to do > the 2nd write above non atomically because it is already a sort of full > page write in the journal. > > So lets say if we do something like: > > 0. Buffer has some initial value on disk > 1. Write new rows into buffer > 2. Write the buffer as RWF_ATOMIC > 3. Overwrite the complete buffer which will journal all the contents > 4. Write the buffer as non RWF_ATOMIC > 5. Crash > > I think it is still possible to satisfy my assumption of **disk** IO > being untorn. Example, here we can have an RWF_ATOMIC implementation > where the data on disk after crash could either be in initial state 0. > or be the new value after 4. This is not strictly the old or new > semantic but still ensures the data is consistent. The way I understand Jan is that, unless we are careful with the write in 4), the writeback of the atomic write from 2) could still be in progress, with the copy from userspace to the pagecache from 4) happening in the middle of the DMA for the write from 2), leading to a torn page on-disk, even though the disk actually behaved correctly. > My naive understanding says that as long as disk has consistent/untorn > data, like above, we can recover via the journal. Yes, if that were true, we could recover. But if my understanding of Jan's concern is right, that'd not necessarily be guaranteed. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 16:13 ` Andres Freund 2026-02-17 18:27 ` Ojaswin Mujoo @ 2026-02-18 17:37 ` Jan Kara 2026-02-18 21:04 ` Andres Freund 2026-02-19 0:32 ` Dave Chinner 1 sibling, 2 replies; 38+ messages in thread From: Jan Kara @ 2026-02-18 17:37 UTC (permalink / raw) To: Andres Freund Cc: Jan Kara, Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue 17-02-26 11:13:07, Andres Freund wrote: > > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes > > > Kernel: starts writeback but doesn't complete it > > > P1: pwrite(fd, [any block in 1-10]), non-atomically > > > Kernel: completes writeback > > > > > > The former is not at all an issue for postgres' use case, the pages in > > > our buffer pool that are undergoing IO are locked, preventing additional > > > IO (be it reads or writes) to those blocks. > > > > > > The latter would be a problem, since userspace wouldn't even know that > > > here is still "atomic writeback" going on, afaict the only way we could > > > avoid it would be to issue an f[data]sync(), which likely would be > > > prohibitively expensive. > > > > It somewhat depends on what outcome you expect in terms of crash safety :) > > Unless we are careful, the RWF_ATOMIC write in your latter example can end > > up writing some bits of the data from the second write because the second > > write may be copying data to the pages as we issue DMA from them to the > > device. > > Hm. It's somewhat painful to not know when we can write in what mode again - > with DIO that's not an issue. I guess we could use > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know? > Although the semantics of the SFR flags aren't particularly clear, so maybe > not? If you used RWF_WRITETHROUGH for your writes (so you are sure IO has already started) then sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) would indeed be a safe way of waiting for that IO to complete (or just wait for the write(2) syscall itself to complete if we make RWF_WRITETHROUGH wait for IO completion as Dave suggests - but I guess writes may happen from multiple threads so that may be not very convenient and sync_file_range(2) might be actually easier). > > I expect this isn't really acceptable because if you crash before > > the second write fully makes it to the disk, you will have inconsistent > > data. > > The scenarios that I can think that would lead us to doing something like > this, are when we are overwriting data without regard for the prior contents, > e.g: > > An already partially filled page is filled with more rows, we write that page > out, then all the rows are deleted, and we re-fill the page with new content > from scratch. Write it out again. With our existing logic we treat the second > write differently, because the entire contents of the page will be in the > journal, as there is no prior content that we care about. > > A second scenario in which we might not use RWF_ATOMIC, if we carry today's > logic forward, is if a newly created relation is bulk loaded in the same > transaction that created the relation. If a crash were to happen while that > bulk load is ongoing, we don't care about the contents of the file(s), as it > will never be visible to anyone after crash recovery. In this case we won't > have prio RWF_ATOMIC writes - but we could have the opposite, i.e. 
an > RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page > cache. Would that be an issue? No, this should be fine. But as I'm thinking about it what seems the most natural is that RWF_WRITETHROUGH writes will wait on any pages under writeback in the target range before proceeding with the write. That will give user proper serialization with other RWF_WRITETHROUGH writes to the overlapping range as well as writeback from previous normal writes. So the only case that needs handling - either by userspace or kernel forcing stable writes - would be RWF_WRITETHROUGH write followed by a normal write. > It's possible we should just always use RWF_ATOMIC, even in the cases where > it's not needed from our side, to avoid potential performance penalties and > "undefined behaviour". I guess that will really depend on the performance > penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will > eventually be supported (as doing small writes during bulk loading is quite > expensive). Sure, that's a possibility as well. I guess it requires some experimentation and benchmarking to pick a proper tradeoff. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
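A small sketch of the userspace-side wait described above, using only interfaces that exist today (pwritev2() and sync_file_range()); the fd, offset and iovec are illustrative, and the proposed RWF_WRITETHROUGH flag is deliberately not used:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Wait for writeback that has already been submitted over a block before
 * re-dirtying it in a different mode.  SYNC_FILE_RANGE_WAIT_BEFORE only
 * waits for in-flight writeback of the range; it does not start new
 * writeback or flush the device cache.
 */
static int wait_then_rewrite(int fd, off_t off, const struct iovec *iov)
{
    if (sync_file_range(fd, off, iov->iov_len,
                        SYNC_FILE_RANGE_WAIT_BEFORE) < 0) {
        perror("sync_file_range");
        return -1;
    }

    /* A plain (non-atomic) overwrite can no longer race with the
     * earlier IO to this range. */
    if (pwritev2(fd, iov, 1, off, 0) != (ssize_t)iov->iov_len) {
        perror("pwritev2");
        return -1;
    }
    return 0;
}

If RWF_WRITETHROUGH ends up waiting for IO completion before the write syscall returns, as suggested later in the thread, the explicit sync_file_range() step becomes unnecessary.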
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 17:37 ` Jan Kara @ 2026-02-18 21:04 ` Andres Freund 2026-02-19 0:32 ` Dave Chinner 1 sibling, 0 replies; 38+ messages in thread From: Andres Freund @ 2026-02-18 21:04 UTC (permalink / raw) To: Jan Kara Cc: Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-18 18:37:45 +0100, Jan Kara wrote: > On Tue 17-02-26 11:13:07, Andres Freund wrote: > > Hm. It's somewhat painful to not know when we can write in what mode again - > > with DIO that's not an issue. I guess we could use > > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know? > > Although the semantics of the SFR flags aren't particularly clear, so maybe > > not? > > If you used RWF_WRITETHROUGH for your writes (so you are sure IO has > already started) then sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) would > indeed be a safe way of waiting for that IO to complete (or just wait for > the write(2) syscall itself to complete if we make RWF_WRITETHROUGH wait > for IO completion as Dave suggests - but I guess writes may happen from > multiple threads so that may be not very convenient and sync_file_range(2) > might be actually easier). For us a synchronously blocking RWF_WRITETHROUGH would actually be easier, I think. The issue with writes from multiple threads actually goes the other way for us - without knowing when the IO actually completes, our buffer pool's state cannot reflect whether there is ongoing IO for a buffer or not. So we would always have to do sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) before doing further IO. Not knowing how many writes are actually outstanding also makes it harder for us to avoid overwhelming the storage (triggering e.g. poor commit latency). Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 17:37 ` Jan Kara 2026-02-18 21:04 ` Andres Freund @ 2026-02-19 0:32 ` Dave Chinner 1 sibling, 0 replies; 38+ messages in thread From: Dave Chinner @ 2026-02-19 0:32 UTC (permalink / raw) To: Jan Kara Cc: Andres Freund, Pankaj Raghav, Ojaswin Mujoo, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 06:37:45PM +0100, Jan Kara wrote: > On Tue 17-02-26 11:13:07, Andres Freund wrote: > > > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes > > > > Kernel: starts writeback but doesn't complete it > > > > P1: pwrite(fd, [any block in 1-10]), non-atomically > > > > Kernel: completes writeback > > > > > > > > The former is not at all an issue for postgres' use case, the pages in > > > > our buffer pool that are undergoing IO are locked, preventing additional > > > > IO (be it reads or writes) to those blocks. > > > > > > > > The latter would be a problem, since userspace wouldn't even know that > > > > here is still "atomic writeback" going on, afaict the only way we could > > > > avoid it would be to issue an f[data]sync(), which likely would be > > > > prohibitively expensive. > > > > > > It somewhat depends on what outcome you expect in terms of crash safety :) > > > Unless we are careful, the RWF_ATOMIC write in your latter example can end > > > up writing some bits of the data from the second write because the second > > > write may be copying data to the pages as we issue DMA from them to the > > > device. > > > > Hm. It's somewhat painful to not know when we can write in what mode again - > > with DIO that's not an issue. I guess we could use > > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know? > > Although the semantics of the SFR flags aren't particularly clear, so maybe > > not? > > If you used RWF_WRITETHROUGH for your writes (so you are sure IO has > already started) then sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) would > indeed be a safe way of waiting for that IO to complete (or just wait for > the write(2) syscall itself to complete if we make RWF_WRITETHROUGH wait > for IO completion as Dave suggests - but I guess writes may happen from > multiple threads so that may be not very convenient and sync_file_range(2) > might be actually easier). I would much prefer we don't have to rely on crappy interfaces like sync_file_range() to handle RWF_WRITETHROUGH IO completion processing. All it does is add complexity to error handling/propagation to both the kernel code and the userspace code. It takes something that is easy to get right (i.e. synchronous completion) and replaces it with something that is easy to get wrong. That's not good API design. As for handling multiple writes to the same range, stable pages do that for us. RWF_WRITETHROUGH will need to set folios in the writeback state before submission and clear it after completion so that stable pages work correctly. Hence we may as well use that functionality to serialise overlapping RWF_WRITETHROUGH IOs and against concurrent background and data integrity driven writeback We should be trying hard to keep this simple and consistent with existing write-through IO models that people already know how to use (i.e. DIO). > > > I expect this isn't really acceptable because if you crash before > > > the second write fully makes it to the disk, you will have inconsistent > > > data. 
> > > > The scenarios that I can think that would lead us to doing something like > > this, are when we are overwriting data without regard for the prior contents, > > e.g: > > > > An already partially filled page is filled with more rows, we write that page > > out, then all the rows are deleted, and we re-fill the page with new content > > from scratch. Write it out again. With our existing logic we treat the second > > write differently, because the entire contents of the page will be in the > > journal, as there is no prior content that we care about. > > > > A second scenario in which we might not use RWF_ATOMIC, if we carry today's > > logic forward, is if a newly created relation is bulk loaded in the same > > transaction that created the relation. If a crash were to happen while that > > bulk load is ongoing, we don't care about the contents of the file(s), as it > > will never be visible to anyone after crash recovery. In this case we won't > > have prio RWF_ATOMIC writes - but we could have the opposite, i.e. an > > RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page > > cache. Would that be an issue? > > No, this should be fine. But as I'm thinking about it what seems the most > natural is that RWF_WRITETHROUGH writes will wait on any pages under > writeback in the target range before proceeding with the write. I think that is required behaviour, not just natural behaviour. IMO, concurrent overlapping physical IOs from the page cache via RWF_WRITETHROUGH is a data corruption vector just waiting for someone to trip over it... i.e. we need to keep in mind that one of the guarantees that the page cache provides is that it will never overlap multiple concurrent physical IOs to the same physical range. Overlapping IOs are handled and serialised at the folio level; they should never end up with overlapping physical IO being issued. > That will > give user proper serialization with other RWF_WRITETHROUGH writes to the > overlapping range as well as writeback from previous normal writes. So the > only case that needs handling - either by userspace or kernel forcing > stable writes - would be RWF_WRITETHROUGH write followed by a normal write. *nod*. I think forcing stable writes for RWF_WRITETHROUGH is the right way to go. We are going to need stable write semantics for RWF_ATOMIC support, and we probably should have them for RWF_DSYNC as well because the data integrity guarantees cover the data in that specific user IO, not any other previous, concurrent or future user IO. -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 15:45 ` Andres Freund 2026-02-17 12:06 ` Jan Kara @ 2026-02-17 18:33 ` Ojaswin Mujoo 1 sibling, 0 replies; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 18:33 UTC (permalink / raw) To: Andres Freund Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon, Feb 16, 2026 at 10:45:40AM -0500, Andres Freund wrote: > Hi, > > On 2026-02-16 10:52:35 +0100, Pankaj Raghav wrote: > > On 2/13/26 14:32, Ojaswin Mujoo wrote: > > > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > > >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there > > >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso. > > >> Based on the conversation/blockers we had before, the discussion at LSFMM > > >> should focus on the following blocking issues: > > >> > > >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic > > >> write might span page boundaries. If memory pressure causes a page > > >> fault or reclaim mid-copy, the write could be torn inside the page > > >> cache before it even reaches the filesystem. > > >> - The current RFC uses a "pinning" approach: pinning user pages and > > >> creating a BVEC to ensure the full copy can proceed atomically. > > >> This adds complexity to the write path. > > >> - Discussion: Is this acceptable? Should we consider alternatives, > > >> such as requiring userspace to mlock the I/O buffers before > > >> issuing the write to guarantee atomic copy in the page cache? > > > > > > Right, I chose this approach because we only get to know about the short > > > copy after it has actually happened in copy_folio_from_iter_atomic() > > > and it seemed simpler to just not let the short copy happen. This is > > > inspired from how dio pins the pages for DMA, just that we do it > > > for a shorter time. > > > > > > It does add slight complexity to the path but I'm not sure if it's complex > > > enough to justify adding a hard requirement of having pages mlock'd. > > > > > > > As databases like postgres have a buffer cache that they manage in userspace, > > which is eventually used to do IO, I am wondering if they already do a mlock > > or some other way to guarantee the buffer cache does not get reclaimed. That is > > why I was thinking if we could make it a requirement. Of course, that also requires > > checking if the range is mlocked in the iomap_write_iter path. > > We don't generally mlock our buffer pool - but we strongly recommend to use > explicit huge pages (due to TLB pressure, faster fork() and less memory wasted > on page tables), which afaict has basically the same effect. However, that > doesn't make the page cache pages locked... > > > > >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a > > >> PG_atomic page flag to track dirty pages requiring atomic writeback. > > >> This faced pushback due to page flags being a scarce resource[7]. > > >> Furthermore, it was argued that atomic model does not fit the buffered > > >> I/O model because data sitting in the page cache is vulnerable to > > >> modification before writeback occurs, and writeback does not preserve > > >> application ordering[8]. > > >> - Dave Chinner has proposed leveraging the filesystem's CoW path > > >> where we always allocate new blocks for the atomic write (forced > > >> CoW). 
If the hardware supports it (e.g., NVMe atomic limits), the > > >> filesystem can optimize the writeback to use REQ_ATOMIC in place, > > >> avoiding the CoW overhead while maintaining the architectural > > >> separation. > > > > > > Right, this is what I'm doing in the new RFC where we maintain the > > > mappings for atomic write in COW fork. This way we are able to utilize a > > > lot of existing infrastructure, however it does add some complexity to > > > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe > > > it is a tradeoff since the general consesus was mostly to avoid adding > > > too much complexity to iomap layer. > > > > > > Another thing that came up is to consider using write through semantics > > > for buffered atomic writes, where we are able to transition page to > > > writeback state immediately after the write and avoid any other users to > > > modify the data till writeback completes. This might affect performance > > > since we won't be able to batch similar atomic IOs but maybe > > > applications like postgres would not mind this too much. If we go with > > > this approach, we will be able to avoid worrying too much about other > > > users changing atomic data underneath us. > > > > > > > Hmm, IIUC, postgres will write their dirty buffer cache by combining > > multiple DB pages based on `io_combine_limit` (typically 128kb). > > We will try to do that, but it's obviously far from always possible, in some > workloads [parts of ]the data in the buffer pool rarely will be dirtied in > consecutive blocks. > > FWIW, postgres already tries to force some just-written pages into > writeback. For sources of writes that can be plentiful and are done in the > background, we default to issuing sync_file_range(SYNC_FILE_RANGE_WRITE), > after 256kB-512kB of writes, as otherwise foreground latency can be > significantly impacted by the kernel deciding to suddenly write back (due to > dirty_writeback_centisecs, dirty_background_bytes, ...) and because otherwise > the fsyncs at the end of a checkpoint can be unpredictably slow. For > foreground writes we do not default to that, as there are users that won't > (because they don't know, because they overcommit hardware, ...) size > postgres' buffer pool to be big enough and thus will often re-dirty pages that > have already recently been written out to the operating systems. But for many > workloads it's recommened that users turn on > sync_file_range(SYNC_FILE_RANGE_WRITE) for foreground writes as well (*). > > So for many workloads it'd be fine to just always start writeback for atomic > writes immediately. It's possible, but I am not at all sure, that for most of > the other workloads, the gains from atomic writes will outstrip the cost of > more frequently writing data back. > > > (*) As it turns out, it often seems to improves write throughput as well, if > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE, > linux seems to often trigger a lot more small random IO. > > > > So immediately writing them might be ok as long as we don't remove those > > pages from the page cache like we do in RWF_UNCACHED. > > Yes, it might. I actually often have wished for something like a > RWF_WRITEBACK flag... > > > > > An argument against this however is that it is user's responsibility to > > > not do non atomic IO over an atomic range and this shall be considered a > > > userspace usage error. This is similar to how there are ways users can > > > tear a dio if they perform overlapping writes. [1]. 
> > Hm, the scope of the prohibition here is not clear to me. Would it just > be forbidden to do: > > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC) > P2: pwrite(fd, [any block in 1-10]), non-atomically > P1: complete pwritev(fd, ...) > > or is it also forbidden to do: > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes > Kernel: starts writeback but doesn't complete it > P1: pwrite(fd, [any block in 1-10]), non-atomically > Kernel: completes writeback > > The former is not at all an issue for postgres' use case, the pages in our > buffer pool that are undergoing IO are locked, preventing additional IO (be it > reads or writes) to those blocks. > > The latter would be a problem, since userspace wouldn't even know that here is > still "atomic writeback" going on, afaict the only way we could avoid it would > be to issue an f[data]sync(), which likely would be prohibitively expensive. > > > > > That being said, I think these points are worth discussing and it would > > > be helpful to have people from postgres around while discussing these > > > semantics with the FS community members. > > > > > > As for ordering of writes, I'm not sure if that is something that > > > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly > > > been the task of userspace via fsync() and friends. > > > > Agreed. > > From postgres' side that's fine. In the cases we care about ordering we use > fsync() already. > > > > > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > > > > > >> - Discussion: While the CoW approach fits XFS and other CoW > > >> filesystems well, it presents challenges for filesystems like ext4 > > >> which lack CoW capabilities for data. Should this be a filesystem > > >> specific feature? > > > > > > I believe your question is if we should have a hard dependency on COW > > > mappings for atomic writes. Currently, COW in atomic write context in > > > XFS, is used for these 2 things: > > > > > > 1. COW fork holds atomic write ranges. > > > > > > This is not strictly a COW feature, just that we are repurposing the COW > > > fork to hold our atomic ranges. Basically a way for writeback path to > > > know that atomic write was done here. > > Does that mean buffered atomic writes would cause fragmentation? Some common > database workloads, e.g. anything running on cheaper cloud storage, are pretty > sensitive to that due to the increase in use of the metered IOPS. > Hi Andres, So we have tricks like allocating more blocks than needed which helps with fragmentation even when using COW fork. I think we are able to tune how aggressively we want to preallocate more blocks. Further, if we have, say, fallocated a range in the file which satisfies our requirements, then we can also upgrade to HW (non cow) atomic writes and use the falloc'd extents, which will also help with fragmentation. My point being, I don't think COW usage will strictly mean more fragmentation; however, we will eventually need to run benchmarks and see. Hopefully once I have the implementation, we can work on these things. Regards, ojaswin > Greetings, > > Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
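A hedged sketch of the userspace side of the fallocate-then-use-HW-atomics idea above: preallocate space up front and query the advertised atomic write limits via statx(). fallocate() and STATX_WRITE_ATOMIC with the stx_atomic_write_unit_* fields are existing interfaces (the latter need reasonably new kernel and libc headers); the path and preallocation size are illustrative:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct statx stx;
    int fd = open("relation.dat", O_CREAT | O_RDWR, 0644);

    if (fd < 0)
        return 1;

    /* Preallocate space up front; as noted above, writes that land in
     * fallocate'd extents may be eligible for HW (non-CoW) atomic
     * writeback.  1 GiB is an arbitrary illustrative size. */
    if (fallocate(fd, 0, 0, 1024LL * 1024 * 1024) < 0)
        perror("fallocate");

    /* Query the atomic write limits the filesystem/device advertise so
     * untorn writes can be sized and aligned accordingly. */
    if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) == 0 &&
        (stx.stx_mask & STATX_WRITE_ATOMIC))
        printf("atomic write unit: min %u max %u, max segments %u\n",
               stx.stx_atomic_write_unit_min,
               stx.stx_atomic_write_unit_max,
               stx.stx_atomic_write_segments_max);

    close(fd);
    return 0;
}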
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 9:52 ` Pankaj Raghav 2026-02-16 15:45 ` Andres Freund @ 2026-02-17 17:20 ` Ojaswin Mujoo 2026-02-18 17:42 ` [Lsf-pc] " Jan Kara 1 sibling, 1 reply; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 17:20 UTC (permalink / raw) To: Pankaj Raghav Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote: > On 2/13/26 14:32, Ojaswin Mujoo wrote: > > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > >> Hi all, > >> > >> Atomic (untorn) writes for Direct I/O have successfully landed in kernel > >> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O > >> remains a contentious topic, with previous discussions often stalling due to > >> concerns about complexity versus utility. > >> > >> I would like to propose a session to discuss the concrete use cases for > >> buffered atomic writes and if possible, talk about the outstanding > >> architectural blockers blocking the current RFCs[3][4]. > > > > Hi Pankaj, > > > > Thanks for the proposal and glad to hear there is a wider interest in > > this topic. We have also been actively working on this and I in middle > > of testing and ironing out bugs in my RFC v2 for buffered atomic > > writes, which is largely based on Dave's suggestions to maintain atomic > > write mappings in FS layer (aka XFS COW fork). Infact I was going to > > propose a discussion on this myself :) > > > > Perfect. > > >> > >> ## Use Case: > >> > >> A recurring objection to buffered atomics is the lack of a convincing use > >> case, with the argument that databases should simply migrate to direct I/O. > >> We have been working with PostgreSQL developer Andres Freund, who has > >> highlighted a specific architectural requirement where buffered I/O remains > >> preferable in certain scenarios. > > > > Looks like you have some nice insights to cover from postgres side which > > filesystem community has been asking for. As I've also been working on > > the kernel implementation side of it, do you think we could do a joint > > session on this topic? > > > As one of the main pushback for this feature has been a valid usecase, the main > outcome I would like to get out of this session is a community consensus on the use case > for this feature. > > It looks like you already made quite a bit of progress with the CoW impl, so it > would be great to if it can be a joint session. Awesome! > > > >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there > >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso. > >> Based on the conversation/blockers we had before, the discussion at LSFMM > >> should focus on the following blocking issues: > >> > >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic > >> write might span page boundaries. If memory pressure causes a page > >> fault or reclaim mid-copy, the write could be torn inside the page > >> cache before it even reaches the filesystem. > >> - The current RFC uses a "pinning" approach: pinning user pages and > >> creating a BVEC to ensure the full copy can proceed atomically. > >> This adds complexity to the write path. > >> - Discussion: Is this acceptable? 
Should we consider alternatives, > >> such as requiring userspace to mlock the I/O buffers before > >> issuing the write to guarantee atomic copy in the page cache? > > > > Right, I chose this approach because we only get to know about the short > > copy after it has actually happened in copy_folio_from_iter_atomic() > > and it seemed simpler to just not let the short copy happen. This is > > inspired from how dio pins the pages for DMA, just that we do it > > for a shorter time. > > > > It does add slight complexity to the path but I'm not sure if it's complex > > enough to justify adding a hard requirement of having pages mlock'd. > > > > As databases like postgres have a buffer cache that they manage in userspace, > which is eventually used to do IO, I am wondering if they already do a mlock > or some other way to guarantee the buffer cache does not get reclaimed. That is > why I was thinking if we could make it a requirement. Of course, that also requires > checking if the range is mlocked in the iomap_write_iter path. Hmm got it,I still feel it might be an overkill for something we already have a mechanism for and can achieve easily, but I'm open to discussion on this :) > > >> > >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a > >> PG_atomic page flag to track dirty pages requiring atomic writeback. > >> This faced pushback due to page flags being a scarce resource[7]. > >> Furthermore, it was argued that atomic model does not fit the buffered > >> I/O model because data sitting in the page cache is vulnerable to > >> modification before writeback occurs, and writeback does not preserve > >> application ordering[8]. > >> - Dave Chinner has proposed leveraging the filesystem's CoW path > >> where we always allocate new blocks for the atomic write (forced > >> CoW). If the hardware supports it (e.g., NVMe atomic limits), the > >> filesystem can optimize the writeback to use REQ_ATOMIC in place, > >> avoiding the CoW overhead while maintaining the architectural > >> separation. > > > > Right, this is what I'm doing in the new RFC where we maintain the > > mappings for atomic write in COW fork. This way we are able to utilize a > > lot of existing infrastructure, however it does add some complexity to > > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe > > it is a tradeoff since the general consesus was mostly to avoid adding > > too much complexity to iomap layer. > > > > Another thing that came up is to consider using write through semantics > > for buffered atomic writes, where we are able to transition page to > > writeback state immediately after the write and avoid any other users to > > modify the data till writeback completes. This might affect performance > > since we won't be able to batch similar atomic IOs but maybe > > applications like postgres would not mind this too much. If we go with > > this approach, we will be able to avoid worrying too much about other > > users changing atomic data underneath us. > > > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB > pages based on `io_combine_limit` (typically 128kb). So immediately writing them > might be ok as long as we don't remove those pages from the page cache like we do in > RWF_UNCACHED. 
Yep, and Ive not looked at the code path much but I think if we really care about the user not changing the data b/w write and writeback then we will probably need to start the writeback while holding the folio lock, which is currently not done in RWF_UNCACHED. > > > > An argument against this however is that it is user's responsibility to > > not do non atomic IO over an atomic range and this shall be considered a > > userspace usage error. This is similar to how there are ways users can > > tear a dio if they perform overlapping writes. [1]. > > > > That being said, I think these points are worth discussing and it would > > be helpful to have people from postgres around while discussing these > > semantics with the FS community members. > > > > As for ordering of writes, I'm not sure if that is something that > > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly > > been the task of userspace via fsync() and friends. > > > > Agreed. > > > > > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > > > >> - Discussion: While the CoW approach fits XFS and other CoW > >> filesystems well, it presents challenges for filesystems like ext4 > >> which lack CoW capabilities for data. Should this be a filesystem > >> specific feature? > > > > I believe your question is if we should have a hard dependency on COW > > mappings for atomic writes. Currently, COW in atomic write context in > > XFS, is used for these 2 things: > > > > 1. COW fork holds atomic write ranges. > > > > This is not strictly a COW feature, just that we are repurposing the COW > > fork to hold our atomic ranges. Basically a way for writeback path to > > know that atomic write was done here. > > > > COW fork is one way to do this but I believe every FS has a version of > > in memory extent trees where such ephemeral atomic write mappings can be > > held. The extent status cache is ext4's version of this, and can be used > > to manage the atomic write ranges. > > > > There is an alternate suggestion that came up from discussions with Ted > > and Darrick that we can instead use a generic side-car structure which > > holds atomic write ranges. FSes can populate these during atomic writes > > and query these in their writeback paths. > > > > This means for any FS operation (think truncate, falloc, mwrite, write > > ...) we would need to keep this structure in sync, which can become pretty > > complex pretty fast. I'm yet to implement this so not sure how it would > > look in practice though. > > > > 2. COW feature as a whole enables software based atomic writes. > > > > This is something that ext4 won't be able to support (right now), just > > like how we don't support software writes for dio. > > > > I believe Baokun and Yi and working on a feature that can eventually > > enable COW writes in ext4 [2]. Till we have something like that, we > > would have to rely on hardware support. > > > > Regardless, I don't think the ability to support or not support > > software atomic writes largely depends on the filesystem so I'm not > > sure how we can lift this up to a generic layer anyways. > > > > [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/ > > > > Thanks for the explanation. I am also planning to take a shot at the CoW approach. I would > be more than happy to review and test if you send a RFC in the meantime. Thanks Pankaj, I'm testing the current RFC internally. 
I think I'll have something in coming weeks and we can go over the design and how it looks etc. Regards, ojaswin > > -- > Pankaj > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 17:20 ` Ojaswin Mujoo @ 2026-02-18 17:42 ` Jan Kara 2026-02-18 20:22 ` Ojaswin Mujoo 0 siblings, 1 reply; 38+ messages in thread From: Jan Kara @ 2026-02-18 17:42 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote: > On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote: > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB > > pages based on `io_combine_limit` (typically 128kb). So immediately writing them > > might be ok as long as we don't remove those pages from the page cache like we do in > > RWF_UNCACHED. > > Yep, and Ive not looked at the code path much but I think if we really > care about the user not changing the data b/w write and writeback then > we will probably need to start the writeback while holding the folio > lock, which is currently not done in RWF_UNCACHED. That isn't enough. submit_bio() returning isn't enough to guarantee that DMA to the device has happened. And until it happens, modifying the pagecache page means modifying the data the disk will get. The best is probably to transition pages to writeback state and deal with it as with any other requirement for stable pages. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 17:42 ` [Lsf-pc] " Jan Kara @ 2026-02-18 20:22 ` Ojaswin Mujoo 0 siblings, 0 replies; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-18 20:22 UTC (permalink / raw) To: Jan Kara Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 06:42:05PM +0100, Jan Kara wrote: > On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote: > > On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote: > > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB > > > pages based on `io_combine_limit` (typically 128kb). So immediately writing them > > > might be ok as long as we don't remove those pages from the page cache like we do in > > > RWF_UNCACHED. > > > > Yep, and Ive not looked at the code path much but I think if we really > > care about the user not changing the data b/w write and writeback then > > we will probably need to start the writeback while holding the folio > > lock, which is currently not done in RWF_UNCACHED. > > That isn't enough. submit_bio() returning isn't enough to guaranteed DMA > to the device has happened. And until it happens, modifying the pagecache > page means modifying the data the disk will get. The best is probably to > transition pages to writeback state and deal with it as with any other > requirement for stable pages. Yes true, looking at the code, it does seem like we would also need to depend on the stable page mechanism to ensure nobody changes the buffers till the IO has actually finished. I think the right way to go would be to first start with an implementation of RWF_WRITETHROUGH and then utilize that and stable pages to enable RWF_ATOMIC for buffered IO. Regards, ojaswin > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 13:32 ` Ojaswin Mujoo 2026-02-16 9:52 ` Pankaj Raghav @ 2026-02-16 11:38 ` Jan Kara 2026-02-16 13:18 ` Pankaj Raghav ` (2 more replies) 1 sibling, 3 replies; 38+ messages in thread From: Jan Kara @ 2026-02-16 11:38 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi! On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > Another thing that came up is to consider using write through semantics > for buffered atomic writes, where we are able to transition page to > writeback state immediately after the write and avoid any other users to > modify the data till writeback completes. This might affect performance > since we won't be able to batch similar atomic IOs but maybe > applications like postgres would not mind this too much. If we go with > this approach, we will be able to avoid worrying too much about other > users changing atomic data underneath us. > > An argument against this however is that it is user's responsibility to > not do non atomic IO over an atomic range and this shall be considered a > userspace usage error. This is similar to how there are ways users can > tear a dio if they perform overlapping writes. [1]. Yes, I was wondering whether the write-through semantics would make sense as well. Intuitively it should make things simpler because you could practically reuse the atomic DIO write path. Only that you'd first copy data into the page cache and issue dio write from those folios. No need for special tracking of which folios actually belong together in atomic write, no need for cluttering standard folio writeback path, in case atomic write cannot happen (e.g. because you cannot allocate appropriately aligned blocks) you get the error back right away, ... Of course this all depends on whether such semantics would be actually useful for users such as PostgreSQL. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 11:38 ` Jan Kara @ 2026-02-16 13:18 ` Pankaj Raghav 2026-02-17 18:36 ` Ojaswin Mujoo 2026-02-16 15:57 ` Andres Freund 2026-02-17 18:39 ` Ojaswin Mujoo 2 siblings, 1 reply; 38+ messages in thread From: Pankaj Raghav @ 2026-02-16 13:18 UTC (permalink / raw) To: Jan Kara, Ojaswin Mujoo Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On 2/16/2026 12:38 PM, Jan Kara wrote: > Hi! > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: >> Another thing that came up is to consider using write through semantics >> for buffered atomic writes, where we are able to transition page to >> writeback state immediately after the write and avoid any other users to >> modify the data till writeback completes. This might affect performance >> since we won't be able to batch similar atomic IOs but maybe >> applications like postgres would not mind this too much. If we go with >> this approach, we will be able to avoid worrying too much about other >> users changing atomic data underneath us. >> >> An argument against this however is that it is user's responsibility to >> not do non atomic IO over an atomic range and this shall be considered a >> userspace usage error. This is similar to how there are ways users can >> tear a dio if they perform overlapping writes. [1]. > > Yes, I was wondering whether the write-through semantics would make sense > as well. Intuitively it should make things simpler because you could > practially reuse the atomic DIO write path. Only that you'd first copy > data into the page cache and issue dio write from those folios. No need for > special tracking of which folios actually belong together in atomic write, > no need for cluttering standard folio writeback path, in case atomic write > cannot happen (e.g. because you cannot allocate appropriately aligned > blocks) you get the error back rightaway, ... > > Of course this all depends on whether such semantics would be actually > useful for users such as PostgreSQL. One issue might be the performance, especially if the atomic max unit is in the smaller end such as 16k or 32k (which is fairly common). But it will avoid the overlapping writes issue and can easily leverage the direct IO path. But one thing that postgres really cares about is the integrity of a database block. So if there is an IO that is a multiple of an atomic write unit (one atomic unit encapsulates the whole DB page), it is not a problem if tearing happens on the atomic boundaries. This fits very well with what NVMe calls Multiple Atomicity Mode (MAM) [1]. We don't have any semantics for MaM at the moment but that could increase the performance as we can do larger IOs but still get the atomic guarantees certain applications care about. [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf ^ permalink raw reply [flat|nested] 38+ messages in thread
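To make the "tearing only on atomic boundaries" point concrete, here is what an application has to do today without a multiple-atomicity mode: split a combined IO back into per-page untorn writes. A sketch under the assumption that buffered RWF_ATOMIC support lands as proposed; the page size, combine limit and locally defined flag value are illustrative:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* same value as include/uapi/linux/fs.h */
#endif

#define DB_PAGE 8192            /* illustrative database page size     */
#define COMBINE (128 * 1024)    /* illustrative io_combine_limit value */

/* Issue one combined IO's worth of data as per-page untorn writes so
 * that tearing can only ever happen on DB_PAGE boundaries. */
static ssize_t write_combined_untorn(int fd, const char *buf, off_t off)
{
    ssize_t done = 0;

    for (size_t i = 0; i < COMBINE; i += DB_PAGE) {
        struct iovec iov = {
            .iov_base = (void *)(buf + i),
            .iov_len  = DB_PAGE,
        };

        if (pwritev2(fd, &iov, 1, off + i, RWF_ATOMIC) != DB_PAGE)
            return done ? done : -1;
        done += DB_PAGE;
    }
    return done;
}

A multiple-atomicity mode would collapse the loop into a single 128k write that is still only permitted to tear on the 8k boundaries.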
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 13:18 ` Pankaj Raghav @ 2026-02-17 18:36 ` Ojaswin Mujoo 0 siblings, 0 replies; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 18:36 UTC (permalink / raw) To: Pankaj Raghav Cc: Jan Kara, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon, Feb 16, 2026 at 02:18:10PM +0100, Pankaj Raghav wrote: > > > On 2/16/2026 12:38 PM, Jan Kara wrote: > > Hi! > > > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > > Another thing that came up is to consider using write through semantics > > > for buffered atomic writes, where we are able to transition page to > > > writeback state immediately after the write and avoid any other users to > > > modify the data till writeback completes. This might affect performance > > > since we won't be able to batch similar atomic IOs but maybe > > > applications like postgres would not mind this too much. If we go with > > > this approach, we will be able to avoid worrying too much about other > > > users changing atomic data underneath us. > > > > > > An argument against this however is that it is user's responsibility to > > > not do non atomic IO over an atomic range and this shall be considered a > > > userspace usage error. This is similar to how there are ways users can > > > tear a dio if they perform overlapping writes. [1]. > > > > Yes, I was wondering whether the write-through semantics would make sense > > as well. Intuitively it should make things simpler because you could > > practially reuse the atomic DIO write path. Only that you'd first copy > > data into the page cache and issue dio write from those folios. No need for > > special tracking of which folios actually belong together in atomic write, > > no need for cluttering standard folio writeback path, in case atomic write > > cannot happen (e.g. because you cannot allocate appropriately aligned > > blocks) you get the error back rightaway, ... > > > > Of course this all depends on whether such semantics would be actually > > useful for users such as PostgreSQL. > > One issue might be the performance, especially if the atomic max unit is in > the smaller end such as 16k or 32k (which is fairly common). But it will > avoid the overlapping writes issue and can easily leverage the direct IO > path. > > But one thing that postgres really cares about is the integrity of a > database block. So if there is an IO that is a multiple of an atomic write > unit (one atomic unit encapsulates the whole DB page), it is not a problem > if tearing happens on the atomic boundaries. This fits very well with what > NVMe calls Multiple Atomicity Mode (MAM) [1]. > > We don't have any semantics for MaM at the moment but that could increase > the performance as we can do larger IOs but still get the atomic guarantees > certain applications care about. Interesting, I think very very early dio implementations did use something of this sort where (awu_max = 4k) an atomic write of 16k would result in 4 x 4k atomic writes. I don't remember why it was shot down though :D Regards, ojaswin > > > [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 11:38 ` Jan Kara 2026-02-16 13:18 ` Pankaj Raghav @ 2026-02-16 15:57 ` Andres Freund 2026-02-17 18:39 ` Ojaswin Mujoo 2 siblings, 0 replies; 38+ messages in thread From: Andres Freund @ 2026-02-16 15:57 UTC (permalink / raw) To: Jan Kara Cc: Ojaswin Mujoo, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-16 12:38:59 +0100, Jan Kara wrote: > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > Another thing that came up is to consider using write through semantics > > for buffered atomic writes, where we are able to transition page to > > writeback state immediately after the write and avoid any other users to > > modify the data till writeback completes. This might affect performance > > since we won't be able to batch similar atomic IOs but maybe > > applications like postgres would not mind this too much. If we go with > > this approach, we will be able to avoid worrying too much about other > > users changing atomic data underneath us. > > > > An argument against this however is that it is user's responsibility to > > not do non atomic IO over an atomic range and this shall be considered a > > userspace usage error. This is similar to how there are ways users can > > tear a dio if they perform overlapping writes. [1]. > > Yes, I was wondering whether the write-through semantics would make sense > as well. As outlined in https://lore.kernel.org/all/zzvybbfy6bcxnkt4cfzruhdyy6jsvnuvtjkebdeqwkm6nfpgij@dlps7ucza22s/ that is something that would be useful for postgres even orthogonally to atomic writes. If this were the path to go with, I'd suggest adding an RWF_WRITETHROUGH and requiring it to be set when using RWF_ATOMIC on a buffered write. That way, if the kernel were to eventually support buffered atomic writes without immediate writeback, the semantics to userspace wouldn't suddenly change. > Intuitively it should make things simpler because you could > practially reuse the atomic DIO write path. Only that you'd first copy > data into the page cache and issue dio write from those folios. No need for > special tracking of which folios actually belong together in atomic write, > no need for cluttering standard folio writeback path, in case atomic write > cannot happen (e.g. because you cannot allocate appropriately aligned > blocks) you get the error back rightaway, ... > > Of course this all depends on whether such semantics would be actually > useful for users such as PostgreSQL. I think it would be useful for many workloads. As noted in the linked message, there are some workloads where I am not sure how the gains/costs would balance out (with a small PG buffer pool in a write heavy workload, we'd lose the ability to have the kernel avoid redundant writes). It's possible that we could develop some heuristics to fall back to doing our own torn-page avoidance in such cases, although it's not immediately obvious to me what that heuristic would be. It's also not that common a workload, it's *much* more common to have a read heavy workload that has to overflow into the kernel page cache, due to not being able to dedicate sufficient memory to postgres. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
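A sketch of the calling convention proposed above. RWF_ATOMIC is real (currently only honoured for direct I/O), but RWF_WRITETHROUGH is only a proposal in this thread, so its name and the value used below are placeholders:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040        /* same value as include/uapi/linux/fs.h */
#endif

/* Hypothetical flag: does not exist in any uapi header today; the value
 * is a placeholder purely for illustration. */
#define RWF_WRITETHROUGH 0x00008000

/* Buffered untorn write that also requests write-through, so the data is
 * stable (on its way to the device) by the time the call returns. */
static ssize_t atomic_writethrough(int fd, const struct iovec *iov, off_t off)
{
    ssize_t ret = pwritev2(fd, iov, 1, off, RWF_ATOMIC | RWF_WRITETHROUGH);

    if (ret < 0)
        perror("pwritev2(RWF_ATOMIC|RWF_WRITETHROUGH)");
    return ret;
}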
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 11:38 ` Jan Kara 2026-02-16 13:18 ` Pankaj Raghav 2026-02-16 15:57 ` Andres Freund @ 2026-02-17 18:39 ` Ojaswin Mujoo 2026-02-18 0:26 ` Dave Chinner 2 siblings, 1 reply; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 18:39 UTC (permalink / raw) To: Jan Kara Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote: > Hi! > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > Another thing that came up is to consider using write through semantics > > for buffered atomic writes, where we are able to transition page to > > writeback state immediately after the write and avoid any other users to > > modify the data till writeback completes. This might affect performance > > since we won't be able to batch similar atomic IOs but maybe > > applications like postgres would not mind this too much. If we go with > > this approach, we will be able to avoid worrying too much about other > > users changing atomic data underneath us. > > > > An argument against this however is that it is user's responsibility to > > not do non atomic IO over an atomic range and this shall be considered a > > userspace usage error. This is similar to how there are ways users can > > tear a dio if they perform overlapping writes. [1]. > > Yes, I was wondering whether the write-through semantics would make sense > as well. Intuitively it should make things simpler because you could > practially reuse the atomic DIO write path. Only that you'd first copy > data into the page cache and issue dio write from those folios. No need for > special tracking of which folios actually belong together in atomic write, > no need for cluttering standard folio writeback path, in case atomic write > cannot happen (e.g. because you cannot allocate appropriately aligned > blocks) you get the error back rightaway, ... This is an interesting idea Jan and also saves a lot of tracking of atomic extents etc. I'm unsure how much of a performance impact it'd have though but I'll look into this Regards, ojaswin > > Of course this all depends on whether such semantics would be actually > useful for users such as PostgreSQL. > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 18:39 ` Ojaswin Mujoo @ 2026-02-18 0:26 ` Dave Chinner 2026-02-18 6:49 ` Christoph Hellwig 2026-02-18 12:54 ` Ojaswin Mujoo 0 siblings, 2 replies; 38+ messages in thread From: Dave Chinner @ 2026-02-18 0:26 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote: > On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote: > > Hi! > > > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > > Another thing that came up is to consider using write through semantics > > > for buffered atomic writes, where we are able to transition page to > > > writeback state immediately after the write and avoid any other users to > > > modify the data till writeback completes. This might affect performance > > > since we won't be able to batch similar atomic IOs but maybe > > > applications like postgres would not mind this too much. If we go with > > > this approach, we will be able to avoid worrying too much about other > > > users changing atomic data underneath us. > > > > > > An argument against this however is that it is user's responsibility to > > > not do non atomic IO over an atomic range and this shall be considered a > > > userspace usage error. This is similar to how there are ways users can > > > tear a dio if they perform overlapping writes. [1]. > > > > Yes, I was wondering whether the write-through semantics would make sense > > as well. Intuitively it should make things simpler because you could > > practially reuse the atomic DIO write path. Only that you'd first copy > > data into the page cache and issue dio write from those folios. No need for > > special tracking of which folios actually belong together in atomic write, > > no need for cluttering standard folio writeback path, in case atomic write > > cannot happen (e.g. because you cannot allocate appropriately aligned > > blocks) you get the error back rightaway, ... > > This is an interesting idea Jan and also saves a lot of tracking of > atomic extents etc. ISTR mentioning that we should be doing exactly this (grab page cache pages, fill them and submit them through the DIO path) for O_DSYNC buffered writethrough IO a long time again. The context was optimising buffered O_DSYNC to use the FUA optimisations in the iomap DIO write path. I suggested it again when discussing how RWF_DONTCACHE should be implemented, because the async DIO write completion path invalidates the page cache over the IO range. i.e. it would avoid the need to use folio flags to track pages that needed invalidation at IO completion... I have a vague recollection of mentioning this early in the buffered RWF_ATOMIC discussions, too, though that may have just been the voices in my head. Regardless, we are here again with proposals for RWF_ATOMIC and RWF_WRITETHROUGH and a suggestion that maybe we should vector buffered writethrough via the DIO path..... Perhaps it's time to do this? FWIW, the other thing that write-through via the DIO path enables is true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes block waiting on IO completion through generic_sync_write() -> vfs_fsync_range(), even when issued through AIO paths. 
Vectoring it through the DIO path avoids the blocking fsync path in IO submission as it runs in the async DIO completion path if it is needed.... -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 38+ messages in thread
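As background for Dave's point, a minimal sketch of what a buffered O_DSYNC-style write looks like from userspace today; the comments paraphrase the submission-time blocking described above, and nothing here is a new interface.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Current behaviour: a buffered RWF_DSYNC write only returns once the
 * data has been copied into the page cache *and* the flush needed for
 * data integrity has completed, because the sync runs from the write
 * syscall itself. Write-through via the DIO path would instead let that
 * integrity work ride on the async IO completion.
 */
ssize_t dsync_buffered_write(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    /* Blocks here today, even when the caller sits on an AIO
     * submission path, as noted above. */
    return pwritev2(fd, &iov, 1, off, RWF_DSYNC);
}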
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 0:26 ` Dave Chinner @ 2026-02-18 6:49 ` Christoph Hellwig 2026-02-18 12:54 ` Ojaswin Mujoo 1 sibling, 0 replies; 38+ messages in thread From: Christoph Hellwig @ 2026-02-18 6:49 UTC (permalink / raw) To: Dave Chinner Cc: Ojaswin Mujoo, Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote: > ISTR mentioning that we should be doing exactly this (grab page > cache pages, fill them and submit them through the DIO path) for > O_DSYNC buffered writethrough IO a long time again. Yes, multiple times. And I did a few more times since then. > Regardless, we are here again with proposals for RWF_ATOMIC and > RWF_WRITETHROUGH and a suggestion that maybe we should vector > buffered writethrough via the DIO path..... > > Perhaps it's time to do this? Yes. > FWIW, the other thing that write-through via the DIO path enables is > true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes > block waiting on IO completion through generic_sync_write() -> > vfs_fsync_range(), even when issued through AIO paths. Vectoring it > through the DIO path avoids the blocking fsync path in IO submission > as it runs in the async DIO completion path if it is needed.... It's only true if we can do the page cache updates non-blocking, but in many cases that should indeed be possible. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 0:26 ` Dave Chinner 2026-02-18 6:49 ` Christoph Hellwig @ 2026-02-18 12:54 ` Ojaswin Mujoo 1 sibling, 0 replies; 38+ messages in thread From: Ojaswin Mujoo @ 2026-02-18 12:54 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote: > On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote: > > On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote: > > > Hi! > > > > > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > > > Another thing that came up is to consider using write through semantics > > > > for buffered atomic writes, where we are able to transition page to > > > > writeback state immediately after the write and avoid any other users to > > > > modify the data till writeback completes. This might affect performance > > > > since we won't be able to batch similar atomic IOs but maybe > > > > applications like postgres would not mind this too much. If we go with > > > > this approach, we will be able to avoid worrying too much about other > > > > users changing atomic data underneath us. > > > > > > > > An argument against this however is that it is user's responsibility to > > > > not do non atomic IO over an atomic range and this shall be considered a > > > > userspace usage error. This is similar to how there are ways users can > > > > tear a dio if they perform overlapping writes. [1]. > > > > > > Yes, I was wondering whether the write-through semantics would make sense > > > as well. Intuitively it should make things simpler because you could > > > practially reuse the atomic DIO write path. Only that you'd first copy > > > data into the page cache and issue dio write from those folios. No need for > > > special tracking of which folios actually belong together in atomic write, > > > no need for cluttering standard folio writeback path, in case atomic write > > > cannot happen (e.g. because you cannot allocate appropriately aligned > > > blocks) you get the error back rightaway, ... > > > > This is an interesting idea Jan and also saves a lot of tracking of > > atomic extents etc. > > ISTR mentioning that we should be doing exactly this (grab page > cache pages, fill them and submit them through the DIO path) for > O_DSYNC buffered writethrough IO a long time again. The context was > optimising buffered O_DSYNC to use the FUA optimisations in the > iomap DIO write path. > > I suggested it again when discussing how RWF_DONTCACHE should be > implemented, because the async DIO write completion path invalidates > the page cache over the IO range. i.e. it would avoid the need to > use folio flags to track pages that needed invalidation at IO > completion... > > I have a vague recollection of mentioning this early in the buffered > RWF_ATOMIC discussions, too, though that may have just been the > voices in my head. Hi Dave, Yes we did discuss this [1] :) We also discussed the alternative of using the COW fork path for atomic writes [2]. Since at that point I was not completely sure if the writethrough would become too restrictive of an approach, I was working on a COW fork implementation. However, from the discussion here as well as Andres' comments, it seems like write through might not be too bad for postgres. 
> > Regardless, we are here again with proposals for RWF_ATOMIC and > RWF_WRITETHROUGH and a suggestion that maybe we should vector > buffered writethrough via the DIO path..... > > Perhaps it's time to do this? I agree that it makes more sense to do writethrough if we want to have the strict old-or-new semantics (as opposed to just untorn IO semantics). I'll work on a POC for this approach of doing atomic writes; I'll mostly try to base it off your suggestions in [1]. FWIW, I do have a somewhat working (although untested and possibly broken in some places) POC for performing atomic writes via the XFS COW fork based on suggestions from Dave [2]. Even though we want to explore the writethrough approach, I'd just share it here in case anyone is interested in what the design looks like: https://github.com/OjaswinM/linux/commits/iomap-buffered-atomic-rfc2.3/ (If anyone prefers for me to send this as a patchset on the mailing list, let me know) Regards, ojaswin [1] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/ [2] https://lore.kernel.org/linux-fsdevel/aRuKz4F3xATf8IUp@dread.disaster.area/ > > FWIW, the other thing that write-through via the DIO path enables is > true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes > block waiting on IO completion through generic_sync_write() -> > vfs_fsync_range(), even when issued through AIO paths. Vectoring it > through the DIO path avoids the blocking fsync path in IO submission > as it runs in the async DIO completion path if it is needed.... > > -Dave. > -- > Dave Chinner > dgc@kernel.org ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav 2026-02-13 13:32 ` Ojaswin Mujoo @ 2026-02-15 9:01 ` Amir Goldstein 2026-02-17 5:51 ` Christoph Hellwig 2026-02-20 10:08 ` Pankaj Raghav (Samsung) 3 siblings, 0 replies; 38+ messages in thread From: Amir Goldstein @ 2026-02-15 9:01 UTC (permalink / raw) To: Pankaj Raghav, Andres Freund Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > Hi all, > > Atomic (untorn) writes for Direct I/O have successfully landed in kernel > for ext4 and XFS[1][2]. However, extending this support to Buffered I/O > remains a contentious topic, with previous discussions often stalling due to > concerns about complexity versus utility. > > I would like to propose a session to discuss the concrete use cases for > buffered atomic writes and if possible, talk about the outstanding > architectural blockers blocking the current RFCs[3][4]. > > ## Use Case: > > A recurring objection to buffered atomics is the lack of a convincing use > case, with the argument that databases should simply migrate to direct I/O. > We have been working with PostgreSQL developer Andres Freund, who has > highlighted a specific architectural requirement where buffered I/O remains > preferable in certain scenarios. > > While Postgres recently started to support direct I/O, optimal performance > requires a large, statically configured user-space buffer pool. This becomes > problematic when running many Postgres instances on the same hardware, a > common deployment scenario. Statically partitioning RAM for direct I/O > caches across many instances is inefficient compared to allowing the kernel > page cache to dynamically balance memory pressure between instances. > > The other use case is using postgres as part of a larger workload on one > instance. Using up enough memory for postgres' buffer pool to make DIO use > viable is often not realistic, because some deployments require a lot of > memory to cache database IO, while others need a lot of memory for > non-database caching. > > Enabling atomic writes for this buffered workload would allow Postgres to > disable full-page writes [5]. For direct I/O, this has shown to reduce > transaction variability; for buffered I/O, we expect similar gains, > alongside decreased WAL bandwidth and storage costs for WAL archival. As a > side note, for most workloads full page writes occupy a significant portion > of WAL volume. > > Andres has agreed to attend LSFMM this year to discuss these requirements. > Andres, If you wish to attend LSFMM, please request an invite via the Google form: https://forms.gle/hUgiEksr8CA1migCA Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav 2026-02-13 13:32 ` Ojaswin Mujoo 2026-02-15 9:01 ` Amir Goldstein @ 2026-02-17 5:51 ` Christoph Hellwig 2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein 2026-02-20 10:08 ` Pankaj Raghav (Samsung) 3 siblings, 1 reply; 38+ messages in thread From: Christoph Hellwig @ 2026-02-17 5:51 UTC (permalink / raw) To: Pankaj Raghav Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah I think a better session would be how we can help postgres to move off buffered I/O instead of adding more special cases for them. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 5:51 ` Christoph Hellwig @ 2026-02-17 9:23 ` Amir Goldstein 2026-02-17 15:47 ` Andres Freund 2026-02-18 6:51 ` Christoph Hellwig 0 siblings, 2 replies; 38+ messages in thread From: Amir Goldstein @ 2026-02-17 9:23 UTC (permalink / raw) To: Christoph Hellwig Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > I think a better session would be how we can help postgres to move > off buffered I/O instead of adding more special cases for them. Respectfully, I disagree that DIO is the only possible solution. Direct I/O is a legit solution for databases and so is buffered I/O each with their own caveats. Specifically, when two subsystems (kernel vfs and db) each require a huge amount of cache memory for best performance, setting them up to play nicely together to utilize system memory in an optimal way is a huge pain. Thanks, Amir. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein @ 2026-02-17 15:47 ` Andres Freund 2026-02-17 22:45 ` Dave Chinner 2026-02-18 6:53 ` Christoph Hellwig 2026-02-18 6:51 ` Christoph Hellwig 1 sibling, 2 replies; 38+ messages in thread From: Andres Freund @ 2026-02-17 15:47 UTC (permalink / raw) To: Amir Goldstein Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote: > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > I think a better session would be how we can help postgres to move > > off buffered I/O instead of adding more special cases for them. FWIW, we are adding support for DIO (it's been added, but performance isn't competitive for most workloads in the released versions yet, work to address those issues is in progress). But it's only really viable for larger setups, not for e.g.: - smaller, unattended setups - uses of postgres as part of a larger application on one server with hard to predict memory usage of different components - intentionally overcommitted shared hosting type scenarios Even once a well configured postgres using DIO beats postgres not using DIO, I'll bet that well over 50% of users won't be able to use DIO. There are some kernel issues that make it harder than necessary to use DIO, btw: Most prominently: With DIO concurrently extending multiple files leads to quite terrible fragmentation, at least with XFS. Forcing us to over-aggressively use fallocate(), truncating later if it turns out we need less space. The fallocate in turn triggers slowness in the write paths, as writing to uninitialized extents is a metadata operation. It'd be great if the allocation behaviour with concurrent file extension could be improved and if we could have a fallocate mode that forces extents to be initialized. A secondary issue is that with the buffer pool sizes necessary for DIO use on bigger systems, creating the anonymous memory mapping becomes painfully slow if we use MAP_POPULATE - which we kinda need to do, as otherwise performance is very inconsistent initially (often iomap -> gup -> handle_mm_fault -> folio_zero_user uses the majority of the CPU). We've been experimenting with not using MAP_POPULATE and using multiple threads to populate the mapping in parallel, but that doesn't feel like something that userspace ought to have to do. It's easier for us to work around than the uninitialized extent conversion issue, but it still is something we IMO shouldn't have to do. > Respectfully, I disagree that DIO is the only possible solution. > Direct I/O is a legit solution for databases and so is buffered I/O > each with their own caveats. > Specifically, when two subsystems (kernel vfs and db) each require a huge > amount of cache memory for best performance, setting them up to play nicely > together to utilize system memory in an optimal way is a huge pain. Yep. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
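The parallel population workaround Andres mentions can be spelled out with MADV_POPULATE_WRITE (Linux 5.14+). Whether PostgreSQL's experiments use madvise() or plain memory writes is an assumption; the pool size and thread count below are placeholders, and the program needs -pthread to build.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

/* Mirrors linux/mman.h for older libc headers (Linux 5.14+). */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

#define NTHREADS 16

struct slice {
    char *addr;
    size_t len;
};

static void *populate(void *arg)
{
    struct slice *s = arg;

    /* Pre-fault (and zero) this slice without touching it byte by byte. */
    if (madvise(s->addr, s->len, MADV_POPULATE_WRITE))
        perror("madvise(MADV_POPULATE_WRITE)");
    return NULL;
}

int main(void)
{
    const size_t pool_size = 8UL << 30;  /* stand-in for a large buffer pool */
    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];

    /* No MAP_POPULATE, so mmap() itself returns immediately. */
    char *pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    size_t chunk = pool_size / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {
        s[i] = (struct slice){ pool + (size_t)i * chunk, chunk };
        pthread_create(&tid[i], NULL, populate, &s[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}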
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 15:47 ` Andres Freund @ 2026-02-17 22:45 ` Dave Chinner 2026-02-18 4:10 ` Andres Freund 2026-02-18 6:53 ` Christoph Hellwig 1 sibling, 1 reply; 38+ messages in thread From: Dave Chinner @ 2026-02-17 22:45 UTC (permalink / raw) To: Andres Freund Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote: > Hi, > > On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote: > > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > I think a better session would be how we can help postgres to move > > > off buffered I/O instead of adding more special cases for them. > > FWIW, we are adding support for DIO (it's been added, but performance isn't > competitive for most workloads in the released versions yet, work to address > those issues is in progress). > > But it's only really be viable for larger setups, not for e.g.: > - smaller, unattended setups > - uses of postgres as part of a larger application on one server with hard to > predict memory usage of different components > - intentionally overcommitted shared hosting type scenarios > > Even once a well configured postgres using DIO beats postgres not using DIO, > I'll bet that well over 50% of users won't be able to use DIO. > > > There are some kernel issues that make it harder than necessary to use DIO, > btw: > > Most prominently: With DIO concurrently extending multiple files leads to > quite terrible fragmentation, at least with XFS. Forcing us to > over-aggressively use fallocate(), truncating later if it turns out we need > less space. <ahem> seriously, fallocate() is considered harmful for exactly these sorts of reasons. XFS has vastly better mechanisms built into it that mitigate worst case fragmentation without needing to change applications or increase runtime overhead. So, lets go way back - 32 years ago to 1994: commit 32766d4d387bc6779e0c432fb56a0cc4e6b96398 Author: Doug Doucette <doucette@engr.sgi.com> Date: Thu Mar 3 22:17:15 1994 +0000 Add fcntl implementation (F_FSGETXATTR, F_FSSETXATTR, and F_DIOINFO). Fix xfs_setattr new xfs fields' implementation to split out error checking to the front of the routine, like the other attributes. Don't set new fields in xfs_getattr unless one of the fields is requested. ..... + case F_FSSETXATTR: { + struct fsxattr fa; + vattr_t va; + + if (copyin(arg, &fa, sizeof(fa))) { + error = EFAULT; + break; + } + va.va_xflags = fa.fsx_xflags; + va.va_extsize = fa.fsx_extsize; ^^^^^^^^^^^^^^^ + error = xfs_setattr(vp, &va, AT_XFLAGS|AT_EXTSIZE, credp); + break; + } This was the commit that added user controlled extent size hints to XFS. These already existed in EFS, so applications using this functionality go back to the even earlier in the 1990s. So, let's set the extent size hint on a file to 1MB. Now whenever a data extent allocation on that file is attempted, the extent size that is allocated will be rounded up to the nearest 1MB. i.e. XFS will try to allocate unwritten extents in aligned multiples of the extent size hint regardless of the actual IO size being performed. Hence if you are doing concurrent extending 8kB writes, instead of allocating 8kB at a time, the extent size hint will force a 1MB unwritten extent to be allocated out beyond EOF. 
The subsequent extending 8kB writes to that file now hit that unwritten extent, and only need to convert it to written. The same will happen for all other concurrent extending writes - they will allocate in 1MB chunks, not 8KB. The result will be that the files will interleave 1MB sized extents across files instead of 8kB sized extents. i.e. we've just reduced the worst case fragmentation behaviour by a factor of 128. We've also reduced allocation overhead by a factor of 128, so the use of extent size hints results in the filesystem behaving in a far more efficient way and hence this results in higher performance. IOWs, the extent size hint effectively sets a minimum extent size that the filesystem will create for a given file, thereby mitigating the worst case fragmentation that can occur. However, the use of fallocate() in the application explicitly prevents the filesystem from doing this smart, transparent IO path thing to mitigate fragmentation. One of the most important properties of extent size hints is that they can be dynamically tuned *without changing the application.* The extent size hint is a property of the inode, and it can be set by the admin through various XFS tools (e.g. mkfs.xfs for a filesystem wide default, xfs_io to set it on a directory so all new files/dirs created in that directory inherit the value, set it on individual files, etc). It can be changed even whilst the file is in active use by the application. Hence the extent size hint it can be changed at any time, and you can apply it immediately to existing installations as an active mitigation. Doing this won't fix existing fragmentation (that's what xfs_fsr is for), but it will instantly mitigate/prevent new fragmentation from occurring. It's much more difficult to do this with applications that use fallocate()... Indeed, the case for using fallocate() instead of extent size hints gets worse the more you look at how extent size hints work. Extent size hints don't impact IO concurrency at all. Extent size hints are only applied during extent allocation, so the optimisation is applied naturally as part of the existing concurrent IO path. Hence using extent size hints won't block/stall/prevent concurrent async IO in any way. fallocate(), OTOH, causes a full IO pipeline stall (blocks submission of both reads and writes, then waits for all IO in flight to drain) on that file for the duration of the syscall. You can't do any sort of IO (async or otherwise) and run fallocate() at the same time, so fallocate() really sucks from the POV of a high performance IO app. fallocate() also marks the files as having persistent preallocation, which means that when you close the file the filesystem does not remove excessive extents allocated beyond EOF. Hence the reported problems with excessive space usage and needing to truncate files manually (which also cause a complete IO stall on that file) are brought on specifically because fallocate() is being used by the application to manage worst case fragmentation. This problem does not exist with extent size hints - unused blocks beyond EOF will be trimmed on last close or when the inode is cycled out of cache, just like we do for excess speculative prealloc beyond EOF for buffered writes (the buffered IO fragmentation mitigation mechanism for interleaving concurrent extending writes). The administrator can easily optimise extent size hints to match the optimal characteristics of the underlying storage (e.g. set them to be RAID stripe aligned), etc. 
Fallocate() requires the application to provide tunables to modify it's behaviour for optimal storage layout, and depending on how the application uses fallocate(), this level of flexibility may not even be possible. And let's not forget that an fallocate() based mitigation that helps one filesystem type can actively hurt another type (e.g. ext4) by introducing an application level extent allocation boundary vector where there was none before. Hence, IMO, micromanaging filesystem extent allocation with fallocate() is -almost always- the wrong thing for applications to be doing. There is no one "right way" to use fallocate() - what is optimal for one filesystem will be pessimal for another, and it is impossible to code optimal behaviour in the application for all filesystem types the app might run on. > The fallocate in turn triggers slowness in the write paths, as > writing to uninitialized extents is a metadata operation. That is not the problem you think it is. XFS is using unwritten extents for all buffered IO writes that use delayed allocation, too, and I don't see you complaining about that.... Yes, the overhead of unwritten extent conversion is more visible with direct IO, but that's only because DIO has much lower overhead and much, much higher performance ceiling than buffered IO. That doesn't mean unwritten extents are a performance limiting factor... > It'd be great if > the allocation behaviour with concurrent file extension could be improved and > if we could have a fallocate mode that forces extents to be initialized. <sigh> You mean like FALLOC_FL_WRITE_ZEROES? That won't fix your fragmentation problem, and it has all the same pipeline stall problems as allocating unwritten extents in fallocate(). Only much worse now, because the IO pipeline is stalled for the entire time it takes to write the zeroes to persistent storage. i.e. long tail file access latencies will increase massively if you do this regularly to extend files. -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 38+ messages in thread
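For reference, the extent size hint Dave describes above can be applied with xfs_io (e.g. xfs_io -c 'extsize 1m' <file-or-dir>) or programmatically via the generic fsxattr ioctls from linux/fs.h. Below is a minimal sketch; the 1MB value simply mirrors Dave's example and is not a recommendation.

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Set an extent size hint (in bytes) on an open file. On a directory,
 * FS_XFLAG_EXTSZINHERIT would make newly created children inherit the
 * hint instead.
 */
int set_extsize_hint(int fd, unsigned int bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx)) {
        perror("FS_IOC_FSGETXATTR");
        return -1;
    }
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;  /* hint applies to this file */
    fsx.fsx_extsize = bytes;             /* e.g. 1MB = 1024 * 1024 */
    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx)) {
        perror("FS_IOC_FSSETXATTR");
        return -1;
    }
    return 0;
}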
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 22:45 ` Dave Chinner @ 2026-02-18 4:10 ` Andres Freund 0 siblings, 0 replies; 38+ messages in thread From: Andres Freund @ 2026-02-18 4:10 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-18 09:45:46 +1100, Dave Chinner wrote: > On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote: > > There are some kernel issues that make it harder than necessary to use DIO, > > btw: > > > > Most prominently: With DIO concurrently extending multiple files leads to > > quite terrible fragmentation, at least with XFS. Forcing us to > > over-aggressively use fallocate(), truncating later if it turns out we need > > less space. > > <ahem> > > seriously, fallocate() is considered harmful for exactly these sorts > of reasons. XFS has vastly better mechanisms built into it that > mitigate worst case fragmentation without needing to change > applications or increase runtime overhead. There's probably a misunderstanding here: We don't do fallocate to avoid fragmentation. We want to guarantee that there's space for data that is in our buffer pool, as otherwise it's very easy to get into a pickle: If there is dirty data in the buffer pool that can't be written out due to ENOSPC, the subsequent checkpoint can't complete. So the system may be stuck because you're not able to create more space for WAL / journaling, you can't free up old WAL due to the checkpoint not being able to complete, and if you react to that with a crash-recovery cycle you're likely to be unable to complete crash recovery because you'll just hit ENOSPC again. And yes, CoW filesystems make that less reliable, but it turns out to still save people often enough that I doubt we can get rid of it. To ensure there's space for the write out of our buffer pool, we have two choices: 1) write out zeroes 2) use fallocate Writing out zeroes that we will just overwrite later is obviously not a particularly good use of IO bandwidth, particularly on metered cloud "storage". But using fallocate() has fragmentation and unwritten-extent issues. Our compromise is that we use fallocate iff we enlarge the relation by a decent number of pages at once and write zeroes otherwise. Is that perfect? Hell no. But it's also not obvious what a better answer is with today's interfaces. If there were a "guarantee that N additional blocks are reserved, but not concretely allocated" interface, we'd gladly use it. > So, let's set the extent size hint on a file to 1MB. Now whenever a > data extent allocation on that file is attempted, the extent size > that is allocated will be rounded up to the nearest 1MB. i.e. XFS > will try to allocate unwritten extents in aligned multiples of the > extent size hint regardless of the actual IO size being performed. > > Hence if you are doing concurrent extending 8kB writes, instead of > allocating 8kB at a time, the extent size hint will force a 1MB > unwritten extent to be allocated out beyond EOF. The subsequent > extending 8kB writes to that file now hit that unwritten extent, and > only need to convert it to written. The same will happen for all > other concurrent extending writes - they will allocate in 1MB > chunks, not 8KB. We could probably benefit from that.
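To make the extension compromise above concrete, a hypothetical sketch; the threshold, block size and helper name are invented for illustration and are not PostgreSQL's actual code.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ          8192
#define BULK_THRESHOLD  (512 * 1024)  /* invented bulk-extend cutoff */

/* Grow a relation file from old_size to new_size (both BLCKSZ multiples). */
int extend_file(int fd, off_t old_size, off_t new_size)
{
    off_t grow = new_size - old_size;

    if (grow >= BULK_THRESHOLD) {
        /* Reserve space so later dirty-page writeback cannot hit ENOSPC;
         * leaves unwritten extents behind (the cost discussed above). */
        if (fallocate(fd, 0, old_size, grow) == 0)
            return 0;
        if (errno != EOPNOTSUPP)
            return -1;
        /* fall through to zero-filling */
    }

    /* Small (or unsupported) case: write real zeroes, which also leaves
     * initialized extents behind. */
    char zeroes[BLCKSZ];
    memset(zeroes, 0, sizeof(zeroes));
    for (off_t off = old_size; off < new_size; off += BLCKSZ)
        if (pwrite(fd, zeroes, BLCKSZ, off) != BLCKSZ)
            return -1;
    return 0;
}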
> One of the most important properties of extent size hints is that > they can be dynamically tuned *without changing the application.* > The extent size hint is a property of the inode, and it can be set > by the admin through various XFS tools (e.g. mkfs.xfs for a > filesystem wide default, xfs_io to set it on a directory so all new > files/dirs created in that directory inherit the value, set it on > individual files, etc). It can be changed even whilst the file is in > active use by the application. IME our users run enough postgres instances, across a lot of differing workloads, that manual tuning like that will rarely if ever happen :(. I miss well educated DBAs :(. A large portion of users doesn't even have direct access to the server, only via the postgres protocol... If we were to use these hints, it'd have to happen automatically from within postgres. But that does seem viable, but certainly is also not exactly filesystem independent... > > The fallocate in turn triggers slowness in the write paths, as > > writing to uninitialized extents is a metadata operation. > > That is not the problem you think it is. XFS is using unwritten > extents for all buffered IO writes that use delayed allocation, too, > and I don't see you complaining about that.... It's a problem for buffered IO as well, just a bit harder to hit on many drives, because buffered O_DSYNC writes don't use FUA. If you need any durable writes into a file with unwritten extents, things get painful very fast. See a few paragraphs below for the most crucial case where we need to make sure writes are durable. testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do echo buffered: $buffered overwrite: $overwrite; rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=psync --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 --rw=write --size=64MB --sync=dsync --name pg-extend --overwrite=$overwrite |grep IOPS;done;done buffered: 0 overwrite: 0 write: IOPS=1427, BW=5709KiB/s (5846kB/s)(64.0MiB/11479msec); 0 zone resets buffered: 0 overwrite: 1 write: IOPS=4025, BW=15.7MiB/s (16.5MB/s)(64.0MiB/4070msec); 0 zone resets buffered: 1 overwrite: 0 write: IOPS=1638, BW=6554KiB/s (6712kB/s)(64.0MiB/9999msec); 0 zone resets buffered: 1 overwrite: 1 write: IOPS=3663, BW=14.3MiB/s (15.0MB/s)(64.0MiB/4472msec); 0 zone resets That's a > 2x throughput difference. And the results would be similar with --fdatasync=1. If you add AIO to the mix, the difference gets way bigger, particularly on drives with FUA support and DIO: testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do echo buffered: $buffered overwrite: $overwrite; rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=io_uring --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 --rw=write --size=64MB --sync=dsync --name pg-extend --overwrite=$overwrite --iodepth 32 |grep IOPS;done;done buffered: 0 overwrite: 0 write: IOPS=6143, BW=24.0MiB/s (25.2MB/s)(64.0MiB/2667msec); 0 zone resets buffered: 0 overwrite: 1 write: IOPS=76.6k, BW=299MiB/s (314MB/s)(64.0MiB/214msec); 0 zone resets buffered: 1 overwrite: 0 write: IOPS=1835, BW=7341KiB/s (7517kB/s)(64.0MiB/8928msec); 0 zone resets buffered: 1 overwrite: 1 write: IOPS=4096, BW=16.0MiB/s (16.8MB/s)(64.0MiB/4000msec); 0 zone resets It's less bad, but still quite a noticeable difference, on drives without volatile caches. And it's often worse on networked storage, whether it has a volatile cache or not. 
> > It'd be great if > > the allocation behaviour with concurrent file extension could be improved and > > if we could have a fallocate mode that forces extents to be initialized. > > <sigh> > > You mean like FALLOC_FL_WRITE_ZEROES? I hadn't seen that it was merged, that's great! It doesn't yet seem to be documented in the fallocate(2) man page, which I had checked... Hm, also doesn't seem to work on xfs yet :(, EOPNOTSUPP. > That won't fix your fragmentation problem, and it has all the same pipeline > stall problems as allocating unwritten extents in fallocate(). The primary case where FALLOC_FL_WRITE_ZEROES would be useful is for WAL file creation; the files are always of the same fixed size (therefore no fragmentation risk). To avoid having a metadata operation during our commit path, we today default to forcing them to be allocated by overwriting them with zeros and fsyncing them. To avoid having to do that all the time, we reuse them once they're not needed anymore. Not ensuring that the extents are already written would have a very large perf penalty (as in ~2-3x for OLTP workloads, on XFS). That's true both when using DIO and when not. To avoid having to do that over and over, we recycle WAL files. Unfortunately this means that when all those WAL files are not yet preallocated (or when we release them during low activity), the performance is rather noticeably worsened by the additional IO for pre-zeroing the WAL files. In theory FALLOC_FL_WRITE_ZEROES should be faster than issuing writes for the whole range. > Only much worse now, because the IO pipeline is stalled for the > entire time it takes to write the zeroes to persistent storage. i.e. > long tail file access latencies will increase massively if you do > this regularly to extend files. In the WAL path we fsync at the point we could use FALLOC_FL_WRITE_ZEROES, as otherwise the WAL segment might not exist after a crash, which would be ... bad. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 38+ messages in thread
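A sketch of the WAL-segment preallocation discussed here, assuming headers new enough to define FALLOC_FL_WRITE_ZEROES; per the thread it is not yet supported on XFS, so the fallback mirrors the overwrite-with-zeroes-then-fsync behaviour Andres describes. The segment size and function name are illustrative, not PostgreSQL's code.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/falloc.h>

#define WAL_SEGSZ (16 * 1024 * 1024)  /* typical 16MB segment */

int prealloc_wal_segment(int fd)
{
#ifdef FALLOC_FL_WRITE_ZEROES
    /* Allocates written, zeroed extents so later commit-path writes are
     * pure overwrites with no unwritten-extent conversion. */
    if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, WAL_SEGSZ) == 0)
        return fdatasync(fd);
    if (errno != EOPNOTSUPP && errno != EINVAL)
        return -1;
#endif
    /* Fallback: overwrite the whole segment with zeroes, then make it
     * durable so the segment survives a crash. */
    char buf[128 * 1024];
    memset(buf, 0, sizeof(buf));
    for (off_t off = 0; off < WAL_SEGSZ; off += sizeof(buf))
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
            return -1;
    return fdatasync(fd);
}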
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 15:47 ` Andres Freund 2026-02-17 22:45 ` Dave Chinner @ 2026-02-18 6:53 ` Christoph Hellwig 1 sibling, 0 replies; 38+ messages in thread From: Christoph Hellwig @ 2026-02-18 6:53 UTC (permalink / raw) To: Andres Freund Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote: > Most prominently: With DIO concurrently extending multiple files leads to > quite terrible fragmentation, at least with XFS. Forcing us to > over-aggressively use fallocate(), truncating later if it turns out we need > less space. The fallocate in turn triggers slowness in the write paths, as > writing to uninitialized extents is a metadata operation. It'd be great if > the allocation behaviour with concurrent file extension could be improved and > if we could have a fallocate mode that forces extents to be initialized. As Dave already mentioned, if you do concurrent allocations (extension or hole filling), setting an extent size hint is probably a good idea. We could try to look into heuristics, but chances are that they would degrade other use cases. Details would be useful as a report on the XFS list. > > A secondary issue is that with the buffer pool sizes necessary for DIO use on > bigger systems, creating the anonymous memory mapping becomes painfully slow > if we use MAP_POPULATE - which we kinda need to do, as otherwise performance > is very inconsistent initially (often iomap -> gup -> handle_mm_fault -> > folio_zero_user uses the majority of the CPU). We've been experimenting with > not using MAP_POPULATE and using multiple threads to populate the mapping in > parallel, but that doesn't feel like something that userspace ought to have to > do. It's easier for us to work around than the uninitialized extent > conversion issue, but it still is something we IMO shouldn't have to do. Please report this to linux-mm. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein 2026-02-17 15:47 ` Andres Freund @ 2026-02-18 6:51 ` Christoph Hellwig 1 sibling, 0 replies; 38+ messages in thread From: Christoph Hellwig @ 2026-02-18 6:51 UTC (permalink / raw) To: Amir Goldstein Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 10:23:36AM +0100, Amir Goldstein wrote: > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > I think a better session would be how we can help postgres to move > > off buffered I/O instead of adding more special cases for them. > > Respectfully, I disagree that DIO is the only possible solution. > Direct I/O is a legit solution for databases and so is buffered I/O > each with their own caveats. Maybe. Classic buffered I/O is not a legit solution for doing atomic I/Os, and if Postgres is desperate to use that, something like direct I/O (including the proposed write-through semantics) is the only sensible choice. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav ` (2 preceding siblings ...) 2026-02-17 5:51 ` Christoph Hellwig @ 2026-02-20 10:08 ` Pankaj Raghav (Samsung) 2026-02-20 15:10 ` Christoph Hellwig 3 siblings, 1 reply; 38+ messages in thread From: Pankaj Raghav (Samsung) @ 2026-02-20 10:08 UTC (permalink / raw) To: linux-xfs, linux-mm, linux-fsdevel, lsf-pc Cc: Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > Hi all, > > Atomic (untorn) writes for Direct I/O have successfully landed in kernel > for ext4 and XFS[1][2]. However, extending this support to Buffered I/O > remains a contentious topic, with previous discussions often stalling due to > concerns about complexity versus utility. > Hi, Thanks a lot everyone for the input on this topic. I would like to summarize some of the important points discussed here so that it could be used as a reference for the talk and RFCs going forward: - There is a general consensus to add atomic support to the buffered IO path. - The first step is to add support for RWF_WRITETHROUGH as initially proposed by Dave Chinner. Semantics of RWF_WRITETHROUGH (based on my understanding): * Immediate Writeback Initiation: When RWF_WRITETHROUGH is used with a buffered write, the kernel will immediately initiate the writeback of the data to storage. We use the page cache to serialize overlapping writes. Folio Lock -> Copy data to the page cache -> Initiate and complete writeback -> Unlock folio * Synchronous I/O Behavior: The I/O operation will behave synchronously from the application's perspective. This means the system call will block until the write operation has been submitted to the device. Any I/O errors will be reported directly to the caller. (Similar to Direct I/O) * No Inherent Data Integrity Guarantees: Unlike RWF_DSYNC, RWF_WRITETHROUGH itself does not inherently guarantee that the data has reached non-volatile storage. - Once the writethrough infrastructure is in place, we can layer atomic support on top of the buffered IO path. Atomic writes will require more guarantees, such as no short copies, stable pages during writeback, etc. Feel free to add/correct the above points. -- Pankaj ^ permalink raw reply [flat|nested] 38+ messages in thread
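As a closing reference for the RFC work, the discovery side of untorn writes already exists for direct I/O via statx(2) (kernel 6.11+ and suitably new libc headers); that the same limits would govern the proposed buffered path is an assumption.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
#ifdef STATX_WRITE_ATOMIC
    struct statx stx;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx)) {
        perror("statx");
        return 1;
    }
    if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)) {
        fprintf(stderr, "untorn writes not supported here\n");
        return 1;
    }
    printf("atomic write unit min/max: %u/%u, max segments: %u\n",
           stx.stx_atomic_write_unit_min,
           stx.stx_atomic_write_unit_max,
           stx.stx_atomic_write_segments_max);
    return 0;
#else
    (void)argc; (void)argv;
    fprintf(stderr, "headers too old for STATX_WRITE_ATOMIC\n");
    return 1;
#endif
}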
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-20 10:08 ` Pankaj Raghav (Samsung) @ 2026-02-20 15:10 ` Christoph Hellwig 0 siblings, 0 replies; 38+ messages in thread From: Christoph Hellwig @ 2026-02-20 15:10 UTC (permalink / raw) To: Pankaj Raghav (Samsung) Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Fri, Feb 20, 2026 at 10:08:26AM +0000, Pankaj Raghav (Samsung) wrote: > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > > Hi all, > > > > Atomic (untorn) writes for Direct I/O have successfully landed in kernel > > for ext4 and XFS[1][2]. However, extending this support to Buffered I/O > > remains a contentious topic, with previous discussions often stalling due to > > concerns about complexity versus utility. > > > > Hi, > > Thanks a lot everyone for the input on this topic. I would like to > summarize some of the important points discussed here so that it could > be used as a reference for the talk and RFCs going forward: > > - There is a general consensus to add atomic support to buffered IO > path. I don't think that's quite true. ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2026-02-20 15:10 UTC | newest] Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2026-02-13 10:20 [LSF/MM/BPF TOPIC] Buffered atomic writes Pankaj Raghav 2026-02-13 13:32 ` Ojaswin Mujoo 2026-02-16 9:52 ` Pankaj Raghav 2026-02-16 15:45 ` Andres Freund 2026-02-17 12:06 ` Jan Kara 2026-02-17 12:42 ` Pankaj Raghav 2026-02-17 16:21 ` Andres Freund 2026-02-18 1:04 ` Dave Chinner 2026-02-18 6:47 ` Christoph Hellwig 2026-02-18 23:42 ` Dave Chinner 2026-02-17 16:13 ` Andres Freund 2026-02-17 18:27 ` Ojaswin Mujoo 2026-02-17 18:42 ` Andres Freund 2026-02-18 17:37 ` Jan Kara 2026-02-18 21:04 ` Andres Freund 2026-02-19 0:32 ` Dave Chinner 2026-02-17 18:33 ` Ojaswin Mujoo 2026-02-17 17:20 ` Ojaswin Mujoo 2026-02-18 17:42 ` [Lsf-pc] " Jan Kara 2026-02-18 20:22 ` Ojaswin Mujoo 2026-02-16 11:38 ` Jan Kara 2026-02-16 13:18 ` Pankaj Raghav 2026-02-17 18:36 ` Ojaswin Mujoo 2026-02-16 15:57 ` Andres Freund 2026-02-17 18:39 ` Ojaswin Mujoo 2026-02-18 0:26 ` Dave Chinner 2026-02-18 6:49 ` Christoph Hellwig 2026-02-18 12:54 ` Ojaswin Mujoo 2026-02-15 9:01 ` Amir Goldstein 2026-02-17 5:51 ` Christoph Hellwig 2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein 2026-02-17 15:47 ` Andres Freund 2026-02-17 22:45 ` Dave Chinner 2026-02-18 4:10 ` Andres Freund 2026-02-18 6:53 ` Christoph Hellwig 2026-02-18 6:51 ` Christoph Hellwig 2026-02-20 10:08 ` Pankaj Raghav (Samsung) 2026-02-20 15:10 ` Christoph Hellwig