From: Andres Freund <andres@anarazel.de>
To: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Amir Goldstein <amir73il@gmail.com>,
Christoph Hellwig <hch@lst.de>,
Pankaj Raghav <pankaj.raghav@linux.dev>,
linux-xfs@vger.kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org,
lsf-pc@lists.linux-foundation.org, djwong@kernel.org,
john.g.garry@oracle.com, willy@infradead.org, jack@suse.cz,
ojaswin@linux.ibm.com, Luis Chamberlain <mcgrof@kernel.org>,
dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
vi.shah@samsung.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Sun, 8 Mar 2026 11:33:10 -0400
Message-ID: <pyyeaqzjdpgkpjfyxkrlzawwwj6elzvidubtq73toufr7wgoec@prv4u3o5ixjy>
In-Reply-To: <v7f6u19i.ritesh.list@gmail.com>
Hi,
On 2026-03-08 14:49:21 +0530, Ritesh Harjani wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> >> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >> >
> >> > I think a better session would be how we can help postgres to move
> >> > off buffered I/O instead of adding more special cases for them.
> >
> > FWIW, we are adding support for DIO (it's been added, but performance isn't
> > competitive for most workloads in the released versions yet, work to address
> > those issues is in progress).
> >
>
> Is postgres also planning to evaluate the performance gains by using DIO
> atomic writes available in upstream linux kernel? What would be
> interesting to see is the relative %delta with DIO atomic-writes v/s
> DIO non atomic writes.
For some limited workloads that comparison is possible today with minimal work
(albeit with some safety compromises, as postgres does not yet verify that
the atomic boundaries are correct, but it's good enough for experiments), as
you can just disable the torn-page avoidance with a configuration parameter
(full_page_writes).
The gains from not needing full page writes (postgres' mechanism to protect
against torn pages) can be rather significant, as full page writes have
substantial overhead due to the higher journalling volume. The worst part of
the cost is that it decreases between checkpoints (because we don't need
to repeatedly log full page images for the same page), just to then increase
again when the next checkpoint starts. It's not uncommon that in the phase
just after the start of a checkpoint, over 90% of the WAL consists of full page
writes (when not having full page write compression enabled), while later the
same workload only has a very small percentage of that overhead. The biggest gain
from atomic writes will be the more even performance (important for real world
users), rather than the absolute increase in throughput.
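To illustrate why the overhead is front-loaded after a checkpoint, here is a
toy model (a sketch, not postgres code): the first change to a page after a
checkpoint logs the whole page, later changes log only a small record. The
function name, the uniform random access pattern and the 100 byte record size
are assumptions made up for this example; 8192 is postgres' default block
size.

```python
import random

PAGE_SIZE = 8192    # postgres' default block size in bytes
RECORD_SIZE = 100   # assumed size of an ordinary WAL record (hypothetical)


def fpw_share_per_interval(num_pages, updates_per_interval, intervals, seed=0):
    """Return, per interval after a checkpoint, the share of WAL bytes
    that consists of full page images (toy model, uniform random updates)."""
    rng = random.Random(seed)
    imaged = set()  # pages that already got a full page image this cycle
    shares = []
    for _ in range(intervals):
        fpw_bytes = 0
        total_bytes = 0
        for _ in range(updates_per_interval):
            page = rng.randrange(num_pages)
            total_bytes += RECORD_SIZE
            if page not in imaged:
                # First change since the checkpoint: log the whole page.
                imaged.add(page)
                fpw_bytes += PAGE_SIZE
                total_bytes += PAGE_SIZE
        shares.append(fpw_bytes / total_bytes)
    return shares


for i, share in enumerate(fpw_share_per_interval(1000, 1000, 10)):
    print(f"interval {i}: {share:.0%} of WAL bytes are full page images")
```

In this model the share starts very high and falls off as more pages have
already been imaged; a new checkpoint would empty the set and restart the
cycle. Real workloads are not uniform, so the shape will differ.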
Normal gains during the full page intensive phase are probably on the order of
20-35% for workloads with many small transactions, bigger for workloads with
larger transactions. But if the increase in WAL volume pushes you above the
disk write throughput, the gains can be almost arbitrarily larger. E.g. on a
cloud disk with 100MB/s of write bandwidth, the difference between WAL
throughput of 50MB/s without full page writes and the same workload with full
page images generating ~300MB/s of WAL will obviously mean that you'll get
at most about 1/3 of the transaction throughput while also not having any
spare IO capacity for anything other than WAL writes.
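The arithmetic behind that example, as a small sketch. It assumes that WAL
bandwidth is the only bottleneck and that WAL volume scales linearly with
transaction rate; the function and parameter names are made up for
illustration.

```python
def relative_throughput(disk_mb_s, wal_without_fpw_mb_s, wal_with_fpw_mb_s):
    """Transaction throughput with full page writes, relative to the same
    workload without them, when only disk write bandwidth limits WAL."""
    # Fraction of the offered load the disk can absorb in each case,
    # capped at 1.0 when the WAL rate fits within the disk bandwidth.
    rate_without = min(1.0, disk_mb_s / wal_without_fpw_mb_s)
    rate_with = min(1.0, disk_mb_s / wal_with_fpw_mb_s)
    return rate_with / rate_without


# 100MB/s disk, 50MB/s of WAL without full page writes, ~300MB/s with them.
ratio = relative_throughput(100, 50, 300)
print(f"{ratio:.2f}")  # 0.33
```

With a faster disk (say 1000MB/s) the same call returns 1.0, which is why the
measured gain depends so much on the hardware.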
The reason I say limited workloads above is that upstream postgres does not
yet do smart enough write combining with DIO for data writes. I'd expect that
to be addressed later this year (but it's community open source; as you
presumably know from experience, that's not always easy to predict /
control). If the workload has a large fraction of data writes, the resulting
overhead makes the DIO numbers unrealistic.
Unfortunately all this means that the gains from atomic writes, be it for
buffered or direct IO, will depend very heavily on the chosen workload,
and by tweaking the workload / hardware you can inflate the gains to an almost
arbitrarily large degree.
This is also about more than throughput / latency, as the volume of WAL also
impacts the cost of retaining the WAL - often that's done for a while to allow
point-in-time-recovery (i.e. recovering an older base backup up to a precise
point in time, to recover from application bugs or operator errors).
Greetings,
Andres Freund