From: Andres Freund <andres@anarazel.de>
To: Dave Chinner <dgc@kernel.org>
Cc: Amir Goldstein <amir73il@gmail.com>,
Christoph Hellwig <hch@lst.de>,
Pankaj Raghav <pankaj.raghav@linux.dev>,
linux-xfs@vger.kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org,
lsf-pc@lists.linux-foundation.org, djwong@kernel.org,
john.g.garry@oracle.com, willy@infradead.org,
ritesh.list@gmail.com, jack@suse.cz, ojaswin@linux.ibm.com,
Luis Chamberlain <mcgrof@kernel.org>,
dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
vi.shah@samsung.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Tue, 17 Feb 2026 23:10:23 -0500
Message-ID: <is77m5lxg22z2lfhpj3zh7hse5wmft5i2mae72of7iffmtjktu@euxitej5vwxp>
In-Reply-To: <aZTvmpOL7NC4_kDq@dread>
Hi,
On 2026-02-18 09:45:46 +1100, Dave Chinner wrote:
> On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> > There are some kernel issues that make it harder than necessary to use DIO,
> > btw:
> >
> > Most prominently: With DIO concurrently extending multiple files leads to
> > quite terrible fragmentation, at least with XFS. Forcing us to
> > over-aggressively use fallocate(), truncating later if it turns out we need
> > less space.
>
> <ahem>
>
> seriously, fallocate() is considered harmful for exactly these sorts
> of reasons. XFS has vastly better mechanisms built into it that
> mitigate worst case fragmentation without needing to change
> applications or increase runtime overhead.
There's probably a misunderstanding here: We don't do fallocate to avoid
fragmentation.
We want to guarantee that there's space for data that is in our buffer pool,
as otherwise it's very easy to get into a pickle:
If there is dirty data in the buffer pool that can't be written out due to
ENOSPC, the subsequent checkpoint can't complete. The system may then be
stuck: you're not able to create more space for WAL / journaling, you can't
free up old WAL because the checkpoint can't complete, and if you react to
that with a crash-recovery cycle, you're likely unable to complete crash
recovery because you'll just hit ENOSPC again.
And yes, CoW filesystems make that guarantee less reliable, but it turns out
to still save people often enough that I doubt we can get rid of it.
To ensure there's space for the write out of our buffer pool we have two
choices:
1) write out zeroes
2) use fallocate
Writing out zeroes that we will just overwrite later is obviously not a
particularly good use of IO bandwidth, particularly on metered cloud
"storage". But using fallocate() has fragmentation and unwritten-extent
issues. Our compromise is that we use fallocate iff we enlarge the relation
by a decent number of pages at once and write zeroes otherwise.
Is that perfect? Hell no. But it's also not obvious what a better answer is
with today's interfaces.
If there were a "guarantee that N additional blocks are reserved, but not
concretely allocated" interface, we'd gladly use it.
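
FWIW, the shape of that compromise is roughly the following. This is just a
sketch, the threshold and names are made up for illustration, it's not the
actual postgres code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192
#define FALLOCATE_MIN_BLOCKS 8      /* arbitrary for this sketch */

static int
extend_relation(int fd, off_t old_size, int add_blocks)
{
    off_t add_bytes = (off_t) add_blocks * BLCKSZ;

    if (add_blocks >= FALLOCATE_MIN_BLOCKS)
    {
        /* growing by a lot: reserve space, accept unwritten extents */
        return fallocate(fd, 0, old_size, add_bytes);
    }

    /* growing by a little: write real zeroes, no unwritten extents */
    static const char zeroes[BLCKSZ];

    for (int i = 0; i < add_blocks; i++)
    {
        if (pwrite(fd, zeroes, BLCKSZ,
                   old_size + (off_t) i * BLCKSZ) != BLCKSZ)
            return -1;
    }
    return 0;
}
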
> So, let's set the extent size hint on a file to 1MB. Now whenever a
> data extent allocation on that file is attempted, the extent size
> that is allocated will be rounded up to the nearest 1MB. i.e. XFS
> will try to allocate unwritten extents in aligned multiples of the
> extent size hint regardless of the actual IO size being performed.
>
> Hence if you are doing concurrent extending 8kB writes, instead of
> allocating 8kB at a time, the extent size hint will force a 1MB
> unwritten extent to be allocated out beyond EOF. The subsequent
> extending 8kB writes to that file now hit that unwritten extent, and
> only need to convert it to written. The same will happen for all
> other concurrent extending writes - they will allocate in 1MB
> chunks, not 8KB.
We could probably benefit from that.
> One of the most important properties of extent size hints is that
> they can be dynamically tuned *without changing the application.*
> The extent size hint is a property of the inode, and it can be set
> by the admin through various XFS tools (e.g. mkfs.xfs for a
> filesystem wide default, xfs_io to set it on a directory so all new
> files/dirs created in that directory inherit the value, set it on
> individual files, etc). It can be changed even whilst the file is in
> active use by the application.
IME our users run enough postgres instances, across a lot of differing
workloads, that manual tuning like that will rarely if ever happen :(. I miss
well-educated DBAs :(. A large portion of users don't even have direct access
to the server, only via the postgres protocol...
If we were to use these hints, it'd have to happen automatically from within
postgres. That does seem viable, but it's certainly also not exactly
filesystem independent...
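
That said, setting the hint from within postgres wouldn't be a lot of code,
e.g. via the FS_IOC_FSSETXATTR ioctl. A sketch (untested, the 1MB value is
just an example):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Set an extent size hint on an already-open file. */
static int
set_extent_size_hint(int fd, unsigned int extsize_bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;

    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = extsize_bytes;    /* e.g. 1 << 20 for 1MB */

    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

We'd presumably call something like that when creating relation files and
silently ignore failures on filesystems that don't support it.
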
> > The fallocate in turn triggers slowness in the write paths, as
> > writing to uninitialized extents is a metadata operation.
>
> That is not the problem you think it is. XFS is using unwritten
> extents for all buffered IO writes that use delayed allocation, too,
> and I don't see you complaining about that....
It's a problem for buffered IO as well, just a bit harder to hit on many
drives, because buffered O_DSYNC writes don't use FUA.
If you need any durable writes into a file with unwritten extents, things get
painful very fast.
See a few paragraphs below for the most crucial case where we need to make
sure writes are durable.
testdir=/srv/fio &&
for buffered in 0 1; do
  for overwrite in 0 1; do
    echo buffered: $buffered overwrite: $overwrite
    rm -f $testdir/pg-extend* &&
      fio --directory=$testdir --ioengine=psync --buffered=$buffered \
          --bs=4kB --fallocate=none --overwrite=0 --rw=write --size=64MB \
          --sync=dsync --name pg-extend --overwrite=$overwrite | grep IOPS
  done
done
buffered: 0 overwrite: 0
write: IOPS=1427, BW=5709KiB/s (5846kB/s)(64.0MiB/11479msec); 0 zone resets
buffered: 0 overwrite: 1
write: IOPS=4025, BW=15.7MiB/s (16.5MB/s)(64.0MiB/4070msec); 0 zone resets
buffered: 1 overwrite: 0
write: IOPS=1638, BW=6554KiB/s (6712kB/s)(64.0MiB/9999msec); 0 zone resets
buffered: 1 overwrite: 1
write: IOPS=3663, BW=14.3MiB/s (15.0MB/s)(64.0MiB/4472msec); 0 zone resets
That's a > 2x throughput difference. And the results would be similar with
--fdatasync=1.
If you add AIO to the mix, the difference gets way bigger, particularly on
drives with FUA support and DIO:
testdir=/srv/fio &&
for buffered in 0 1; do
  for overwrite in 0 1; do
    echo buffered: $buffered overwrite: $overwrite
    rm -f $testdir/pg-extend* &&
      fio --directory=$testdir --ioengine=io_uring --buffered=$buffered \
          --bs=4kB --fallocate=none --overwrite=0 --rw=write --size=64MB \
          --sync=dsync --name pg-extend --overwrite=$overwrite \
          --iodepth 32 | grep IOPS
  done
done
buffered: 0 overwrite: 0
write: IOPS=6143, BW=24.0MiB/s (25.2MB/s)(64.0MiB/2667msec); 0 zone resets
buffered: 0 overwrite: 1
write: IOPS=76.6k, BW=299MiB/s (314MB/s)(64.0MiB/214msec); 0 zone resets
buffered: 1 overwrite: 0
write: IOPS=1835, BW=7341KiB/s (7517kB/s)(64.0MiB/8928msec); 0 zone resets
buffered: 1 overwrite: 1
write: IOPS=4096, BW=16.0MiB/s (16.8MB/s)(64.0MiB/4000msec); 0 zone resets
It's less bad, but still quite a noticeable difference, on drives without
volatile caches. And it's often worse on networked storage, whether it has a
volatile cache or not.
> > It'd be great if
> > the allocation behaviour with concurrent file extension could be improved and
> > if we could have a fallocate mode that forces extents to be initialized.
>
> <sigh>
>
> You mean like FALLOC_FL_WRITE_ZEROES?
I hadn't seen that it was merged, that's great! It doesn't yet seem to be
documented in the fallocate(2) man page, which I had checked...
Hm, it also doesn't seem to work on xfs yet (EOPNOTSUPP) :(.
> That won't fix your fragmentation problem, and it has all the same pipeline
> stall problems as allocating unwritten extents in fallocate().
The primary case where FALLOC_FL_WRITE_ZEROES would be useful for us is WAL
file creation; WAL files are always of the same fixed size (therefore no
fragmentation risk).
To avoid metadata operations in our commit path, we today default to forcing
the WAL files to be fully allocated by overwriting them with zeroes and
fsyncing them. Not ensuring that the extents are already written would have a
very large perf penalty (as in ~2-3x for OLTP workloads, on XFS), both with
DIO and without.
To avoid having to do that pre-zeroing over and over, we recycle WAL files
once they're not needed anymore. Unfortunately that means that whenever WAL
files are not yet preallocated (or when we release them during low activity),
performance is rather noticeably worsened by the additional IO for
pre-zeroing the WAL files.
In theory FALLOC_FL_WRITE_ZEROES should be faster than issuing writes for the
whole range.
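
For WAL segment creation what I'd like to end up with is roughly the
following. Again just a sketch, not the actual code; 16MB is our default
segment size and the fallback mirrors what we effectively do today:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/falloc.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)
#define ZERO_CHUNK 8192

static int
prealloc_wal_segment(int fd)
{
#ifdef FALLOC_FL_WRITE_ZEROES
    /* allocate written, zeroed extents in one go, if supported */
    if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, WAL_SEGMENT_SIZE) == 0)
        return fdatasync(fd);
    if (errno != EOPNOTSUPP && errno != EINVAL)
        return -1;
#endif
    /* fallback: physically write the zeroes ourselves */
    static const char zeroes[ZERO_CHUNK];

    for (off_t off = 0; off < WAL_SEGMENT_SIZE; off += ZERO_CHUNK)
    {
        if (pwrite(fd, zeroes, ZERO_CHUNK, off) != ZERO_CHUNK)
            return -1;
    }
    return fdatasync(fd);
}

I'd keep the fdatasync() in the FALLOC_FL_WRITE_ZEROES path either way, since
we need the segment (and its metadata) to survive a crash.
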
> Only much worse now, because the IO pipeline is stalled for the
> entire time it takes to write the zeroes to persistent storage. i.e.
> long tail file access latencies will increase massively if you do
> this regularly to extend files.
In the WAL path we already fsync at the point where we could use
FALLOC_FL_WRITE_ZEROES, as otherwise the WAL segment might not exist after a
crash, which would be
... bad.
Greetings,
Andres Freund