From: Dave Chinner <dgc@kernel.org>
To: Andres Freund <andres@anarazel.de>
Cc: Amir Goldstein <amir73il@gmail.com>,
Christoph Hellwig <hch@lst.de>,
Pankaj Raghav <pankaj.raghav@linux.dev>,
linux-xfs@vger.kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
ritesh.list@gmail.com, jack@suse.cz, ojaswin@linux.ibm.com,
Luis Chamberlain <mcgrof@kernel.org>,
dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
vi.shah@samsung.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Wed, 18 Feb 2026 09:45:46 +1100
Message-ID: <aZTvmpOL7NC4_kDq@dread>
In-Reply-To: <ndwqem2mzymo6j3zw3mmxk2vh4mnun2fb2s5vrh4nthatlze3u@qjemcazy4agv>
On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> Hi,
>
> On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > I think a better session would be how we can help postgres to move
> > > off buffered I/O instead of adding more special cases for them.
>
> FWIW, we are adding support for DIO (it's been added, but performance isn't
> competitive for most workloads in the released versions yet, work to address
> those issues is in progress).
>
> But it's only really viable for larger setups, not for e.g.:
> - smaller, unattended setups
> - uses of postgres as part of a larger application on one server with hard to
> predict memory usage of different components
> - intentionally overcommitted shared hosting type scenarios
>
> Even once a well configured postgres using DIO beats postgres not using DIO,
> I'll bet that well over 50% of users won't be able to use DIO.
>
>
> There are some kernel issues that make it harder than necessary to use DIO,
> btw:
>
> Most prominently: With DIO concurrently extending multiple files leads to
> quite terrible fragmentation, at least with XFS. Forcing us to
> over-aggressively use fallocate(), truncating later if it turns out we need
> less space.
<ahem>
seriously, fallocate() is considered harmful for exactly these sorts
of reasons. XFS has vastly better mechanisms built into it that
mitigate worst case fragmentation without needing to change
applications or increase runtime overhead.
So, let's go way back - 32 years, to 1994:
commit 32766d4d387bc6779e0c432fb56a0cc4e6b96398
Author: Doug Doucette <doucette@engr.sgi.com>
Date: Thu Mar 3 22:17:15 1994 +0000
Add fcntl implementation (F_FSGETXATTR, F_FSSETXATTR, and F_DIOINFO).
Fix xfs_setattr new xfs fields' implementation to split out error checking
to the front of the routine, like the other attributes. Don't set new
fields in xfs_getattr unless one of the fields is requested.
.....
+ case F_FSSETXATTR: {
+ struct fsxattr fa;
+ vattr_t va;
+
+ if (copyin(arg, &fa, sizeof(fa))) {
+ error = EFAULT;
+ break;
+ }
+ va.va_xflags = fa.fsx_xflags;
+ va.va_extsize = fa.fsx_extsize;
^^^^^^^^^^^^^^^
+ error = xfs_setattr(vp, &va, AT_XFLAGS|AT_EXTSIZE, credp);
+ break;
+ }
This was the commit that added user controlled extent size hints to
XFS. These already existed in EFS, so applications using this
functionality go back even earlier in the 1990s.
So, let's set the extent size hint on a file to 1MB. Now whenever a
data extent allocation on that file is attempted, the extent size
that is allocated will be rounded up to the nearest 1MB. i.e. XFS
will try to allocate unwritten extents in aligned multiples of the
extent size hint regardless of the actual IO size being performed.
Hence if you are doing concurrent extending 8kB writes, instead of
allocating 8kB at a time, the extent size hint will force a 1MB
unwritten extent to be allocated out beyond EOF. The subsequent
extending 8kB writes to that file now hit that unwritten extent, and
only need to convert it to written. The same will happen for all
other concurrent extending writes - they will allocate in 1MB
chunks, not 8KB.
The result will be that the files will interleave 1MB sized extents
across files instead of 8kB sized extents. i.e. we've just reduced
the worst case fragmentation behaviour by a factor of 128. We've
also reduced allocation overhead by a factor of 128, so extent
size hints make the filesystem behave far more efficiently, which
translates directly into higher performance.
IOWs, the extent size hint effectively sets a minimum extent size
that the filesystem will create for a given file, thereby mitigating
the worst case fragmentation that can occur. However, the use of
fallocate() in the application explicitly prevents the filesystem
from doing this smart, transparent IO path thing to mitigate
fragmentation.
One of the most important properties of extent size hints is that
they can be dynamically tuned *without changing the application.*
The extent size hint is a property of the inode, and it can be set
by the admin through various XFS tools (e.g. mkfs.xfs for a
filesystem wide default, xfs_io to set it on a directory so all new
files/dirs created in that directory inherit the value, set it on
individual files, etc). It can be changed even whilst the file is in
active use by the application.
Hence the extent size hint can be changed at any time, and you
can apply it immediately to existing installations as an active
mitigation. Doing this won't fix existing fragmentation (that's what
xfs_fsr is for), but it will instantly mitigate/prevent new
fragmentation from occurring. It's much more difficult to do this
with applications that use fallocate()...
Indeed, the case for using fallocate() instead of extent size hints
gets worse the more you look at how extent size hints work.
Extent size hints don't impact IO concurrency at all. Extent size
hints are only applied during extent allocation, so the optimisation
is applied naturally as part of the existing concurrent IO path.
Hence using extent size hints won't block/stall/prevent concurrent
async IO in any way.
fallocate(), OTOH, causes a full IO pipeline stall (blocks submission
of both reads and writes, then waits for all IO in flight to drain)
on that file for the duration of the syscall. You can't do any sort
of IO (async or otherwise) and run fallocate() at the same time, so
fallocate() really sucks from the POV of a high performance IO app.
fallocate() also marks the files as having persistent preallocation,
which means that when you close the file the filesystem does not
remove excessive extents allocated beyond EOF. Hence the reported
problems with excessive space usage and needing to truncate files
manually (which also causes a complete IO stall on that file) are
brought on specifically because fallocate() is being used by the
application to manage worst case fragmentation.
This problem does not exist with extent size hints - unused blocks
beyond EOF will be trimmed on last close or when the inode is cycled
out of cache, just like we do for excess speculative prealloc beyond
EOF for buffered writes (the buffered IO fragmentation mitigation
mechanism for interleaving concurrent extending writes).
The administrator can easily optimise extent size hints to match the
optimal characteristics of the underlying storage (e.g. set them to
be RAID stripe aligned), etc. Fallocate() requires the application
to provide tunables to modify its behaviour for optimal storage
layout, and depending on how the application uses fallocate(), this
level of flexibility may not even be possible.
And let's not forget that an fallocate() based mitigation that helps
one filesystem type can actively hurt another type (e.g. ext4) by
introducing an application level extent allocation boundary vector
where there was none before.
Hence, IMO, micromanaging filesystem extent allocation with
fallocate() is -almost always- the wrong thing for applications to
be doing. There is no one "right way" to use fallocate() - what is
optimal for one filesystem will be pessimal for another, and it is
impossible to code optimal behaviour in the application for all
filesystem types the app might run on.
> The fallocate in turn triggers slowness in the write paths, as
> writing to uninitialized extents is a metadata operation.
That is not the problem you think it is. XFS is using unwritten
extents for all buffered IO writes that use delayed allocation, too,
and I don't see you complaining about that....
Yes, the overhead of unwritten extent conversion is more visible
with direct IO, but that's only because DIO has much lower overhead
and much, much higher performance ceiling than buffered IO. That
doesn't mean unwritten extents are a performance limiting factor...
> It'd be great if
> the allocation behaviour with concurrent file extension could be improved and
> if we could have a fallocate mode that forces extents to be initialized.
<sigh>
You mean like FALLOC_FL_WRITE_ZEROES?
That won't fix your fragmentation problem, and it has all the same
pipeline stall problems as allocating unwritten extents in
fallocate().
Only much worse now, because the IO pipeline is stalled for the
entire time it takes to write the zeroes to persistent storage. i.e.
long tail file access latencies will increase massively if you do
this regularly to extend files.
-Dave.
--
Dave Chinner
dgc@kernel.org