From: Andres Freund <andres@anarazel.de>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>,
Pankaj Raghav <pankaj.raghav@linux.dev>,
linux-xfs@vger.kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org,
lsf-pc@lists.linux-foundation.org, djwong@kernel.org,
john.g.garry@oracle.com, willy@infradead.org,
ritesh.list@gmail.com, jack@suse.cz, ojaswin@linux.ibm.com,
Luis Chamberlain <mcgrof@kernel.org>,
dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
vi.shah@samsung.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Tue, 17 Feb 2026 10:47:07 -0500
Message-ID: <ndwqem2mzymo6j3zw3mmxk2vh4mnun2fb2s5vrh4nthatlze3u@qjemcazy4agv>
In-Reply-To: <CAOQ4uxgdWvJPAi6QMWQjWJ2TnjO=JP84WCgQ+ShM3GiikF=bSw@mail.gmail.com>
Hi,
On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > I think a better session would be how we can help postgres to move
> > off buffered I/O instead of adding more special cases for them.
FWIW, we are adding support for DIO (it's been added, but performance isn't
competitive for most workloads in the released versions yet; work to address
those issues is in progress).
But it'll only really be viable for larger setups, not for e.g.:
- smaller, unattended setups
- uses of postgres as part of a larger application on one server with hard to
predict memory usage of different components
- intentionally overcommitted shared hosting type scenarios
Even once a well configured postgres using DIO beats postgres not using DIO,
I'll bet that well over 50% of users won't be able to use DIO.
There are some kernel issues that make it harder than necessary to use DIO,
btw:
Most prominently: With DIO, concurrently extending multiple files leads to
quite terrible fragmentation, at least with XFS. That forces us to use
fallocate() over-aggressively, truncating later if it turns out we need
less space. The fallocate in turn triggers slowness in the write paths, as
writing to uninitialized extents is a metadata operation. It'd be great if
the allocation behaviour with concurrent file extension could be improved and
if we could have a fallocate mode that forces extents to be initialized.
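
A minimal sketch of the preallocate-then-truncate workaround described above.
The helper names are hypothetical and this is not PostgreSQL's actual code. It
assumes posix_fallocate(3), which on Linux/glibc uses fallocate(2) with mode 0
where the filesystem supports it. On XFS the reserved range is backed by
unwritten extents, so the first write into it still pays for the extent
conversion mentioned above.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reserve `reserve` bytes starting at `cur_size`,
 * so that later writes extend into space that is already allocated.
 * Returns 0 on success or an errno value on failure. */
int reserve_ahead(int fd, off_t cur_size, off_t reserve)
{
    /* posix_fallocate() returns the error number directly, not -1. */
    return posix_fallocate(fd, cur_size, reserve);
}

/* Hypothetical helper: give back the reserved space that was not needed.
 * Returns 0 on success or an errno value on failure. */
int trim_to_used(int fd, off_t used)
{
    return ftruncate(fd, used) == 0 ? 0 : errno;
}

/* Current file size in bytes, or -1 on error. */
off_t file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

The reserve size is a tuning knob: a larger reservation means fewer
allocations racing with other files being extended, but more space to trim
afterwards.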
A secondary issue is that with the buffer pool sizes necessary for DIO use on
bigger systems, creating the anonymous memory mapping becomes painfully slow
if we use MAP_POPULATE - which we kinda need to do, as otherwise performance
is very inconsistent initially (often iomap -> gup -> handle_mm_fault ->
folio_zero_user uses the majority of the CPU). We've been experimenting with
not using MAP_POPULATE and using multiple threads to populate the mapping in
parallel, but that doesn't feel like something that userspace ought to have to
do. It's easier for us to work around than the uninitialized extent
conversion issue, but it still is something we IMO shouldn't have to do.
> Respectfully, I disagree that DIO is the only possible solution.
> Direct I/O is a legit solution for databases and so is buffered I/O
> each with their own caveats.
> Specifically, when two subsystems (kernel vfs and db) each require a huge
> amount of cache memory for best performance, setting them up to play nicely
> together to utilize system memory in an optimal way is a huge pain.
Yep.
Greetings,
Andres Freund