Date: Tue, 17 Feb 2026 23:10:23 -0500
From: Andres Freund
To: Dave Chinner
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs@vger.kernel.org,
 linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 lsf-pc@lists.linux-foundation.org, djwong@kernel.org, john.g.garry@oracle.com,
 willy@infradead.org, ritesh.list@gmail.com, jack@suse.cz,
 ojaswin@linux.ibm.com, Luis Chamberlain, dchinner@redhat.com,
 Javier Gonzalez, gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
 vi.shah@samsung.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
References: <20260217055103.GA6174@lst.de>

Hi,

On 2026-02-18 09:45:46 +1100, Dave Chinner wrote:
> On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> > There are some kernel issues that make it harder than necessary to use
> > DIO, btw:
> >
> > Most prominently: With DIO concurrently extending multiple files leads to
> > quite terrible fragmentation, at least with XFS. Forcing us to
> > over-aggressively use fallocate(), truncating later if it turns out we
> > need less space.
>
> seriously, fallocate() is considered harmful for exactly these sorts
> of reasons. XFS has vastly better mechanisms built into it that
> mitigate worst case fragmentation without needing to change
> applications or increase runtime overhead.

There's probably a misunderstanding here: we don't use fallocate() to avoid
fragmentation. We want to guarantee that there is space for the data in our
buffer pool, as otherwise it's very easy to get into a pickle: if there is
dirty data in the buffer pool that can't be written out due to ENOSPC, the
subsequent checkpoint can't complete. The system may then be stuck: you're
not able to create more space for WAL / journaling, you can't free up old WAL
because the checkpoint can't complete, and if you react to that with a
crash-recovery cycle, you're likely to be unable to complete crash recovery
because you'll just hit ENOSPC again.

And yes, CoW filesystems make that guarantee less reliable, but it turns out
to still save people often enough that I doubt we can get rid of it.

To ensure there's space for the write-out of our buffer pool we have two
choices:
1) write out zeroes
2) use fallocate()

Writing out zeroes that we will just overwrite later is obviously not a
particularly good use of IO bandwidth, particularly on metered cloud
"storage". But using fallocate() has fragmentation and unwritten-extent
issues.
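In code, the two choices boil down to roughly the following (a minimal,
untested sketch; the helper names and the fixed 8kB block size are just for
illustration, and it assumes the length is a multiple of the block size):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Option 1: reserve space by actually writing zeroes.  Costs real IO
 * bandwidth for data we'll just overwrite later. */
static int
extend_with_zeroes(int fd, off_t offset, off_t len)
{
    static const char zerobuf[8192];    /* 8kB, i.e. one postgres block */
    off_t done = 0;

    while (done < len)
    {
        ssize_t ret = pwrite(fd, zerobuf, sizeof(zerobuf), offset + done);

        if (ret <= 0)
            return -1;
        done += ret;
    }
    return 0;
}

/* Option 2: reserve space with fallocate().  Cheap, but leaves unwritten
 * extents behind, so the first real write into the range is also a metadata
 * operation. */
static int
extend_with_fallocate(int fd, off_t offset, off_t len)
{
    return fallocate(fd, 0, offset, len);
}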
Our compromise is that we use fallocate() iff we enlarge the relation by a
decent number of pages at once, and write zeroes otherwise. Is that perfect?
Hell no. But it's also not obvious what a better answer is with today's
interfaces. If there were a "guarantee that N additional blocks are reserved,
but not concretely allocated" interface, we'd gladly use it.

> So, let's set the extent size hint on a file to 1MB. Now whenever a
> data extent allocation on that file is attempted, the extent size
> that is allocated will be rounded up to the nearest 1MB. i.e. XFS
> will try to allocate unwritten extents in aligned multiples of the
> extent size hint regardless of the actual IO size being performed.
>
> Hence if you are doing concurrent extending 8kB writes, instead of
> allocating 8kB at a time, the extent size hint will force a 1MB
> unwritten extent to be allocated out beyond EOF. The subsequent
> extending 8kB writes to that file now hit that unwritten extent, and
> only need to convert it to written. The same will happen for all
> other concurrent extending writes - they will allocate in 1MB
> chunks, not 8KB.

We could probably benefit from that.

> One of the most important properties of extent size hints is that
> they can be dynamically tuned *without changing the application.*
> The extent size hint is a property of the inode, and it can be set
> by the admin through various XFS tools (e.g. mkfs.xfs for a
> filesystem wide default, xfs_io to set it on a directory so all new
> files/dirs created in that directory inherit the value, set it on
> individual files, etc). It can be changed even whilst the file is in
> active use by the application.

IME our users run enough postgres instances, across a lot of differing
workloads, that manual tuning like that will rarely if ever happen :(. I miss
well educated DBAs :(. A large portion of users doesn't even have direct
access to the server, only via the postgres protocol...

If we were to use these hints, it'd have to happen automatically from within
postgres. That does seem viable, but it's certainly also not exactly
filesystem independent...
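Doing that from within postgres would presumably look something like the
following (untested sketch, using the generic FS_IOC_FSSETXATTR ioctl with
Dave's 1MB example value; the function name is made up):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Sketch: ask the filesystem (XFS here) to allocate this file's extents in
 * hint_bytes-sized chunks, e.g. 1 * 1024 * 1024.  Would need to happen right
 * after creating the file, before any extents have been allocated. */
static int
set_extent_size_hint(int fd, unsigned int hint_bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) != 0)
        return -1;

    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = hint_bytes;

    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

AFAICT that should be equivalent to what xfs_io's "extsize" command does,
just without shelling out.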
> > The fallocate in turn triggers slowness in the write paths, as
> > writing to uninitialized extents is a metadata operation.
>
> That is not the problem you think it is. XFS is using unwritten
> extents for all buffered IO writes that use delayed allocation, too,
> and I don't see you complaining about that....

It's a problem for buffered IO as well, it's just a bit harder to hit on many
drives, because buffered O_DSYNC writes don't use FUA. If you need any
durable writes into a file with unwritten extents, things get painful very
fast. See a few paragraphs below for the most crucial case where we need to
make sure writes are durable.

testdir=/srv/fio && \
for buffered in 0 1; do for overwrite in 0 1; do \
    echo buffered: $buffered overwrite: $overwrite; \
    rm -f $testdir/pg-extend* && \
    fio --directory=$testdir --ioengine=psync --buffered=$buffered --bs=4kB \
        --fallocate=none --overwrite=0 --rw=write --size=64MB --sync=dsync \
        --name pg-extend --overwrite=$overwrite | grep IOPS; \
done; done

buffered: 0 overwrite: 0
  write: IOPS=1427, BW=5709KiB/s (5846kB/s)(64.0MiB/11479msec); 0 zone resets
buffered: 0 overwrite: 1
  write: IOPS=4025, BW=15.7MiB/s (16.5MB/s)(64.0MiB/4070msec); 0 zone resets
buffered: 1 overwrite: 0
  write: IOPS=1638, BW=6554KiB/s (6712kB/s)(64.0MiB/9999msec); 0 zone resets
buffered: 1 overwrite: 1
  write: IOPS=3663, BW=14.3MiB/s (15.0MB/s)(64.0MiB/4472msec); 0 zone resets

That's a > 2x throughput difference. And the results would be similar with
--fdatasync=1.

If you add AIO to the mix, the difference gets way bigger, particularly on
drives with FUA support and DIO:

testdir=/srv/fio && \
for buffered in 0 1; do for overwrite in 0 1; do \
    echo buffered: $buffered overwrite: $overwrite; \
    rm -f $testdir/pg-extend* && \
    fio --directory=$testdir --ioengine=io_uring --buffered=$buffered \
        --bs=4kB --fallocate=none --overwrite=0 --rw=write --size=64MB \
        --sync=dsync --name pg-extend --overwrite=$overwrite --iodepth 32 \
        | grep IOPS; \
done; done

buffered: 0 overwrite: 0
  write: IOPS=6143, BW=24.0MiB/s (25.2MB/s)(64.0MiB/2667msec); 0 zone resets
buffered: 0 overwrite: 1
  write: IOPS=76.6k, BW=299MiB/s (314MB/s)(64.0MiB/214msec); 0 zone resets
buffered: 1 overwrite: 0
  write: IOPS=1835, BW=7341KiB/s (7517kB/s)(64.0MiB/8928msec); 0 zone resets
buffered: 1 overwrite: 1
  write: IOPS=4096, BW=16.0MiB/s (16.8MB/s)(64.0MiB/4000msec); 0 zone resets

It's less bad, but still quite a noticeable difference, on drives without
volatile caches. And it's often worse on networked storage, whether it has a
volatile cache or not.

> > It'd be great if the allocation behaviour with concurrent file extension
> > could be improved and if we could have a fallocate mode that forces
> > extents to be initialized.
>
> You mean like FALLOC_FL_WRITE_ZEROES?

I hadn't seen that it was merged, that's great! It doesn't yet seem to be
documented in the fallocate(2) man page, which is what I had checked... Hm,
it also doesn't seem to work on xfs yet :(, EOPNOTSUPP.

> That won't fix your fragmentation problem, and it has all the same pipeline
> stall problems as allocating unwritten extents in fallocate().

The primary case where FALLOC_FL_WRITE_ZEROES would be useful is WAL file
creation. WAL segments are always of the same fixed size, so there is no
fragmentation risk. To avoid metadata operations in our commit path, we today
default to forcing the segments to be fully allocated by overwriting them
with zeroes and fsyncing them; not ensuring that the extents are already
written would have a very large perf penalty (as in ~2-3x for OLTP workloads,
on XFS), both with and without DIO. To avoid paying for that zeroing over and
over, we recycle WAL files once they're not needed anymore.

Unfortunately this means that when those WAL files are not yet preallocated
(or when we have released them during low activity), performance is rather
noticeably worsened by the additional IO for pre-zeroing the WAL files. In
theory FALLOC_FL_WRITE_ZEROES should be faster than issuing writes for the
whole range (a rough sketch of how we'd use it is at the end of this mail).

> Only much worse now, because the IO pipeline is stalled for the
> entire time it takes to write the zeroes to persistent storage. i.e.
> long tail file access latencies will increase massively if you do
> this regularly to extend files.

In the WAL path we fsync at the point we could use FALLOC_FL_WRITE_ZEROES
anyway, as otherwise the WAL segment might not exist after a crash, which
would be ... bad.

Greetings,

Andres Freund
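PS: Once xfs supports it, wiring FALLOC_FL_WRITE_ZEROES into the WAL
preallocation path would presumably look roughly like the untested sketch
below, falling back to today's zero-writing (the extend_with_zeroes() helper
from the earlier sketch) where the flag isn't supported:

#define _GNU_SOURCE
#include <linux/falloc.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: preallocate a new WAL segment as fully written zeroes, then make
 * sure it survives a crash.  Falls back to plain zero writes on filesystems
 * (or kernels) that don't support FALLOC_FL_WRITE_ZEROES. */
static int
preallocate_wal_segment(int fd, off_t segment_size)
{
    if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, segment_size) != 0)
    {
        if (errno != EOPNOTSUPP)
            return -1;

        /* fall back to writing real zeroes, as we do today */
        if (extend_with_zeroes(fd, 0, segment_size) != 0)
            return -1;
    }

    /* the segment has to be durable before any WAL can be written to it */
    return fdatasync(fd);
}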