Date: Tue, 17 Feb 2026 11:13:07 -0500
From: Andres Freund
To: Jan Kara
Cc: Pankaj Raghav, Ojaswin Mujoo, linux-xfs@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    lsf-pc@lists.linux-foundation.org, djwong@kernel.org,
    john.g.garry@oracle.com, willy@infradead.org, hch@lst.de,
    ritesh.list@gmail.com, Luis Chamberlain, dchinner@redhat.com,
    Javier Gonzalez, gost.dev@samsung.com, tytso@mit.edu,
    p.raghav@samsung.com, vi.shah@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
Message-ID: <2planlrvjqicgpparsdhxipfdoawtzq3tedql72hoff4pdet6t@btxbx6cpoyc6>
References: <7cf3f249-453d-423a-91d1-dfb45c474b78@linux.dev>

Hi,

On 2026-02-17 13:06:04 +0100, Jan Kara wrote:
> On Mon 16-02-26 10:45:40, Andres Freund wrote:
> > (*) As it turns out, it often seems to improves write throughput as well, if
> > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> > linux seems to often trigger a lot more small random IO.
> >
> > > So immediately writing them might be ok as long as we don't remove those
> > > pages from the page cache like we do in RWF_UNCACHED.
> >
> > Yes, it might. I actually often have wished for something like a
> > RWF_WRITEBACK flag...
>
> I'd call it RWF_WRITETHROUGH but otherwise it makes sense.

Heh, that makes sense. I think that's what I actually was thinking of.

> > > > An argument against this however is that it is user's responsibility to
> > > > not do non atomic IO over an atomic range and this shall be considered a
> > > > userspace usage error. This is similar to how there are ways users can
> > > > tear a dio if they perform overlapping writes. [1].
> >
> > Hm, the scope of the prohibition here is not clear to me. Would it just
> > be forbidden to do:
> >
> > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC)
> > P2: pwrite(fd, [any block in 1-10]), non-atomically
> > P1: complete pwritev(fd, ...)
> >
> > or is it also forbidden to do:
> >
> > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> > Kernel: starts writeback but doesn't complete it
> > P1: pwrite(fd, [any block in 1-10]), non-atomically
> > Kernel: completes writeback
> >
> > The former is not at all an issue for postgres' use case, the pages in
> > our buffer pool that are undergoing IO are locked, preventing additional
> > IO (be it reads or writes) to those blocks.
> >
> > The latter would be a problem, since userspace wouldn't even know that
> > here is still "atomic writeback" going on, afaict the only way we could
> > avoid it would be to issue an f[data]sync(), which likely would be
> > prohibitively expensive.
>
> It somewhat depends on what outcome you expect in terms of crash safety :)
> Unless we are careful, the RWF_ATOMIC write in your latter example can end
> up writing some bits of the data from the second write because the second
> write may be copying data to the pages as we issue DMA from them to the
> device.

Hm. It's somewhat painful to not know when we can write in what mode again -
with DIO that's not an issue. I guess we could use
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
Although the semantics of the SFR flags aren't particularly clear, so maybe
not?

> I expect this isn't really acceptable because if you crash before
> the second write fully makes it to the disk, you will have inconsistent
> data.

The scenarios that I can think of that would lead us to doing something like
this are when we are overwriting data without regard for the prior contents,
e.g.:

An already partially filled page is filled with more rows, we write that page
out, then all the rows are deleted, and we re-fill the page with new content
from scratch. Write it out again. With our existing logic we treat the second
write differently, because the entire contents of the page will be in the
journal, as there is no prior content that we care about.

A second scenario in which we might not use RWF_ATOMIC, if we carry today's
logic forward, is if a newly created relation is bulk loaded in the same
transaction that created the relation. If a crash were to happen while that
bulk load is ongoing, we don't care about the contents of the file(s), as it
will never be visible to anyone after crash recovery. In this case we won't
have prior RWF_ATOMIC writes - but we could have the opposite, i.e. an
RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page
cache. Would that be an issue?

It's possible we should just always use RWF_ATOMIC, even in the cases where
it's not needed from our side, to avoid potential performance penalties and
"undefined behaviour". I guess that will really depend on the performance
penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will
eventually be supported (as doing small writes during bulk loading is quite
expensive).

Greetings,

Andres Freund
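
For reference, a minimal sketch of the sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE)
idea mentioned above: wait for writeback that is already in flight on a block
before overwriting it non-atomically. The helper names, the 8192-byte BLCKSZ
and the temporary-file demo are assumptions made up for illustration, not
existing postgres or kernel code. Whether WAIT_BEFORE would actually be
sufficient for buffered RWF_ATOMIC writeback is exactly the open question, so
this only shows the shape of the call sequence.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for sync_file_range() in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* illustrative block size (assumption) */

/*
 * Hypothetical helper: wait until writeback that has already been submitted
 * for [offset, offset + len) has completed. This does not start new
 * writeback and does not flush metadata or the device write cache.
 */
int
wait_for_inflight_writeback(int fd, off_t offset, off_t len)
{
    assert(len > 0);
    if (sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_BEFORE) != 0)
        return -errno;
    return 0;
}

/*
 * Hypothetical helper: overwrite one block with a plain pwrite(), but only
 * after earlier writeback of that block is no longer in flight.
 */
int
overwrite_block_after_writeback(int fd, unsigned blockno, const char *buf)
{
    off_t   offset = (off_t) blockno * BLCKSZ;
    ssize_t written;
    int     rc;

    rc = wait_for_inflight_writeback(fd, offset, BLCKSZ);
    if (rc != 0)
        return rc;

    written = pwrite(fd, buf, BLCKSZ, offset);
    if (written != BLCKSZ)
        return written < 0 ? -errno : -EIO;
    return 0;
}

/* Demo on an unlinked temporary file; returns 0 on success. */
int
demo_overwrite(void)
{
    char path[] = "/tmp/sfr_demo_XXXXXX";
    char buf[BLCKSZ];
    int  fd = mkstemp(path);
    int  rc;

    if (fd < 0)
        return -errno;
    unlink(path);

    memset(buf, 'a', sizeof(buf));
    rc = overwrite_block_after_writeback(fd, 0, buf);   /* first version */
    if (rc == 0)
    {
        memset(buf, 'b', sizeof(buf));
        rc = overwrite_block_after_writeback(fd, 0, buf);   /* overwrite */
    }
    close(fd);
    return rc;
}
```

Calling demo_overwrite() from a small main() exercises both helpers. Note that
WAIT_BEFORE only waits for pages already submitted for write-out, so dirty
pages whose writeback has not started yet are not covered by this sketch.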