From: Andres Freund <andres@anarazel.de>
To: Jan Kara <jack@suse.cz>
Cc: Pankaj Raghav <pankaj.raghav@linux.dev>,
	 Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	 linux-fsdevel@vger.kernel.org,
	lsf-pc@lists.linux-foundation.org, djwong@kernel.org,
	 john.g.garry@oracle.com, willy@infradead.org, hch@lst.de,
	ritesh.list@gmail.com,  Luis Chamberlain <mcgrof@kernel.org>,
	dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
	 gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
	vi.shah@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Tue, 17 Feb 2026 11:13:07 -0500
Message-ID: <2planlrvjqicgpparsdhxipfdoawtzq3tedql72hoff4pdet6t@btxbx6cpoyc6>
In-Reply-To: <wkczfczlmstoywbmgfrxzm6ko4frjsu65kvpwquzu7obrjcd3f@6gs5nsfivc6v>

Hi,

On 2026-02-17 13:06:04 +0100, Jan Kara wrote:
> On Mon 16-02-26 10:45:40, Andres Freund wrote:
> > (*) As it turns out, it often seems to improve write throughput as well, if
> > writeback is triggered by memory pressure instead of SYNC_FILE_RANGE_WRITE,
> > linux seems to often trigger a lot more small random IO.
> >
> > > So immediately writing them might be ok as long as we don't remove those
> > > pages from the page cache like we do in RWF_UNCACHED.
> >
> > Yes, it might.  I actually often have wished for something like a
> > RWF_WRITEBACK flag...
>
> I'd call it RWF_WRITETHROUGH but otherwise it makes sense.

Heh, that makes sense. I think that's what I actually was thinking of.


> > > > An argument against this however is that it is user's responsibility to
> > > > not do non atomic IO over an atomic range and this shall be considered a
> > > > userspace usage error. This is similar to how there are ways users can
> > > > tear a dio if they perform overlapping writes. [1].
> >
> > Hm, the scope of the prohibition here is not clear to me. Would it just
> > be forbidden to do:
> >
> > P1: start pwritev(fd, [blocks 1-10], RWF_ATOMIC)
> > P2: pwrite(fd, [any block in 1-10]), non-atomically
> > P1: complete pwritev(fd, ...)
> >
> > or is it also forbidden to do:
> >
> > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> > Kernel: starts writeback but doesn't complete it
> > P1: pwrite(fd, [any block in 1-10]), non-atomically
> > Kernel: completes writeback
> >
> > The former is not at all an issue for postgres' use case, the pages in
> > our buffer pool that are undergoing IO are locked, preventing additional
> > IO (be it reads or writes) to those blocks.
> >
> > The latter would be a problem, since userspace wouldn't even know that
> > there is still "atomic writeback" going on, afaict the only way we could
> > avoid it would be to issue an f[data]sync(), which likely would be
> > prohibitively expensive.
>
> It somewhat depends on what outcome you expect in terms of crash safety :)
> Unless we are careful, the RWF_ATOMIC write in your latter example can end
> up writing some bits of the data from the second write because the second
> write may be copying data to the pages as we issue DMA from them to the
> device.
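
To make the hazard described above concrete, here is a toy model of it. It is
only a sketch under strong assumptions: the "page" is 8 bytes, the "device"
pulls it over in 2-byte chunks, the second write lands as one complete copy
between two chunks, and no stable-page or locking protection is applied. All
names in it are made up for illustration; none of this is kernel code.

```python
PAGE_SIZE = 8  # tiny "page" so the result is easy to read
CHUNK = 2      # bytes the simulated device transfers per step


def writeback_with_concurrent_overwrite(atomic_data, overwrite, overwrite_at_step):
    """Return what reaches the simulated device if the page changes mid-transfer."""
    if len(atomic_data) != len(overwrite):
        raise ValueError("both writes must cover the same range")
    page = bytearray(atomic_data)   # page cache contents after the atomic write
    device = bytearray(len(page))   # what ends up on stable storage
    for step, off in enumerate(range(0, len(page), CHUNK)):
        if step == overwrite_at_step:
            # The second, non-atomic write copies into the same page while
            # writeback of the first write is still in flight.
            page[:] = overwrite
        device[off:off + CHUNK] = page[off:off + CHUNK]
    return bytes(device)


first = b"A" * PAGE_SIZE
second = b"B" * PAGE_SIZE
print(writeback_with_concurrent_overwrite(first, second, overwrite_at_step=2))
# → b'AAAABBBB'
```

The result matches neither write, which is the mix a crash at that point
could leave behind.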

Hm. It's somewhat painful to not know when we can write in what mode again -
with DIO that's not an issue. I guess we could use
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
Although the semantics of the SFR flags aren't particularly clear, so maybe
not?
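
For reference, a minimal sketch of what that call could look like from
userspace, here via Python's ctypes so it can be tried quickly. The helper
name is hypothetical, and the sketch assumes Linux with a libc that exports
sync_file_range(). It only shows the call; whether WAIT_BEFORE would actually
be enough to know that atomic writeback has finished is exactly the open
question.

```python
import ctypes
import os

# Flag values as defined in the Linux UAPI headers.
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

# Assumption: the running process is linked against a libc that provides
# sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.sync_file_range.argtypes = [
    ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint,
]
_libc.sync_file_range.restype = ctypes.c_int


def wait_for_inflight_writeback(fd, offset, nbytes):
    """Wait for writeback already submitted for [offset, offset + nbytes).

    Hypothetical helper: it does not start new writeback and does not
    flush metadata or the device cache.
    """
    rc = _libc.sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WAIT_BEFORE)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

This would be called on the block range right before issuing a write in a
different mode, e.g. wait_for_inflight_writeback(fd, blkno * 8192, 8192),
where 8192 is just the default postgres block size.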


> I expect this isn't really acceptable because if you crash before
> the second write fully makes it to the disk, you will have inconsistent
> data.

The scenarios that I can think of that would lead us to do something like
this are ones where we overwrite data without regard for the prior contents,
e.g.:

An already partially filled page is filled with more rows and we write that
page out. Then all the rows are deleted, we re-fill the page with new content
from scratch and write it out again.  With our existing logic we treat the
second write differently, because the entire contents of the page will be in
the journal, as there is no prior content that we care about.

A second scenario in which we might not use RWF_ATOMIC, if we carry today's
logic forward, is if a newly created relation is bulk loaded in the same
transaction that created the relation. If a crash were to happen while that
bulk load is ongoing, we don't care about the contents of the file(s), as it
will never be visible to anyone after crash recovery.  In this case we won't
have prior RWF_ATOMIC writes - but we could have the opposite, i.e. an
RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page
cache. Would that be an issue?


It's possible we should just always use RWF_ATOMIC, even in the cases where
it's not needed from our side, to avoid potential performance penalties and
"undefined behaviour".  I guess that will really depend on the performance
penalty that RWF_ATOMIC will carry and whether multiple-atomicity-mode will
eventually be supported (as doing small writes during bulk loading is quite
expensive).
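
Put as code, the choice described above would look roughly like this. It is a
sketch only: the function, its parameters and the enum are hypothetical names
made up for illustration, not postgres code, and it does not issue any IO.

```python
from enum import Enum


class WriteMode(Enum):
    ATOMIC = "RWF_ATOMIC"
    PLAIN = "non-atomic"


def choose_write_mode(full_page_in_journal,
                      relation_created_in_same_xact,
                      always_atomic=False):
    """Pick the write mode for one page write-out (hypothetical policy)."""
    if always_atomic:
        # Avoid mixing modes on the same range entirely, at whatever cost
        # RWF_ATOMIC ends up carrying.
        return WriteMode.ATOMIC
    if full_page_in_journal:
        # Page was re-filled from scratch; its whole contents are in the
        # journal, so a torn write can be repaired during recovery.
        return WriteMode.PLAIN
    if relation_created_in_same_xact:
        # Bulk load into a relation that is invisible after a crash.
        return WriteMode.PLAIN
    return WriteMode.ATOMIC


print(choose_write_mode(False, False))                      # WriteMode.ATOMIC
print(choose_write_mode(True, False))                       # WriteMode.PLAIN
print(choose_write_mode(False, True, always_atomic=True))   # WriteMode.ATOMIC
```

With always_atomic=False the same block range can see both modes over time,
which is where the two questions above come from.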


Greetings,

Andres Freund

