Re: [LSF/MM/BPF TOPIC] Buffered atomic writes

linux-mm.kvack.org archive mirror

From: Dave Chinner <dgc@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>,
	Pankaj Raghav <pankaj.raghav@linux.dev>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	hch@lst.de, ritesh.list@gmail.com,
	Luis Chamberlain <mcgrof@kernel.org>,
	dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
	gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
	vi.shah@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Thu, 19 Feb 2026 11:32:45 +1100
Message-ID: <aZZaLQhC-nFmJBTq@dread>
In-Reply-To: <umq2nlgxqp4xbrp23zjiajwd6ombed4dfwbajuh35xd4vphyee@26g2y6a4rdnu>

On Wed, Feb 18, 2026 at 06:37:45PM +0100, Jan Kara wrote:
> On Tue 17-02-26 11:13:07, Andres Freund wrote:
> > > > P1: pwritev(fd, [blocks 1-10], RWF_ATOMIC) start & completes
> > > > Kernel: starts writeback but doesn't complete it
> > > > P1: pwrite(fd, [any block in 1-10]), non-atomically
> > > > Kernel: completes writeback
> > > >
> > > > The former is not at all an issue for postgres' use case, the pages in
> > > > our buffer pool that are undergoing IO are locked, preventing additional
> > > > IO (be it reads or writes) to those blocks.
> > > >
> > > > The latter would be a problem, since userspace wouldn't even know that
> > > > there is still "atomic writeback" going on, afaict the only way we could
> > > > avoid it would be to issue an f[data]sync(), which likely would be
> > > > prohibitively expensive.
> > >
> > > It somewhat depends on what outcome you expect in terms of crash safety :)
> > > Unless we are careful, the RWF_ATOMIC write in your latter example can end
> > > up writing some bits of the data from the second write because the second
> > > write may be copying data to the pages as we issue DMA from them to the
> > > device.
> > 
> > Hm. It's somewhat painful to not know when we can write in what mode again -
> > with DIO that's not an issue. I guess we could use
> > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) if we really needed to know?
> > Although the semantics of the SFR flags aren't particularly clear, so maybe
> > not?
> 
> If you used RWF_WRITETHROUGH for your writes (so you are sure IO has
> already started) then sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE) would
> indeed be a safe way of waiting for that IO to complete (or just wait for
> the write(2) syscall itself to complete if we make RWF_WRITETHROUGH wait
> for IO completion as Dave suggests - but I guess writes may happen from
> multiple threads so that may be not very convenient and sync_file_range(2)
> might be actually easier).

I would much prefer we don't have to rely on crappy interfaces like
sync_file_range() to handle RWF_WRITETHROUGH IO completion
processing. All it does is add complexity to error
handling/propagation in both the kernel code and the userspace code.
It takes something that is easy to get right (i.e. synchronous
completion) and replaces it with something that is easy to get
wrong. That's not good API design.
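
To make the difference concrete, here is a rough userspace sketch of
the two patterns. It is only an illustration under stated
assumptions: RWF_WRITETHROUGH is a proposal and does not exist, so
RWF_DSYNC is used purely as a stand-in for "the write call returns
after IO completion" (its guarantees are not the same), and the
function names are made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Pattern A: synchronous completion. The write call itself reports
 * completion and any IO error. RWF_DSYNC is only a stand-in here;
 * RWF_WRITETHROUGH is a proposal, not an existing flag.
 */
static int write_sync_completion(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(len > 0);
	ret = pwritev2(fd, &iov, 1, off, RWF_DSYNC);
	if (ret < 0)
		return -errno;
	/* Simplification: treat a short write as an error. */
	return (size_t)ret == len ? 0 : -EIO;
}

/*
 * Pattern B: buffered write, start writeback, wait for it later.
 * Three calls, and an error can surface at any of them.
 */
static int write_then_wait(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t ret;

	assert(len > 0);
	ret = pwrite(fd, buf, len, off);
	if (ret < 0)
		return -errno;
	if ((size_t)ret != len)
		return -EIO;
	/* Start writeback of the range without waiting for it. */
	if (sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE) < 0)
		return -errno;
	/* ... the application would do other work here ... */
	/* Wait for writeback that was already submitted for the range. */
	if (sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE) < 0)
		return -errno;
	return 0;
}
```

In pattern A the caller has one return value to check per write. In
pattern B the application has to track which ranges still need a
wait and has to decide what an error from the later wait means for
each individual write.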

As for handling multiple writes to the same range, stable pages do
that for us. RWF_WRITETHROUGH will need to set folios in the
writeback state before submission and clear it after completion so
that stable pages work correctly. Hence we may as well use that
functionality to serialise overlapping RWF_WRITETHROUGH IOs against
each other and against concurrent background and data integrity
driven writeback.

We should be trying hard to keep this simple and consistent with
existing write-through IO models that people already know how to use
(i.e. DIO).

> > > I expect this isn't really acceptable because if you crash before
> > > the second write fully makes it to the disk, you will have inconsistent
> > > data.
> > 
> > The scenarios that I can think that would lead us to doing something like
> > this, are when we are overwriting data without regard for the prior contents,
> > e.g:
> > 
> > An already partially filled page is filled with more rows, we write that page
> > out, then all the rows are deleted, and we re-fill the page with new content
> > from scratch. Write it out again.  With our existing logic we treat the second
> > write differently, because the entire contents of the page will be in the
> > journal, as there is no prior content that we care about.
> > 
> > A second scenario in which we might not use RWF_ATOMIC, if we carry today's
> > logic forward, is if a newly created relation is bulk loaded in the same
> > transaction that created the relation. If a crash were to happen while that
> > bulk load is ongoing, we don't care about the contents of the file(s), as it
> > will never be visible to anyone after crash recovery.  In this case we won't
> > have prior RWF_ATOMIC writes - but we could have the opposite, i.e. an
> > RWF_ATOMIC write while there already is non-RWF_ATOMIC dirty data in the page
> > cache. Would that be an issue?
> 
> No, this should be fine. But as I'm thinking about it what seems the most
> natural is that RWF_WRITETHROUGH writes will wait on any pages under
> writeback in the target range before proceeding with the write.

I think that is required behaviour, not merely the natural one. IMO,
concurrent overlapping physical IOs from the page cache via
RWF_WRITETHROUGH are a data corruption vector just waiting for
someone to trip over it...

i.e. we need to keep in mind that one of the guarantees that the
page cache provides is that it will never overlap multiple
concurrent physical IOs to the same physical range. Overlapping IOs
are handled and serialised at the folio level, so they should never
result in overlapping physical IO being issued.
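
As a toy model of that guarantee (not kernel code; every name below
is hypothetical), think of each folio as carrying a writeback flag
that a submitter must find clear across its whole range before it
may issue IO. The real page cache would sleep on the flag, as far as
I know via folio_wait_writeback(); this sketch returns false instead
so that it stays single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_FOLIOS 16

/* One flag per folio: true while a physical IO covers this folio. */
static bool toy_writeback[TOY_NR_FOLIOS];

static bool toy_range_under_writeback(size_t first, size_t last)
{
	for (size_t i = first; i <= last; i++)
		if (toy_writeback[i])
			return true;
	return false;
}

/*
 * Try to submit IO for folios [first, last]. Returns false when any
 * folio in the range is still under writeback, which is where a real
 * implementation would wait rather than fail.
 */
static bool toy_submit_io(size_t first, size_t last)
{
	assert(first <= last && last < TOY_NR_FOLIOS);
	if (toy_range_under_writeback(first, last))
		return false;
	for (size_t i = first; i <= last; i++)
		toy_writeback[i] = true;
	return true;
}

/* IO completion: clear the flag so that waiters may proceed. */
static void toy_complete_io(size_t first, size_t last)
{
	assert(first <= last && last < TOY_NR_FOLIOS);
	for (size_t i = first; i <= last; i++)
		toy_writeback[i] = false;
}
```

With folios 1-10 in flight, a second submission touching any of them
is held back until completion, while a disjoint range goes ahead, so
no two physical IOs ever cover the same folio at the same time.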

> That will
> give user proper serialization with other RWF_WRITETHROUGH writes to the
> overlapping range as well as writeback from previous normal writes. So the
> only case that needs handling - either by userspace or kernel forcing
> stable writes - would be RWF_WRITETHROUGH write followed by a normal write.

*nod*. I think forcing stable writes for RWF_WRITETHROUGH is the
right way to go. We are going to need stable write semantics for
RWF_ATOMIC support, and we probably should have them for RWF_DSYNC
as well because the data integrity guarantees cover the data in that
specific user IO, not any other previous, concurrent or future user
IO.
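
A toy model of why stable writes matter here (again hypothetical
names, not kernel code): the device may read the folio at any point
between submission and completion, so this sketch models the worst
case where it reads at completion time. Without stable writes a
later buffered write changes the data the earlier IO puts on disk.
With stable writes the later write has to wait, which the sketch
shows by returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_FOLIO_SIZE 8

struct toy_folio {
	char data[TOY_FOLIO_SIZE];
	bool writeback;
};

static void toy_start_writeback(struct toy_folio *f)
{
	f->writeback = true;
}

/* Worst case: the device reads the folio only at completion time. */
static void toy_end_writeback(struct toy_folio *f, char *disk)
{
	assert(f->writeback);
	memcpy(disk, f->data, TOY_FOLIO_SIZE);
	f->writeback = false;
}

/*
 * Buffered write into the folio. With stable writes enabled, a folio
 * under writeback must not be modified; a real implementation would
 * sleep until writeback ends, this sketch returns false instead.
 */
static bool toy_buffered_write(struct toy_folio *f, bool stable,
			       const char *src)
{
	if (stable && f->writeback)
		return false;
	memcpy(f->data, src, TOY_FOLIO_SIZE);
	return true;
}
```

Run with stable set to false, a write of new data during writeback
succeeds and the "disk" ends up with the new data instead of what
was submitted. Run with stable set to true, the same write is held
back and the disk gets exactly the data of the first IO.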

-Dave.

-- 
Dave Chinner
dgc@kernel.org
