Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO


From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Amir Goldstein <amir73il@gmail.com>,
	Pankaj Raghav <p.raghav@samsung.com>,
	Jens Axboe <axboe@kernel.dk>, Chris Mason <clm@fb.com>,
	Matthew Wilcox <willy@infradead.org>,
	Daniel Gomez <da.gomez@samsung.com>,
	linux-mm <linux-mm@kvack.org>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Christoph Hellwig <hch@lst.de>,
	Josef Bacik <josef@toxicpanda.com>, Jan Kara <jack@suse.cz>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Mon, 4 Mar 2024 11:46:52 +1100
Message-ID: <ZeUZ/DWuCvR1423G@dread.disaster.area>
In-Reply-To: <j6cvqvq2az45kj5tjepbklm7r3h24rl4mj65ygf3uozaseauuv@hdr7tmidxx5u>

On Wed, Feb 28, 2024 at 07:57:38PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> > > That's doable - I can try to do that.
> > > What is your take regarding opt-in/opt-out of legacy behavior?
> > 
> > Screw the legacy code, don't even make it an option. No-one should
> > be relying on large buffered writes being atomic anymore, and with
> > high order folios in the page cache most small buffered writes are
> > going to be atomic w.r.t. both reads and writes anyway.
> 
> That's a new take...
> 
> > 
> > > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > > option would make more sense for consistent and clear semantics across
> > > the fs - it is easier if all buffered IO to inode behaved the same way.
> > 
> > No mount options, just change the behaviour. Applications already
> > have to avoid concurrent overlapping buffered reads and writes if
> > they care about data integrity and coherency, so making buffered
> > writes concurrent doesn't change anything.
> 
> Honestly - no.
> 
> Userspace would really like to see some sort of definition for this kind
> of behaviour, and if we just change things underneath them without
> telling anyone, _that's a dick move_.

I don't think you understand the full picture here, Kent.

> POSIX_FADV_TORN_RW is a terrible name, though.

The described behaviour for this advice is the standard behaviour
for ext4, btrfs and most Linux filesystems other than XFS. It has
been for a -long- time.

The only filesystem that gives anything resembling POSIX atomic
write behaviour is XFS. No other filesystem in Linux actually
provides the POSIX "buffered reads won't see partial data from
buffered writes in progress" behaviour that XFS does via the IOLOCK
behaviour it uses.

So when I say "screw the legacy apps" I'm talking about the ancient
enterprise applications that still behave as if this POSIX behaviour
is reliable on modern Linux systems. It simply isn't, and these apps
are *already implicitly broken* on most Linux filesystems and they
need fixing.
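
To make "avoid concurrent overlapping buffered reads and writes"
concrete, here is a minimal sketch of an application serialising its
own I/O with POSIX advisory byte-range locks instead of relying on the
filesystem. The helper names (range_lock, locked_pwrite, locked_pread)
are hypothetical and only for illustration. The locks are per-process
and advisory, so this only helps between cooperating processes that all
take the lock; threads in one process would need their own rwlock.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Set an advisory lock state on [off, off + len); blocks until granted. */
static int range_lock(int fd, short type, off_t off, off_t len)
{
    struct flock fl = {
        .l_type = type,        /* F_RDLCK, F_WRLCK or F_UNLCK */
        .l_whence = SEEK_SET,
        .l_start = off,
        .l_len = len,          /* note: 0 would mean "to end of file" */
    };
    return fcntl(fd, F_SETLKW, &fl);
}

/* Hypothetical helper: pwrite() under an exclusive lock on the range. */
ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    if (len == 0)
        return 0;
    if (range_lock(fd, F_WRLCK, off, (off_t)len) < 0)
        return -1;
    ssize_t n = pwrite(fd, buf, len, off);   /* may still be a short write */
    int saved = errno;                       /* keep pwrite()'s errno */
    range_lock(fd, F_UNLCK, off, (off_t)len);
    errno = saved;
    return n;
}

/* Hypothetical helper: pread() under a shared lock on the range. */
ssize_t locked_pread(int fd, void *buf, size_t len, off_t off)
{
    if (len == 0)
        return 0;
    if (range_lock(fd, F_RDLCK, off, (off_t)len) < 0)
        return -1;
    ssize_t n = pread(fd, buf, len, off);    /* may still be a short read */
    int saved = errno;
    range_lock(fd, F_UNLCK, off, (off_t)len);
    errno = saved;
    return n;
}
```

A reader in another process calling locked_pread() on an overlapping
range waits until the writer's locked_pwrite() has dropped its lock, so
it does not observe a half-completed write regardless of which
filesystem is underneath.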

> And fadvise() is the wrong API for this because it applies to ranges,
> this should be an open flag or an fcntl.

Not only is it the wrong API, it's also the wrong approach to take.
We have a new API coming through for atomic writes: RWF_ATOMIC.

If an application needs an actual atomic IO guarantee, it will soon
be able to be explicit in its requirements, and it will not end up
in the situation where the filesystem it uses determines whether an
implicit atomic write behaviour is provided.
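
As a rough sketch of what "being explicit" could look like from
userspace: the flag is passed per call through pwritev2(). The helper
name pwrite_atomic is hypothetical, and the fallback value for
RWF_ATOMIC is an assumption taken from later kernel uapi headers for
toolchains whose headers do not define it yet. Whether a given kernel
and filesystem accept the flag, and with what size, alignment or open
mode constraints, is not something this sketch can promise; expect an
error return where it is not supported.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for pwritev2() */
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Assumption: value from later uapi headers; older headers lack it. */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Hypothetical helper: ask the kernel to write buf at off as a single
 * all-or-nothing unit. Returns len on success, -1 with errno set if
 * the kernel or filesystem rejects the request.
 */
ssize_t pwrite_atomic(int fd, const void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    ssize_t n = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

    /* A partial result would defeat the purpose; report it as an error. */
    if (n >= 0 && (size_t)n != len) {
        errno = EIO;
        return -1;
    }
    return n;
}
```

The point of the per-call flag is that the requirement travels with the
write itself: a caller that gets -1 knows the guarantee is not
available and can fall back to its own locking, rather than silently
depending on which filesystem happens to be mounted.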

Indeed, we don't actually say that XFS will always guarantee POSIX
atomic buffered IO semantics - we've just never decided that the
time is right to change the behaviour.

In making such a change to XFS, normal buffered writes will get
mostly the same behaviour as they do now because we now use high
order folios in the page cache and serialisation will be done
against high-order ranges rather than individual pages. And
applications that actually need atomic IO semantics can use
RWF_ATOMIC and in that case we can do explicitly serialised buffered
writes that lock out concurrent buffered reads as we do right now.

IOWs, there is no better time to convert XFS behaviour to match all
the other Linux filesystems than right now. Applications that need
atomic IO guarantees can use RWF_ATOMIC, and everyone else can get
the performance benefits that come from no longer trying to make
buffered IO implicitly "atomic".

-Dave.
-- 
Dave Chinner
david@fromorbit.com
