From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 28 Feb 2024 19:57:38 -0500
From: Kent Overstreet <kent.overstreet@linux.dev>
To: Dave Chinner <david@fromorbit.com>
Cc: Amir Goldstein <amir73il@gmail.com>, 
	Pankaj Raghav <p.raghav@samsung.com>, Jens Axboe <axboe@kernel.dk>, Chris Mason <clm@fb.com>, 
	Matthew Wilcox <willy@infradead.org>, Daniel Gomez <da.gomez@samsung.com>, 
	linux-mm <linux-mm@kvack.org>, Luis Chamberlain <mcgrof@kernel.org>, 
	Johannes Weiner <hannes@cmpxchg.org>, linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org, 
	Linus Torvalds <torvalds@linux-foundation.org>, Christoph Hellwig <hch@lst.de>, 
	Josef Bacik <josef@toxicpanda.com>, Jan Kara <jack@suse.cz>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing
 buffered IO
Message-ID: <j6cvqvq2az45kj5tjepbklm7r3h24rl4mj65ygf3uozaseauuv@hdr7tmidxx5u>
References: <Zdkxfspq3urnrM6I@bombadil.infradead.org>
 <xhymmlbragegxvgykhaddrkkhc7qn7soapca22ogbjlegjri35@ffqmquunkvxw>
 <Zd5ecZbF5NACZpGs@dread.disaster.area>
 <d2zbdldh5l6flfwzcwo6pnhjpoihfiaafl7lqeqmxdbpgoj77y@fjpx3tcc4oev>
 <Zd5lORiOCUsARPWq@dread.disaster.area>
 <CAOQ4uxi=fdjXq7q0_+0mDovmBd6Afb=xteFBSnE-rUmQMJYgRQ@mail.gmail.com>
 <Zd/O/S3rdvZ8OxZJ@dread.disaster.area>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <Zd/O/S3rdvZ8OxZJ@dread.disaster.area>

On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote:
> > On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
> > <lsf-pc@lists.linux-foundation.org> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> > > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > > > > it's _buffered_ writes that we care about here.
> > > > >
> > > > > We could do concurrent buffered writes in XFS - we would just use
> > > > > the same locking strategy as direct IO and fall back on folio locks
> > > > > for copy-in exclusion like ext4 does.
> > > >
> > > > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > > > just like everyone else.
> > >
> > > Uhuh. ext4 does allow concurrent DIO writes. It's just much more
> > > constrained than XFS. See ext4_dio_write_checks().
> > >
> > > > > The real question is how much of userspace will that break, because
> > > > > of implicit assumptions that the kernel has always serialised
> > > > > buffered writes?
> > > >
> > > > What would break?
> > >
> > > Good question. If you don't know the answer, then you've got the
> > > same problem as I have. i.e. we don't know if concurrent
> > > applications that use buffered IO extensively (eg. postgres) assume
> > > data coherency because of the implicit serialisation occurring
> > > during buffered IO writes?
> > >
> > > > > > If we do a short write because of a page fault (despite previously
> > > > > > faulting in the userspace buffer), there is no way to completely prevent
> > > > > > torn writes and atomicity breakage; we could at least try a trylock on
> > > > > > the inode lock, I didn't do that here.
> > > > >
> > > > > As soon as we go for concurrent writes, we give up on any concept of
> > > > > atomicity of buffered writes (esp. w.r.t reads), so this really
> > > > > doesn't matter at all.
> > > >
> > > > We've already given up buffered write vs. read atomicity, have for a
> > > > long time - buffered read path takes no locks.
> > >
> > > We still have explicit buffered read() vs buffered write() atomicity
> > > in XFS via buffered reads taking the inode lock shared (see
> > > xfs_file_buffered_read()) because that's what POSIX says we should
> > > have.
> > >
> > > Essentially, we need to explicitly give POSIX the big finger and
> > > state that there are no atomicity guarantees given for write() calls
> > > of any size, nor are there any guarantees for data coherency for
> > > any overlapping concurrent buffered IO operations.
> > >
> > 
> > I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> > as the other fs ever since Jan has added the invalidate_lock and I believe
> > that Meta kernel has done that way before.
> > 
> > > Those are things we haven't completely given up yet w.r.t. buffered
> > > IO, and enabling concurrent buffered writes will expose that to users.
> > > So we need to have explicit policies for this and document them
> > > clearly in all the places that application developers might look
> > > for behavioural hints.
> > 
> > That's doable - I can try to do that.
> > What is your take regarding opt-in/opt-out of legacy behavior?
> 
> Screw the legacy code, don't even make it an option. No-one should
> be relying on large buffered writes being atomic anymore, and with
> high order folios in the page cache most small buffered writes are
> going to be atomic w.r.t. both reads and writes anyway.

That's a new take...

> 
> > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > option would make more sense for consistent and clear semantics across
> > the fs - it is easier if all buffered IO to inode behaved the same way.
> 
> No mount options, just change the behaviour. Applications already
> have to avoid concurrent overlapping buffered reads and writes if
> they care about data integrity and coherency, so making buffered
> writes concurrent doesn't change anything.

Honestly - no.

Userspace would really like to see some sort of definition for this kind
of behaviour, and if we just change things underneath them without
telling anyone, _that's a dick move_.

POSIX_FADV_TORN_RW is a terrible name, though.

And fadvise() is the wrong API for this because it applies to ranges;
this should be an open flag or an fcntl.
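
To make the scoping difference concrete, here is a rough sketch (not a
proposed interface): posix_fadvise() takes an offset and a length, so its
advice is inherently per-range, while fcntl(F_SETFL) acts on the whole open
file description. No flag for opting out of buffered write atomicity exists
today; O_APPEND is used below purely as a stand-in for "some per-description
setting", and the helper names are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Range-scoped: posix_fadvise() always takes (fd, offset, len, advice), so
 * two callers could give different advice for different parts of one file.
 * Returns 0 on success or an error number (it does not set errno).
 */
static int hint_range(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
}

/*
 * Description-scoped: F_SETFL has no offset or length; the flag applies to
 * every IO issued through this open file description. O_APPEND is only a
 * stand-in here, since no "torn writes are fine" flag exists.
 * Returns 0 on success, -1 on error.
 */
static int set_description_flag(int fd, int flag)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | flag);
}

/* Nonzero if the status flag is set on the open file description. */
static int has_description_flag(int fd, int flag)
{
	int flags = fcntl(fd, F_GETFL);

	return flags != -1 && (flags & flag) != 0;
}
```

Note that a flag set with F_SETFL is also visible through a dup()ed
descriptor, because both descriptors share one open file description. That
gives a single, unambiguous answer for all IO on that description, which a
per-range hint can't.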