From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id BAB95C54798
	for <linux-mm@archiver.kernel.org>; Tue, 27 Feb 2024 14:57:40 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 4D46B28001B; Tue, 27 Feb 2024 09:57:40 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 484D2280017; Tue, 27 Feb 2024 09:57:40 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 373C928001B; Tue, 27 Feb 2024 09:57:40 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id 2946E280017
	for <linux-mm@kvack.org>; Tue, 27 Feb 2024 09:57:40 -0500 (EST)
Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay04.hostedemail.com (Postfix) with ESMTP id C8A4D1A0C0F
	for <linux-mm@kvack.org>; Tue, 27 Feb 2024 14:57:39 +0000 (UTC)
X-FDA: 81837887838.15.C760299
Received: from out-175.mta0.migadu.com (out-175.mta0.migadu.com [91.218.175.175])
	by imf22.hostedemail.com (Postfix) with ESMTP id 73AD0C000D
	for <linux-mm@kvack.org>; Tue, 27 Feb 2024 14:57:36 +0000 (UTC)
Authentication-Results: imf22.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b=q1J9vV39;
	spf=pass (imf22.hostedemail.com: domain of kent.overstreet@linux.dev designates 91.218.175.175 as permitted sender) smtp.mailfrom=kent.overstreet@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1709045856;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=u7soQeQMolVADcku4EZWS0C8/aJsrG13W4BWj0N1xKY=;
	b=I/J5TTe58dZMBwUbU3RG2LDUiJpuP87q8+wmAff+ekkE5Kg3B5LH5Syl+xpMENH8saUETr
	8Fy2455QPOZC8SIaFEnlBDqMRWDD+GWl+qoiUv7eD0BSs9DjNTNt5Z7U5dhlL1UpbXwgdH
	ak+2XClJ7YIRduOiocjpVvOJJ0qm5O4=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709045857; a=rsa-sha256;
	cv=none;
	b=VLh8Suw5tn2U5Nf/SJ9s2YtsSCwgz2xF34laaJ2+ojp7QNjhpg3WoKi+JsTZWAyhLlTwg+
	xrxT+GvILS0LBC4m2SoPswTvB/t27rsMX0D/IuXEHaKMc44GwM6BtXCUh1IrKOEOkov9Q7
	MJGDZyyOD8K+poB+mKhR8OoEhKnTMqk=
ARC-Authentication-Results: i=1;
	imf22.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b=q1J9vV39;
	spf=pass (imf22.hostedemail.com: domain of kent.overstreet@linux.dev designates 91.218.175.175 as permitted sender) smtp.mailfrom=kent.overstreet@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
Date: Tue, 27 Feb 2024 09:57:25 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1709045853;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=u7soQeQMolVADcku4EZWS0C8/aJsrG13W4BWj0N1xKY=;
	b=q1J9vV39pPY36rQVXE+3UVDqPULGJhnH+50vQQh1MB0jPtpuXig21WzDSD+byu8URD8p5u
	vC8pAjNZONuXIoiow771GQG0+kFSmcEQmL+K8E5Htacxqb5p6hOoee4wv7fNbAPu96FsPo
	3R6iD6oSMxwzKjpmVaETZ8dq8s9sshI=
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Kent Overstreet <kent.overstreet@linux.dev>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, 
	linux-mm <linux-mm@kvack.org>, Daniel Gomez <da.gomez@samsung.com>, 
	Pankaj Raghav <p.raghav@samsung.com>, Jens Axboe <axboe@kernel.dk>, Dave Chinner <david@fromorbit.com>, 
	Christoph Hellwig <hch@lst.de>, Chris Mason <clm@fb.com>, Johannes Weiner <hannes@cmpxchg.org>, 
	Matthew Wilcox <willy@infradead.org>, Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Message-ID: <mdhbz72ifvdzgjj3wfgx6ascxaspgaj6vdckqo4qt3mn3uojy7@aqtnnfywjesu>
References: <Zdkxfspq3urnrM6I@bombadil.infradead.org>
 <xhymmlbragegxvgykhaddrkkhc7qn7soapca22ogbjlegjri35@ffqmquunkvxw>
 <Zd3s-SPx_EnDXJzs@bombadil.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Zd3s-SPx_EnDXJzs@bombadil.infradead.org>
X-Migadu-Flow: FLOW_OUT
X-Rspamd-Queue-Id: 73AD0C000D
X-Rspam-User: 
X-Rspamd-Server: rspam11
X-Stat-Signature: qkpsc67pwuih5968ue664wxtw3n8skrc
X-HE-Tag: 1709045856-171461
X-HE-Meta: U2FsdGVkX18ckU1pGUVkzUqO+VXoE/sjWYEnEpYT2URuQpqzc07tDvGuLSmAbIflB0UAKSP6dXIqtaIcbYmux90QnULsZgwVNO0mFQRbNCoFNOigiQCWk1m/yN+I5rR/rFz64uQOC3Y3JdzVOwLJ6um8VNIAduqLgMqKoPx4fEFiPcoBynIcVNO2d/hyTR5qDqgmfKjcO9X2eilunffjTSjO5ctbTQABj8jls/95/hIdmmbIivvA50zYOW2v/DXdCQvUxWpx48HyKx8ztpfDefVp9hOe1hj5zq3rmDsP0JSHywE95MCe6pbVCqmVYuP8vSlcKjDN/rpyYHWR+YiVJdyq6K6ph49AjqzoBTgNKjvOC05SCDxnlOwdu1mb/1wKiWc+H9HFM6DSbm14zopKLRV8CW4uDUYk5nm8NUXKGNqSqJIuKAUBFoUmkp+SxENre+Wx0YgKlCjRWo2ikC6oEudTV8svSVB0+Wz+OTzPWqX23cTpwQupLvTfgBjvxqWvVUBxFcex0hoITdn2S/EdYbfj128iyo5GiEuH8kmy/NcErpsIeKJ5EgMVL0QLwtEc1ejXmbDodg3L8nFgudTEjfN48us6Fi6Vr/PspfE+SU1qgBze2F1KalTC6NfK8XeOWCq3jX/t3lRgnZwwF8qxE/U6Xrobb4A9sZkiiTTonB+YmqG5sAsuOHfo5NYgnI7aNQRUA2YGXDtcFeoq7/OSrpxUb544SOkgpaS/z/GRRpxX5LCDTpG70Hab+hRtbX9KGkogkR09nyU0o05xghDQKjSOlF1AP+wl7+ohEhd1hkc=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Tue, Feb 27, 2024 at 06:08:57AM -0800, Luis Chamberlain wrote:
> On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > > Part of the testing we have done with LBS was to do some performance
> > > tests on XFS to ensure things are not regressing. Building linux is a
> > > fine decent test and we did some random cloud instance tests on that and
> > > presented that at Plumbers, but it doesn't really cut it if we want to
> > > push things to the limit though. What are the limits to buffered IO
> > > and how do we test that? Who keeps track of it?
> > > 
> > > The obvious recurring tension is that for really high performance folks
> > > just recommend to use birect IO. But if you are stress testing changes
> > > to a filesystem and want to push buffered IO to its limits it makes
> > > sense to stick to buffered IO, otherwise how else do we test it?
> > > 
> > > It is good to know limits to buffered IO too because some workloads
> > > cannot use direct IO.  For instance PostgreSQL doesn't have direct IO
> > > support and even as late as the end of last year we learned that adding
> > > direct IO to PostgreSQL would be difficult.  Chris Mason has noted also
> > > that direct IO can also force writes during reads (?)... Anyway, testing
> > > the limits of buffered IO limits to ensure you are not creating
> > > regressions when doing some page cache surgery seems like it might be
> > > useful and a sensible thing to do .... The good news is we have not found
> > > regressions with LBS but all the testing seems to beg the question, of what
> > > are the limits of buffered IO anyway, and how does it scale? Do we know, do
> > > we care? Do we keep track of it? How does it compare to direct IO for some
> > > workloads? How big is the delta? How do we best test that? How do we
> > > automate all that? Do we want to automatically test this to avoid regressions?
> > > 
> > > The obvious issues with some workloads for buffered IO is having a
> > > possible penality if you are not really re-using folios added to the
> > > page cache. Jens Axboe reported a while ago issues with workloads with
> > > random reads over a data set 10x the size of RAM and also proposed
> > > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> > > like direct IO with kernel pages and a memcpy(), and it requires
> > > further serialization to be implemented that we already do for
> > > direct IO for writes. There at least seems to be agreement that if we're
> > > going to provide an enhancement or alternative that we should strive to not
> > > make the same mistakes we've done with direct IO. The rationale for some
> > > workloads to use buffered IO is it helps reduce some tail latencies, so
> > > that's something to live up to.
> > > 
> > > On that same thread Christoph also mentioned the possibility of a direct
> > > IO variant which can leverage the cache. Is that something we want to
> > > move forward with?
> > > 
> > > Chris Mason also listed a few other desirables if we do:
> > > 
> > > - Allowing concurrent writes (xfs DIO does this now)
> > 
> > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > it's _buffered_ writes that we care about here.
> 
> The context above was a possible direct IO variant, that's why direct IO
> was mentioned and that XFS at least had support.
> 
> > I just pushed a patch to my CI for buffered writes without taking the
> > inode lock - for bcachefs. It'll be straightforward, but a decent amount
> > of work, to lift this to the VFS, if people are interested in
> > collaborating.
> > 
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking
> 
> Neat, this is sort of what I wanted to get a sense for, if this sort of
> topic was worth discussing at LSFMM.
> 
> > The approach is: for non extending, non appending writes, see if we can
> > pin the entire range of the pagecache we're writing to; fall back to
> > taking the inode lock if we can't.
> 
> Perhaps a silly thought... but initial reaction is, would it make sense
> for the page cache to make this easier for us, so we have this be
> easier? It is not clear to me but my first reaction to seeing some of
> these deltas was what if we had something like the space split up, as we
> do with XFS agcounts, and so each group deals with its own ranges. I
> considered this before profiling, and as with Matthew I figured it might
> be lock contenton.  It very likely is not for my test case, and as Linus
> and Dave has clarified we are both penalized and also have a
> singlthreaded writeback.  If we had a group split we'd have locks per
> group and perhaps a writeback a dedicated thread per group.

Wtf are you talking about?