Date: Tue, 27 Feb 2024 05:07:30 -0500
From: Kent Overstreet
To: Luis Chamberlain
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing.
> Building Linux is a decent test, and we did some random cloud instance
> tests on that and presented them at Plumbers, but it doesn't really
> cut it if we want to push things to the limit. What are the limits of
> buffered IO, and how do we test them? Who keeps track of it?
>
> The obvious recurring tension is that for really high performance,
> folks just recommend using direct IO. But if you are stress testing
> changes to a filesystem and want to push buffered IO to its limits, it
> makes sense to stick to buffered IO; otherwise how else do we test it?
>
> It is also good to know the limits of buffered IO because some
> workloads cannot use direct IO. For instance, PostgreSQL doesn't have
> direct IO support, and even as late as the end of last year we learned
> that adding direct IO to PostgreSQL would be difficult. Chris Mason
> has also noted that direct IO can force writes during reads (?)...
> Anyway, testing the limits of buffered IO to ensure you are not
> creating regressions when doing some page cache surgery seems like a
> useful and sensible thing to do. The good news is we have not found
> regressions with LBS, but all the testing seems to beg the question:
> what are the limits of buffered IO anyway, and how does it scale? Do
> we know, do we care? Do we keep track of it? How does it compare to
> direct IO for some workloads? How big is the delta? How do we best
> test that? How do we automate all that? Do we want to automatically
> test this to avoid regressions?
>
> The obvious issue with some workloads for buffered IO is a possible
> penalty if you are not really re-using folios added to the page cache.
> Jens Axboe reported a while ago issues with workloads doing random
> reads over a data set 10x the size of RAM, and also proposed
> RWF_UNCACHED as a way to help [0].
> As Chinner put it, this seemed more like direct IO with kernel pages
> and a memcpy(), and it requires implementing further serialization
> that we already do for direct IO writes. There at least seems to be
> agreement that if we're going to provide an enhancement or
> alternative, we should strive not to make the same mistakes we made
> with direct IO. The rationale for some workloads to use buffered IO is
> that it helps reduce some tail latencies, so that's something to live
> up to.
>
> On that same thread Christoph also mentioned the possibility of a
> direct IO variant which can leverage the cache. Is that something we
> want to move forward with?
>
> Chris Mason also listed a few other desirables if we do:
>
> - Allowing concurrent writes (xfs DIO does this now)

AFAIK every filesystem allows concurrent direct writes, not just xfs;
it's _buffered_ writes that we care about here.

I just pushed a patch to my CI for buffered writes without taking the
inode lock - for bcachefs. It'll be straightforward, but a decent
amount of work, to lift this to the VFS, if people are interested in
collaborating.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

The approach is: for non-extending, non-appending writes, see if we can
pin the entire range of the pagecache we're writing to, and fall back
to taking the inode lock if we can't.

If we do a short write because of a page fault (despite previously
faulting in the userspace buffer), there is no way to completely
prevent torn writes and atomicity breakage; we could at least try a
trylock on the inode lock, but I didn't do that here.

For lifting this to the VFS, this needs:

- My darray code, which I'll be moving to include/linux/ in the 6.9
  merge window
- My pagecache add lock - we need this for synchronization with hole
  punching and truncate when we don't have the inode lock
- My vectorized buffered write path lifted to filemap.c, which means we need some sort of vectorized replacement for .write_begin and .write_end