From: Kent Overstreet <kent.overstreet@linux.dev>
To: Luis Chamberlain
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Sun, 25 Feb 2024 00:24:55 -0500
Message-ID: <45odvhgymm7fxsgwpewoiiggaokxjcr7ootsamm4rwsdw26j62@5lohhv2kvofo>

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building Linux is a
> decent test, and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit. What are the limits to buffered IO, and how
> do we test that? Who keeps track of it?
>
> The obvious recurring tension is that for really high performance,
> folks just recommend using direct IO. But if you are stress testing
> changes to a filesystem and want to push buffered IO to its limits, it
> makes sense to stick to buffered IO; otherwise how else do we test it?
>
> It is good to know the limits of buffered IO too, because some
> workloads cannot use direct IO. For instance, PostgreSQL doesn't have
> direct IO support, and even as late as the end of last year we learned
> that adding direct IO to PostgreSQL would be difficult. Chris Mason
> has also noted that direct IO can force writes during reads (?)...
> Anyway, testing the limits of buffered IO to ensure you are not
> creating regressions when doing some page cache surgery seems like it
> might be useful and a sensible thing to do. The good news is we have
> not found regressions with LBS, but all the testing seems to beg the
> question: what are the limits of buffered IO anyway, and how does it
> scale? Do we know, do we care? Do we keep track of it? How does it
> compare to direct IO for some workloads? How big is the delta? How do
> we best test that? How do we automate all that? Do we want to
> automatically test this to avoid regressions?
>
> The obvious issue with some workloads for buffered IO is the possible
> penalty if you are not really re-using folios added to the page cache.
> Jens Axboe reported a while ago issues with workloads doing random
> reads over a data set 10x the size of RAM, and also proposed
> RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> like direct IO with kernel pages and a memcpy(), and it would require
> implementing for writes the same serialization that we already do for
> direct IO. There at least seems to be agreement that if we're going to
> provide an enhancement or alternative, we should strive not to repeat
> the mistakes we've made with direct IO. The rationale for some
> workloads to use buffered IO is that it helps reduce some tail
> latencies, so that's something to live up to.
>
> On that same thread Christoph also mentioned the possibility of a
> direct IO variant which can leverage the cache. Is that something we
> want to move forward with?

The thing to consider here would be an improved O_SYNC. There's a fair
amount of tree walking and thread-to-thread cacheline bouncing that
would be avoided by just calling .write_folios() and kicking bios off
from .write_iter().

OTOH, the way it's done now is probably the best possible way of
splitting up the work between multiple threads, so I'd expect this
approach to get less throughput than current O_SYNC.

Luis, are you profiling these workloads? I haven't looked at high
throughput profiles of the buffered IO path in years, and that's a good
place to start.
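
For anyone following along, here is roughly what the RWF_UNCACHED
interface Luis mentions above looks like from userspace: a minimal
sketch, assuming the flag from Jens's patchset. RWF_UNCACHED was not in
the mainline uapi headers at the time of this thread, so the flag value
below is illustrative only.

/*
 * Minimal sketch of the RWF_UNCACHED idea: buffered IO through the
 * page cache, but the kernel is asked not to keep the pages around
 * afterwards, so a random-read/write workload over a data set much
 * larger than RAM doesn't evict everything else.
 *
 * RWF_UNCACHED is from the proposed patchset, not mainline uapi at
 * the time of this thread; the value below is illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* illustrative */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(buf, 'x', sizeof(buf));

	/*
	 * Plain buffered write, except the cached pages are dropped once
	 * writeback completes. Kernels without the patch reject the
	 * unknown flag, which perror() will report.
	 */
	if (pwritev2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
		perror("pwritev2(RWF_UNCACHED)");

	close(fd);
	return 0;
}

The appeal over O_DIRECT is no alignment requirements and reads still
hit the cache when data is resident; the cost is the extra memcpy()
Chinner points out.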
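To make the improved-O_SYNC shape above concrete, a hand-wavy
kernel-side sketch. ->write_folios() is hypothetical (coined in this
mail, not an existing address_space_operations member), error handling
is elided, and nothing here is mainline code.

/*
 * Hand-wavy sketch of an "improved O_SYNC" write path: copy into the
 * page cache as usual, then submit bios for exactly the folios we
 * just wrote, directly from ->write_iter() context, instead of
 * marking them dirty and re-walking the mapping through the generic
 * writeback machinery.
 */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t improved_osync_write_iter(struct kiocb *iocb,
					 struct iov_iter *from)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	ssize_t copied;
	int err;

	/* Copy user data into page cache folios, as buffered IO does today;
	 * on success this has advanced iocb->ki_pos past the copied range. */
	copied = generic_perform_write(iocb, from);
	if (copied <= 0)
		return copied;

	/*
	 * Kick the bios off ourselves for the range just written,
	 * skipping the mapping tree walks and thread-to-thread cacheline
	 * bouncing of filemap_write_and_wait_range() and the flusher
	 * threads. Submission is single-threaded here, which is why this
	 * could end up with less throughput than current O_SYNC.
	 * ->write_folios() is the hypothetical hook named above.
	 */
	err = mapping->a_ops->write_folios(mapping, iocb->ki_pos - copied,
					   copied);
	return err ? err : copied;
}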