Date: Sun, 25 Feb 2024 16:29:58 -0500
From: Kent Overstreet
To: Linus Torvalds
Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox wrote:
> >
> > There's also the small random 64 byte read case that we haven't optimised
> > for yet. That also bottlenecks on the page refcount atomic op.
> >
> > The proposed solution to that was double-copy; look up the page without
> > bumping its refcount, copy to a buffer, look up the page again to be
> > sure it's still there, copy from the buffer to userspace.
>
> Please stop the cray-cray.
>
> Yes, cache dirtying is expensive. But you don't actually have
> cacheline ping-pong, because you don't have lots of different CPU's
> hammering the same page cache page in any normal circumstances. So the
> really expensive stuff just doesn't exist.

It's not ping-pong; the big usercopy blows the cachelines you want out
of L1, since hardware caches aren't fully associative.

> I think you've been staring at profiles too much. In instruction-level
> profiles, the atomic ops stand out a lot. But that's at least partly
> artificial - they are a serialization point on x86, so things get
> accounted to them. So they tend to be the collection point for
> everything around them in an OoO CPU.

Yes, which leads to a fun game of whack-a-mole: eliminate one atomic op
and everything just ends up piling up behind a different one. But for
the buffered read path, the folio get/put are the only atomic ops.
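
For reference, the hot path today is roughly a folio get and put
bracketing the copy - a sketch only, not the actual filemap_read()
code, with NULL/error handling omitted:

	folio = filemap_get_folio(mapping, index);	/* atomic: refcount get */
	copy_folio_to_iter(folio, offset, len, iter);	/* copy to userspace */
	folio_put(folio);				/* atomic: refcount put */

and the double-copy approach quoted above would turn that into
something like this (again just a sketch; the lockless lookup and
recheck helpers here are made up):

	rcu_read_lock();
	folio = lockless_lookup(mapping, index);	/* no refcount bump */
	memcpy(buf, folio_address(folio) + offset, len);
	ok = folio_still_valid(mapping, index, folio);	/* recheck after copy */
	rcu_read_unlock();
	if (!ok)
		goto slow_path;				/* take the ref after all */
	copy_to_iter(buf, len, iter);			/* second copy, to userspace */

i.e. you trade the two atomics for an extra copy out of a small buffer
that's already hot in L1.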

> For example, the fact that Kent complains about the page cache and
> talks about large folios is completely ludicrous. I've seen the
> benchmarks of real loads. Kent - you're not close to any limits, you
> are often a factor of two to five off other filesystems. We're not
> talking "a few percent", and we're not talking "the atomics are
> hurting".

Yes, there's a bunch of places where bcachefs is still slow; it'll get
there :)

If you've got those benchmarks handy and they're ones I haven't seen,
I'd love to take a look. The one that always jumps out at people is
small O_DIRECT reads, and that hasn't been a priority because O_DIRECT
doesn't matter to most people nearly as much as they think it does.

There's a bunch of stuff still to work through. Another that comes to
mind: we need a free inodes btree to eliminate scanning in inode
create, and that was half a day of work - except it also needs
sharding (i.e. leaf nodes can't span certain boundaries), and for that
I need variable sized btree nodes so we aren't burning stupid amounts
of memory - and that's something we need anyways, with the number of
btrees growing like it is.

Another fun one that I just discovered while I was hanging out at
Darrick's: the journal was stalling on high iodepth workloads. The
device write buffer fills up, write latency goes up, and suddenly the
journal can't write quickly enough when it's only submitting one write
at a time. So there's a fix queued up for 6.9 that lets the journal
keep multiple writes in flight (rough sketch of the idea at the end of
this mail).

That one was worth mentioning because another fix would've been to add
a way to signal backpressure to /above/ the filesystem, so that we
don't hit such big queuing delays within the filesystem; right now
user writes don't hit backpressure until submit_bio() blocks because
the request queue is full. I've been seeing other performance corner
cases where it looks like such a mechanism would be helpful.

I expect I've got a solid year or two ahead of me of mostly just
working through performance bugs - standing up a lot of automated perf
testing and whatnot. But, one thing at a time...
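
For the curious, the shape of the journal change is roughly this - a
sketch only, with made-up names rather than the actual bcachefs code:
instead of waiting for the previous journal write to complete before
issuing the next one, cap writes in flight at a small pipeline depth:

	#define JOURNAL_PIPELINE_DEPTH	4	/* illustrative, not the real number */

	static void journal_write_done(struct journal *j)
	{
		atomic_dec(&j->writes_in_flight);
		wake_up(&j->wait);
	}

	static void journal_write_submit(struct journal *j, struct journal_buf *buf)
	{
		/* previously this was effectively "wait until nothing is in flight" */
		wait_event(j->wait,
			   atomic_read(&j->writes_in_flight) < JOURNAL_PIPELINE_DEPTH);
		atomic_inc(&j->writes_in_flight);
		submit_journal_bio(j, buf);	/* completion path calls journal_write_done() */
	}

The point being that once device write latency goes up, journal
throughput is no longer capped at one write per device round trip.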