Date: Sun, 25 Feb 2024 16:29:58 -0500
From: Kent Overstreet
To: Linus Torvalds
Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox wrote:
> >
> > There's also the small random 64 byte read case that we haven't optimised
> > for yet. That also bottlenecks on the page refcount atomic op.
> >
> > The proposed solution to that was double-copy; look up the page without
> > bumping its refcount, copy to a buffer, look up the page again to be
> > sure it's still there, copy from the buffer to userspace.
>
> Please stop the cray-cray.
>
> Yes, cache dirtying is expensive. But you don't actually have
> cacheline ping-pong, because you don't have lots of different CPU's
> hammering the same page cache page in any normal circumstances. So the
> really expensive stuff just doesn't exist.

It's not ping-pong; the big usercopy blows the cachelines you want out
of L1, since hardware caches aren't fully associative.

> I think you've been staring at profiles too much. In instruction-level
> profiles, the atomic ops stand out a lot. But that's at least partly
> artificial - they are a serialization point on x86, so things get
> accounted to them. So they tend to be the collection point for
> everything around them in an OoO CPU.

Yes, which leads to a fun game of whack-a-mole: eliminate one atomic op
and everything just ends up piling up behind a different one. But for
the buffered read path, the folio get/put are the only atomic ops.
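
For reference, the hot path today is roughly a folio get and put
bracketing the copy - a sketch only, not the actual filemap_read()
code, with NULL/error handling omitted:

	folio = filemap_get_folio(mapping, index);	/* atomic: refcount get */
	copy_folio_to_iter(folio, offset, len, iter);	/* copy to userspace */
	folio_put(folio);				/* atomic: refcount put */

and the double-copy approach quoted above would turn that into
something like this (again just a sketch; the lockless lookup and
recheck helpers here are made up):

	rcu_read_lock();
	folio = lockless_lookup(mapping, index);	/* no refcount bump */
	memcpy(buf, folio_address(folio) + offset, len);
	ok = folio_still_valid(mapping, index, folio);	/* recheck after copy */
	rcu_read_unlock();
	if (!ok)
		goto slow_path;				/* take the ref after all */
	copy_to_iter(buf, len, iter);			/* second copy, to userspace */

i.e. you trade the two atomics for an extra copy out of a small buffer
that's already hot in L1.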

> For example, the fact that Kent complains about the page cache and
> talks about large folios is completely ludicrous. I've seen the
> benchmarks of real loads. Kent - you're not close to any limits, you
> are often a factor of two to five off other filesystems. We're not
> talking "a few percent", and we're not talking "the atomics are
> hurting".

Yes, there's a bunch of places where bcachefs is still slow; it'll get
there :)

If you've got those benchmarks handy and they're ones I haven't seen,
I'd love to take a look. The one that always jumps out at people is
small O_DIRECT reads, and that hasn't been a priority because O_DIRECT
doesn't matter to most people nearly as much as they think it does.

There's a bunch of stuff still to work through. Another that comes to
mind: we need a free inodes btree to eliminate scanning in inode
create, and that was half a day of work - except it also needs
sharding (i.e. leaf nodes can't span certain boundaries), and for that
I need variable sized btree nodes so we aren't burning stupid amounts
of memory - and that's something we need anyways, with the number of
btrees growing like it is.

Another fun one that I just discovered while I was hanging out at
Darrick's: the journal was stalling on high iodepth workloads. The
device write buffer fills up, write latency goes up, and suddenly the
journal can't write quickly enough when it's only submitting one write
at a time. So there's a fix queued up for 6.9 that lets the journal
keep multiple writes in flight (rough sketch of the idea at the end of
this mail).

That one was worth mentioning because another fix would've been to add
a way to signal backpressure to /above/ the filesystem, so that we
don't hit such big queuing delays within the filesystem; right now
user writes don't hit backpressure until submit_bio() blocks because
the request queue is full. I've been seeing other performance corner
cases where it looks like such a mechanism would be helpful.

I expect I've got a solid year or two ahead of me of mostly just
working through performance bugs - standing up a lot of automated perf
testing and whatnot. But, one thing at a time...
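
For the curious, the shape of the journal change is roughly this - a
sketch only, with made-up names rather than the actual bcachefs code:
instead of waiting for the previous journal write to complete before
issuing the next one, cap writes in flight at a small pipeline depth:

	#define JOURNAL_PIPELINE_DEPTH	4	/* illustrative, not the real number */

	static void journal_write_done(struct journal *j)
	{
		atomic_dec(&j->writes_in_flight);
		wake_up(&j->wait);
	}

	static void journal_write_submit(struct journal *j, struct journal_buf *buf)
	{
		/* previously this was effectively "wait until nothing is in flight" */
		wait_event(j->wait,
			   atomic_read(&j->writes_in_flight) < JOURNAL_PIPELINE_DEPTH);
		atomic_inc(&j->writes_in_flight);
		submit_journal_bio(j, buf);	/* completion path calls journal_write_done() */
	}

The point being that once device write latency goes up, journal
throughput is no longer capped at one write per device round trip.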