From: Kent Overstreet <kent.overstreet@linux.dev>
To: Luis Chamberlain
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Sun, 25 Feb 2024 00:24:55 -0500
Message-ID: <45odvhgymm7fxsgwpewoiiggaokxjcr7ootsamm4rwsdw26j62@5lohhv2kvofo>

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing. Building Linux is a
> decent test, and we did some random cloud instance tests on that and
> presented that at Plumbers, but it doesn't really cut it if we want to
> push things to the limit. What are the limits to buffered IO, and how
> do we test that? Who keeps track of it?
>
> The obvious recurring tension is that for really high performance,
> folks just recommend using direct IO. But if you are stress testing
> changes to a filesystem and want to push buffered IO to its limits, it
> makes sense to stick to buffered IO; otherwise how else do we test it?
>
> It is good to know the limits of buffered IO too, because some
> workloads cannot use direct IO. For instance, PostgreSQL doesn't have
> direct IO support, and even as late as the end of last year we learned
> that adding direct IO to PostgreSQL would be difficult. Chris Mason
> has also noted that direct IO can force writes during reads (?)...
> Anyway, testing the limits of buffered IO to ensure you are not
> creating regressions when doing some page cache surgery seems like it
> might be useful and a sensible thing to do. The good news is we have
> not found regressions with LBS, but all the testing seems to beg the
> question: what are the limits of buffered IO anyway, and how does it
> scale? Do we know, do we care? Do we keep track of it? How does it
> compare to direct IO for some workloads? How big is the delta? How do
> we best test that? How do we automate all that? Do we want to
> automatically test this to avoid regressions?
>
> The obvious issue with some workloads for buffered IO is the possible
> penalty if you are not really re-using folios added to the page cache.
> Jens Axboe reported a while ago issues with workloads doing random
> reads over a data set 10x the size of RAM, and also proposed
> RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more
> like direct IO with kernel pages and a memcpy(), and it would require
> implementing for writes the same serialization that we already do for
> direct IO. There at least seems to be agreement that if we're going to
> provide an enhancement or alternative, we should strive not to repeat
> the mistakes we've made with direct IO. The rationale for some
> workloads to use buffered IO is that it helps reduce some tail
> latencies, so that's something to live up to.
>
> On that same thread Christoph also mentioned the possibility of a
> direct IO variant which can leverage the cache. Is that something we
> want to move forward with?

The thing to consider here would be an improved O_SYNC. There's a fair
amount of tree walking and thread-to-thread cacheline bouncing that
would be avoided by just calling .write_folios() and kicking bios off
from .write_iter().

OTOH, the way it's done now is probably the best possible way of
splitting up the work between multiple threads, so I'd expect this
approach to get less throughput than current O_SYNC.

Luis, are you profiling these workloads? I haven't looked at high
throughput profiles of the buffered IO path in years, and that's a good
place to start.
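
For anyone following along, here is roughly what the RWF_UNCACHED
interface Luis mentions above looks like from userspace: a minimal
sketch, assuming the flag from Jens's patchset. RWF_UNCACHED was not in
the mainline uapi headers at the time of this thread, so the flag value
below is illustrative only.

/*
 * Minimal sketch of the RWF_UNCACHED idea: buffered IO through the
 * page cache, but the kernel is asked not to keep the pages around
 * afterwards, so a random-read/write workload over a data set much
 * larger than RAM doesn't evict everything else.
 *
 * RWF_UNCACHED is from the proposed patchset, not mainline uapi at
 * the time of this thread; the value below is illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* illustrative */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(buf, 'x', sizeof(buf));

	/*
	 * Plain buffered write, except the cached pages are dropped once
	 * writeback completes. Kernels without the patch reject the
	 * unknown flag, which perror() will report.
	 */
	if (pwritev2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
		perror("pwritev2(RWF_UNCACHED)");

	close(fd);
	return 0;
}

The appeal over O_DIRECT is no alignment requirements and reads still
hit the cache when data is resident; the cost is the extra memcpy()
Chinner points out.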
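To make the improved-O_SYNC shape above concrete, a hand-wavy
kernel-side sketch. ->write_folios() is hypothetical (coined in this
mail, not an existing address_space_operations member), error handling
is elided, and nothing here is mainline code.

/*
 * Hand-wavy sketch of an "improved O_SYNC" write path: copy into the
 * page cache as usual, then submit bios for exactly the folios we
 * just wrote, directly from ->write_iter() context, instead of
 * marking them dirty and re-walking the mapping through the generic
 * writeback machinery.
 */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t improved_osync_write_iter(struct kiocb *iocb,
					 struct iov_iter *from)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	ssize_t copied;
	int err;

	/* Copy user data into page cache folios, as buffered IO does today;
	 * on success this has advanced iocb->ki_pos past the copied range. */
	copied = generic_perform_write(iocb, from);
	if (copied <= 0)
		return copied;

	/*
	 * Kick the bios off ourselves for the range just written,
	 * skipping the mapping tree walks and thread-to-thread cacheline
	 * bouncing of filemap_write_and_wait_range() and the flusher
	 * threads. Submission is single-threaded here, which is why this
	 * could end up with less throughput than current O_SYNC.
	 * ->write_folios() is the hypothetical hook named above.
	 */
	err = mapping->a_ops->write_folios(mapping, iocb->ki_pos - copied,
					   copied);
	return err ? err : copied;
}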