Date: Tue, 27 Feb 2024 05:07:30 -0500
From: Kent Overstreet
To: Luis Chamberlain
Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to do some performance
> tests on XFS to ensure things are not regressing.
> Building Linux is a decent test, and we did some random cloud instance
> tests on that and presented them at Plumbers, but it doesn't really
> cut it if we want to push things to the limit. What are the limits of
> buffered IO, and how do we test them? Who keeps track of it?
>
> The obvious recurring tension is that for really high performance,
> folks just recommend using direct IO. But if you are stress testing
> changes to a filesystem and want to push buffered IO to its limits, it
> makes sense to stick to buffered IO; otherwise how else do we test it?
>
> It is also good to know the limits of buffered IO because some
> workloads cannot use direct IO. For instance, PostgreSQL doesn't have
> direct IO support, and even as late as the end of last year we learned
> that adding direct IO to PostgreSQL would be difficult. Chris Mason
> has also noted that direct IO can force writes during reads (?)...
> Anyway, testing the limits of buffered IO to ensure you are not
> creating regressions when doing some page cache surgery seems like a
> useful and sensible thing to do. The good news is we have not found
> regressions with LBS, but all the testing seems to beg the question:
> what are the limits of buffered IO anyway, and how does it scale? Do
> we know, do we care? Do we keep track of it? How does it compare to
> direct IO for some workloads? How big is the delta? How do we best
> test that? How do we automate all that? Do we want to automatically
> test this to avoid regressions?
>
> The obvious issue with some workloads for buffered IO is a possible
> penalty if you are not really re-using folios added to the page cache.
> Jens Axboe reported a while ago issues with workloads doing random
> reads over a data set 10x the size of RAM, and also proposed
> RWF_UNCACHED as a way to help [0].
> As Chinner put it, this seemed more like direct IO with kernel pages
> and a memcpy(), and it requires implementing further serialization
> that we already do for direct IO writes. There at least seems to be
> agreement that if we're going to provide an enhancement or
> alternative, we should strive not to make the same mistakes we made
> with direct IO. The rationale for some workloads to use buffered IO is
> that it helps reduce some tail latencies, so that's something to live
> up to.
>
> On that same thread Christoph also mentioned the possibility of a
> direct IO variant which can leverage the cache. Is that something we
> want to move forward with?
>
> Chris Mason also listed a few other desirables if we do:
>
> - Allowing concurrent writes (xfs DIO does this now)

AFAIK every filesystem allows concurrent direct writes, not just xfs;
it's _buffered_ writes that we care about here.

I just pushed a patch to my CI for buffered writes without taking the
inode lock - for bcachefs. It'll be straightforward, but a decent
amount of work, to lift this to the VFS, if people are interested in
collaborating.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

The approach is: for non-extending, non-appending writes, see if we can
pin the entire range of the pagecache we're writing to, and fall back
to taking the inode lock if we can't.

If we do a short write because of a page fault (despite previously
faulting in the userspace buffer), there is no way to completely
prevent torn writes and atomicity breakage; we could at least try a
trylock on the inode lock, but I didn't do that here.

For lifting this to the VFS, this needs:

- My darray code, which I'll be moving to include/linux/ in the 6.9
  merge window
- My pagecache add lock - we need this for synchronization with hole
  punching and truncate when we don't have the inode lock
- My vectorized buffered write path lifted to filemap.c, which means we need some sort of vectorized replacement for .write_begin and .write_end