Date: Wed, 28 Feb 2024 09:13:05 +1100
From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet
Cc: Luis Chamberlain, lsf-pc@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason,
	Johannes Weiner, Matthew Wilcox, Linus Torvalds
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > Part of the testing we have done with LBS was to do some performance
> > tests on XFS to ensure things are not regressing. Building Linux is a
> > decent test, and we did some random cloud instance tests on that and
> > presented them at Plumbers, but it doesn't really cut it if we want
> > to push things to the limit. What are the limits to buffered IO, and
> > how do we test that? Who keeps track of it?
> >
> > The obvious recurring tension is that for really high performance,
> > folks just recommend using direct IO. But if you are stress testing
> > changes to a filesystem and want to push buffered IO to its limits,
> > it makes sense to stick to buffered IO; otherwise, how else do we
> > test it?
> >
> > It is good to know the limits of buffered IO too, because some
> > workloads cannot use direct IO. For instance, PostgreSQL doesn't
> > have direct IO support, and even as late as the end of last year we
> > learned that adding direct IO to PostgreSQL would be difficult.
> > Chris Mason has also noted that direct IO can force writes during
> > reads (?)... Anyway, testing the limits of buffered IO to ensure you
> > are not creating regressions when doing some page cache surgery
> > seems like a useful and sensible thing to do. The good news is we
> > have not found regressions with LBS, but all the testing begs the
> > question: what are the limits of buffered IO anyway, and how does it
> > scale? Do we know? Do we care? Do we keep track of it? How does it
> > compare to direct IO for some workloads? How big is the delta? How
> > do we best test that? How do we automate all that? Do we want to
> > automatically test this to avoid regressions?
> >
> > The obvious issue with buffered IO for some workloads is the
> > possible penalty if you are not really re-using folios added to the
> > page cache. Jens Axboe reported a while ago issues with workloads
> > doing random reads over a data set 10x the size of RAM and also
> > proposed RWF_UNCACHED as a way to help [0]. As Chinner put it, this
> > seemed more like direct IO with kernel pages and a memcpy(), and it
> > requires implementing the same serialization we already do for
> > direct IO writes. There at least seems to be agreement that if we're
> > going to provide an enhancement or alternative, we should strive not
> > to make the same mistakes we've made with direct IO. The rationale
> > for some workloads to use buffered IO is that it helps reduce some
> > tail latencies, so that's something to live up to.
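
(As a concrete reference point for the interface under discussion: a
minimal userspace sketch of how an RWF_UNCACHED write would be issued
if the proposed flag were wired up through pwritev2(). The flag
definition below is hypothetical - it exists only in the RFC patches,
not in mainline uapi headers - so on an unpatched kernel this fails
with EOPNOTSUPP.)

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <stdio.h>

    #ifndef RWF_UNCACHED
    /* hypothetical value, per the RFC posting; not in mainline uapi */
    #define RWF_UNCACHED 0x00000040
    #endif

    int main(void)
    {
            char buf[4096];
            struct iovec iov = {
                    .iov_base = buf,
                    .iov_len  = sizeof(buf),
            };
            int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

            if (fd < 0)
                    return 1;
            memset(buf, 'x', sizeof(buf));

            /*
             * Write through the page cache, but ask the kernel to drop
             * the pages once writeback completes instead of letting
             * them push out other cached data - "direct IO with kernel
             * pages and a memcpy()", as described above.
             */
            if (pwritev2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
                    perror("pwritev2"); /* EOPNOTSUPP without the patches */
            return 0;
    }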
> >
> > On that same thread Christoph also mentioned the possibility of a
> > direct IO variant which can leverage the cache. Is that something we
> > want to move forward with?
> >
> > Chris Mason also listed a few other desirables if we do:
> >
> > - Allowing concurrent writes (xfs DIO does this now)
>
> AFAIK every filesystem allows concurrent direct writes, not just xfs;
> it's _buffered_ writes that we care about here.

We could do concurrent buffered writes in XFS - we would just use the
same locking strategy as direct IO and fall back on folio locks for
copy-in exclusion like ext4 does. The real question is how much of
userspace that would break, because of implicit assumptions that the
kernel has always serialised buffered writes.

> I just pushed a patch to my CI for buffered writes without taking the
> inode lock - for bcachefs. It'll be straightforward, but a decent
> amount of work, to lift this to the VFS, if people are interested in
> collaborating.

Yeah, XFS would just revert to shared inode locking - we still need the
inode lock for things like truncate/fallocate exclusion.

> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking
>
> The approach is: for non-extending, non-appending writes, see if we
> can pin the entire range of the pagecache we're writing to; fall back
> to taking the inode lock if we can't.

XFS just falls back to exclusive locking if the file needs extending.

> If we do a short write because of a page fault (despite previously
> faulting in the userspace buffer), there is no way to completely
> prevent torn writes and atomicity breakage; we could at least try a
> trylock on the inode lock - I didn't do that here.

As soon as we go for concurrent writes, we give up on any concept of
atomicity of buffered writes (esp. w.r.t. reads), so this really
doesn't matter at all.

> For lifting this to the VFS, this needs:
>
> - My darray code, which I'll be moving to include/linux/ in the 6.9
>   merge window
> - My pagecache add lock - we need this for synchronization with hole
>   punching and truncate when we don't have the inode lock.
> - My vectorized buffered write path lifted to filemap.c, which means
>   we need some sort of vectorized replacement for .write_begin and
>   .write_end

I don't think we need any of that - I think you're overcomplicating it.
As long as the filesystem has a mechanism that works for concurrent DIO
writes, it can just reuse that for concurrent buffered writes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
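
To make that fallback strategy concrete: a toy userspace model of
"shared inode lock plus folio locks for copy-in exclusion", with a
pthread rwlock standing in for the XFS iolock and per-page mutexes
standing in for folio locks. Everything here is invented for
illustration - it sketches the locking pattern being discussed, not
any actual kernel code path.

    #include <pthread.h>
    #include <stdbool.h>
    #include <string.h>

    #define FOLIO_SIZE 4096
    #define NR_FOLIOS  1024

    static pthread_rwlock_t iolock = PTHREAD_RWLOCK_INITIALIZER;
    static pthread_mutex_t folio_lock[NR_FOLIOS];
    static char file_data[NR_FOLIOS * FOLIO_SIZE];   /* toy "file" */
    static size_t file_size;

    static void buffered_write(size_t pos, const char *buf, size_t len)
    {
            bool extends;

            pthread_rwlock_rdlock(&iolock);
            extends = pos + len > file_size;
            if (extends) {
                    /*
                     * Extending write: fall back to exclusive locking,
                     * the same strategy the direct IO path uses.
                     */
                    pthread_rwlock_unlock(&iolock);
                    pthread_rwlock_wrlock(&iolock);
            }

            for (size_t off = pos; off < pos + len; ) {
                    size_t idx = off / FOLIO_SIZE;
                    size_t n = FOLIO_SIZE - off % FOLIO_SIZE;

                    if (n > pos + len - off)
                            n = pos + len - off;

                    /*
                     * The per-folio lock provides copy-in exclusion
                     * between concurrent shared-lock writers.
                     */
                    pthread_mutex_lock(&folio_lock[idx]);
                    memcpy(file_data + off, buf + (off - pos), n);
                    pthread_mutex_unlock(&folio_lock[idx]);
                    off += n;
            }

            /* Only true when the lock is held exclusive. */
            if (pos + len > file_size)
                    file_size = pos + len;
            pthread_rwlock_unlock(&iolock);
    }

    int main(void)
    {
            char buf[2 * FOLIO_SIZE];

            for (int i = 0; i < NR_FOLIOS; i++)
                    pthread_mutex_init(&folio_lock[i], NULL);
            memset(buf, 'a', sizeof(buf));
            buffered_write(0, buf, sizeof(buf)); /* extending: exclusive */
            buffered_write(100, buf, 1000);      /* in-bounds: shared */
            return 0;
    }

Concurrent in-bounds writers only contend on the folios they actually
overlap; truncate and fallocate would still need the exclusive side of
the lock, which is the exclusion point mentioned above.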