Date: Mon, 4 Mar 2024 11:46:52 +1100
From: Dave Chinner
To: Kent Overstreet
Cc: Amir Goldstein, Pankaj Raghav, Jens Axboe, Chris Mason,
    Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
    Johannes Weiner, linux-fsdevel@vger.kernel.org,
    lsf-pc@lists.linux-foundation.org, Linus Torvalds,
    Christoph Hellwig, Josef Bacik, Jan Kara
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Wed, Feb 28, 2024 at 07:57:38PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> > > That's doable - I can try to do that.
> > > What is your take regarding opt-in/opt-out of legacy behavior?
> >
> > Screw the legacy code, don't even make it an option. No-one should
> > be relying on large buffered writes being atomic anymore, and with
> > high order folios in the page cache most small buffered writes are
> > going to be atomic w.r.t. both reads and writes anyway.
>
> That's a new take...
>
> >
> > > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > > option would make more sense for consistent and clear semantics across
> > > the fs - it is easier if all buffered IO to inode behaved the same way.
> >
> > No mount options, just change the behaviour. Applications already
> > have to avoid concurrent overlapping buffered reads and writes if
> > they care about data integrity and coherency, so making buffered
> > writes concurrent doesn't change anything.
>
> Honestly - no.
>
> Userspace would really like to see some sort of definition for this kind
> of behaviour, and if we just change things underneath them without
> telling anyone, _that's a dick move_.

I don't think you understand the full picture here, Kent.

> POSIX_FADV_TORN_RW is a terrible name, though.

The described behaviour for this advice is the standard behaviour
for ext4, btrfs and most Linux filesystems other than XFS. It has
been for a -long- time. The only filesystem that gives anything
resembling POSIX atomic write behaviour is XFS.

No other filesystem in Linux actually provides the POSIX "buffered
reads won't see partial data from buffered writes in progress"
behaviour that XFS does via the IOLOCK behaviour it uses.

So when I say "screw the legacy apps" I'm talking about the ancient
enterprise applications that still behave as if this POSIX behaviour
is reliable on modern Linux systems. It simply isn't, and these apps
are *already implicitly broken* on most Linux filesystems and they
need fixing.

> And fadvise() is the wrong API for this because it applies to ranges,
> this should be an open flag or an fcntl.

Not only is it the wrong API, it's also the wrong approach to take.
We have a new API coming through for atomic writes: RWF_ATOMIC.
If an application needs an actual atomic IO guarantee, it will soon
be able to be explicit about its requirements, and it will not end up
in the situation where the filesystem it runs on determines whether
an implicit atomic write behaviour is provided. Indeed, we don't
actually say that XFS will always guarantee POSIX atomic buffered IO
semantics - we've just never decided that the time is right to
change the behaviour.

In making such a change to XFS, normal buffered writes will get
mostly the same behaviour as they do now because we now use high
order folios in the page cache and serialisation will be done
against high-order ranges rather than individual pages. And
applications that actually need atomic IO semantics can use
RWF_ATOMIC and in that case we can do explicitly serialised buffered
writes that lock out concurrent buffered reads as we do right now.

IOWs, there is no better time to convert XFS behaviour to match all
the other Linux filesystems than right now. Applications that need
atomic IO guarantees can use RWF_ATOMIC, and everyone else can get
the performance benefits that come from no longer trying to make
buffered IO implicitly "atomic".

-Dave.
-- 
Dave Chinner
david@fromorbit.com
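
For illustration only: the point above that applications "already
have to avoid concurrent overlapping buffered reads and writes" can
be met in userspace without relying on any filesystem-specific
behaviour. The sketch below is one possible approach, not something
proposed in this thread. It wraps pread()/pwrite() in POSIX advisory
byte-range locks via Python's fcntl.lockf(). The helper names
locked_pwrite(), locked_pread() and demo() are hypothetical. The
locks are advisory and owned per process, so they only help when
every cooperating process uses them, and they do not serialise
threads within a single process.

```python
import fcntl
import os
import tempfile


def locked_pwrite(fd: int, data: bytes, offset: int) -> int:
    """Write data at offset while holding an exclusive lock on that range."""
    # Hypothetical helper: blocks until no other process holds a
    # conflicting advisory lock on [offset, offset + len(data)).
    fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
    try:
        written = 0
        # pwrite() may write fewer bytes than requested, so loop.
        while written < len(data):
            written += os.pwrite(fd, data[written:], offset + written)
        return written
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)


def locked_pread(fd: int, length: int, offset: int) -> bytes:
    """Read length bytes at offset while holding a shared lock on that range."""
    # A shared lock lets readers run concurrently with each other while
    # excluding any writer that takes the exclusive lock above.
    fcntl.lockf(fd, fcntl.LOCK_SH, length, offset, os.SEEK_SET)
    try:
        chunks = []
        remaining = length
        while remaining > 0:
            chunk = os.pread(fd, remaining, offset + length - remaining)
            if not chunk:  # reached end of file
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)


def demo() -> bool:
    """Write a 1 MiB record and read it back through the locked helpers."""
    with tempfile.TemporaryFile() as f:  # opened read/write
        fd = f.fileno()
        record = b"A" * (1 << 20)
        locked_pwrite(fd, record, 0)
        return locked_pread(fd, len(record), 0) == record
```

A reader in another process that goes through locked_pread() either
sees the whole record or waits for the writer to finish, whichever
filesystem the file lives on. A reader that bypasses the lock gets
no such guarantee.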