Date: Wed, 25 Oct 2023 19:05:25 +1100
From: Dave Chinner
To: Jeff Layton
Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Christian Brauner,
    Alexander Viro, John Stultz, Thomas Gleixner, Stephen Boyd,
    Chandan Babu R, "Darrick J. Wong", Theodore Ts'o, Andreas Dilger,
    Chris Mason, Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
    Jan Kara, David Howells, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
    linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org,
    linux-mm@kvack.org, linux-nfs@vger.kernel.org
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing
References: <0a1a847af4372e62000b259e992850527f587205.camel@kernel.org>
    <61b32a4093948ae1ae8603688793f07de764430f.camel@kernel.org>

On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner wrote:
> > >
> > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner wrote:
> > > > >
> > > > > The problem is the first read request after a modification has been
> > > > > made. That is causing relatime to see mtime > atime and triggering
> > > > > an atime update. XFS sees this, does an atime update, and in
> > > > > committing that persistent inode metadata update, it calls
> > > > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > > > i_version and tells XFS to persist it.
> > > >
> > > > Could we perhaps just have a mode where we don't increment i_version
> > > > for just atime updates?
> > > >
> > > > Maybe we don't even need a mode, and could just decide that atime
> > > > updates aren't i_version updates at all?
> > >
> > > We do that already - in memory atime updates don't bump i_version at
> > > all.
> > > The issue is the rare persistent atime update requests that
> > > still happen - they are the ones that trigger an i_version bump on
> > > XFS, and one of the relatime heuristics tickle this specific issue.
> > >
> > > If we push the problematic persistent atime updates to be in-memory
> > > updates only, then the whole problem with i_version goes away....
> > >
> > > > Yes, yes, it's obviously technically a "inode modification", but does
> > > > anybody actually *want* atime updates with no actual other changes to
> > > > be version events?
> > >
> > > Well, yes, there was. That's why we defined i_version in the on disk
> > > format this way well over a decade ago. It was part of some deep
> > > dark magical HSM beans that allowed the application to combine
> > > multiple scans for different inode metadata changes into a single
> > > pass. atime changes was one of the things it needed to know about
> > > for tiering and space scavenging purposes....
> > >
> >
> > But if this is such an ancient mystical program, why do we have to
> > keep this XFS behavior in the present?
> > BTW, is this the same HSM whose DMAPI ioctls were deprecated
> > a few years back?

Drop the attitude, Amir.

That "ancient mystical program" is this:

https://buy.hpe.com/us/en/enterprise-solutions/high-performance-computing-solutions/high-performance-computing-storage-solutions/hpc-storage-solutions/hpe-data-management-framework-7/p/1010144088

Yup, that product is backed by a proprietary descendant of the Irix
XFS code base that is DMAPI enabled and still in use today. It's
called HPE XFS these days....

> > I mean, I understand that you do not want to change the behavior of
> > i_version update without an opt-in config or mount option - let the distro
> > make that choice.
> > But calling this an "on-disk format change" is a very long stretch.

Telling the person who created, defined and implemented the on disk
format that they don't know what constitutes a change of that
on-disk format seems kinda Dunning-Kruger to me....

There are *lots* of ways that di_changecount is now incompatible
with the VFS change counter. The VFS counter is now defined as
"i_version should only change when [cm]time is changed".

di_changecount is defined to be a count of the number of changes
made to the attributes of the inode. It's not just atime at issue
here - we bump di_changecount when we make any inode change,
including background work that does not otherwise change timestamps.
e.g. allocation at writeback time, unwritten extent conversion,
on-disk EOF extension at IO completion, removal of speculative
pre-allocation beyond EOF, etc.

IOWs, di_changecount was never defined as a Linux "i_version"
counter, regardless of the fact that we were originally able to
implement i_version with it - all extra bumps to di_changecount were
not important to the users of i_version for about a decade.

Unfortunately, the new i_version definition is very much
incompatible with the existing di_changecount definition and that's
the underlying problem here. i.e. the problem is not that we bump
i_version on atime, it's that di_changecount is now completely
incompatible with the new i_version change semantics.

To implement the new i_version semantics exactly, we need to add a
new field to the inode to hold this information. If we change the
on disk format like this, then the atime problems go away because
the new field would not get updated on atime updates. We'd still be
bumping di_changecount on atime updates, though, because that's what
is required by the on-disk format.

I'm really trying to avoid changing the on-disk format unless it is
absolutely necessary. If we can get the in-memory timestamp updates
to avoid tripping di_changecount updates then the atime problems go
away.
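
The mismatch between the two counter definitions described above can
be sketched as a toy userspace model. This is not kernel code: the
class, the event names and the vfs_change_attr field are hypothetical
and exist only for illustration, and the model ignores the
I_VERSION_QUERIED optimisation entirely. It only shows that a counter
bumped on every inode attribute change and a counter bumped only on
[cm]time changes drift apart as soon as background work happens.

```python
from dataclasses import dataclass

# Hypothetical event names. The first set changes [cm]time; the second
# set is the kind of work that changes the inode without touching
# [cm]time (persistent atime updates plus background extent work).
CMTIME_EVENTS = frozenset({"write", "truncate", "chmod"})
BACKGROUND_EVENTS = frozenset({
    "persistent_atime_update",
    "writeback_allocation",
    "unwritten_extent_conversion",
    "eof_extension_at_io_completion",
    "speculative_prealloc_removal",
})


@dataclass
class ToyInode:
    # Modelled on the on-disk definition: counts every attribute change.
    di_changecount: int = 0
    # Modelled on the newer VFS definition: changes only with [cm]time.
    vfs_change_attr: int = 0

    def apply(self, event: str) -> None:
        if event not in CMTIME_EVENTS and event not in BACKGROUND_EVENTS:
            raise ValueError(f"unknown event: {event}")
        # Every inode change bumps the on-disk style counter.
        self.di_changecount += 1
        # Only [cm]time changes bump the VFS style counter.
        if event in CMTIME_EVENTS:
            self.vfs_change_attr += 1


def replay(events):
    """Apply a sequence of events to a fresh toy inode and return it."""
    inode = ToyInode()
    for event in events:
        inode.apply(event)
    return inode


if __name__ == "__main__":
    inode = replay([
        "write",
        "writeback_allocation",
        "unwritten_extent_conversion",
        "persistent_atime_update",
    ])
    print(inode.di_changecount, inode.vfs_change_attr)
```

In this model, one write followed by three pieces of background work
leaves di_changecount at 4 and vfs_change_attr at 1, so a consumer
that expects [cm]time-only semantics would see three spurious changes
if it were handed the first counter.
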

If we can get [cm]time sufficiently fine grained that we don't need
i_version, then we can turn off i_version in XFS and di_changecount
ends up being entirely internal. That's what was attempted with
generic multi-grain timestamps, but that hasn't worked.

Another option is for XFS to play its own internal tricks with
[cm]time granularity and turn off i_version. e.g. limit external
timestamp visibility to 1us and use the remaining dozen bits of the
ns field to hold a change counter for updates within a single coarse
timer tick. This guarantees the timestamp changes within a coarse
tick for the purposes of change detection, but we don't expose those
bits to applications so applications that compare timestamps across
inodes won't get things back to front like was happening with the
multi-grain timestamps....

Another option is to work around the visible symptoms of the
semantic mismatch between i_version and di_changecount. The only
visible symptom we currently know about is the atime vs i_version
issue. If people are happy for us to simply ignore VFS atime
guidelines (i.e. ignore relatime/lazytime) and do completely our own
stuff with timestamp update deferral, then that also solves the
immediate issues.

> > Does xfs_repair guarantee that changes of atime, or any inode changes
> > for that matter, update i_version?

No, it does not.

> > So IMO, "atime does not update i_version" is not an "on-disk format change",
> > it is a runtime behavior change, just like lazytime is.
>
> This would certainly be my preference. I don't want to break any
> existing users though.

That's why I'm trying to get some kind of consensus on what
rules and/or atime configurations people are happy for me to break
to make it look to users like there's a viable working change
attribute being supplied by XFS without needing to change the on
disk format.

> Perhaps this ought to be a mkfs option?
> Existing XFS filesystems could still behave with the legacy behavior,
> but we could make mkfs.xfs build filesystems by default that work
> like NFS requires.

If we require mkfs to set a flag to change behaviour, then we're
talking about making an explicit on-disk format change to select the
optional behaviour. That's precisely what I want to avoid.

-Dave.
-- 
Dave Chinner
david@fromorbit.com