From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id EAB7AC001DF for ; Thu, 3 Aug 2023 07:07:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 50BCC28020F; Thu, 3 Aug 2023 03:07:42 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4957F2801EB; Thu, 3 Aug 2023 03:07:42 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2C0AE28020F; Thu, 3 Aug 2023 03:07:42 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 1694E2801EB for ; Thu, 3 Aug 2023 03:07:42 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id D894F1A1006 for ; Thu, 3 Aug 2023 07:07:41 +0000 (UTC) X-FDA: 81081913122.26.25A0AEA Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf24.hostedemail.com (Postfix) with ESMTP id 032EC18000A for ; Thu, 3 Aug 2023 07:07:39 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=idRaG8hd; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf24.hostedemail.com: domain of brauner@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=brauner@kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1691046460; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=OQElS5Lddlt0o0JlqmEWO0k2hY8bNGlhwCvrjvMV2oM=; b=602X4pIolCAnkqnXQuV0Varf6bdH8eWd7rlcRvbRuiFHxkuf/vi0RpstvbVDA9maBCQzDN shQljsM5FOz5EcvNsQWtmx6jxf4muhweWH67uRdPyLeu2SRJpRlycKekFT4WWK33kTImPA +TCAVha615+MORQgtuXaAaQEgA5JsdE= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=idRaG8hd; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf24.hostedemail.com: domain of brauner@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=brauner@kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1691046460; a=rsa-sha256; cv=none; b=YPnIGXV7+f54RXt1YABoj6y0cgpfVcCYpn5ODCbUpXzzKEJ/+bMqX1v8RnGIB4RlcanpPd q6DVz1xIzPIFlzuoct3qsmen9d3gKIo0UBH1AFKV+LOqLD7exbMNmVWGIzkj7+HLXqIV3z zvtlV3wx7TJWwEOGl2ckW9X571O92sQ= Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 161ED61C2F; Thu, 3 Aug 2023 07:07:39 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id A11F6C433C8; Thu, 3 Aug 2023 07:07:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1691046458; bh=OeoqI+Oj2qYvPhNh0DFzNOPr7pzFVMjDj3jNrQ8auks=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=idRaG8hdjFUCkTULCrK1dlyrRw8nW6SsI8wPcvzHLRBFJ7RKTCTZlaIU62uYdbg4B v0RwXTV+e7cpJngYnZGKJtq7ESuBCy9J/C6MoesGkjIOYrOfr0haegZpPN8Sl9AhEd /n7WAYLHhl7fcuzysrjeSm+K2USNWNIDAZPnKmucJcQSBzrpkTOBZyKIcqcDXFpGYt Ot014gaLbdp3bPGIltO92l77KW+89JgNgWl/HhyJnpogoXWT3yn30LmoJtXe7q8BgQ BoF4BFW//BfONhWdG7QTLXlmT0Mq4K+GIYn3JQ7ha2jeWaBCaSeemPXWQz4OJUap49 laskZlFECJL1w== Date: Thu, 3 Aug 2023 09:07:20 +0200 From: Christian Brauner To: Jeff Layton Cc: Jan Kara , Eric Van Hensbergen , Latchesar Ionkov , Dominique Martinet , Christian Schoenebeck , David Howells , Marc Dionne , Chris Mason , Josef Bacik , David Sterba , Xiubo Li , Ilya Dryomov , Jan Harkes , coda@cs.cmu.edu, Tyler Hicks , Gao Xiang , Chao Yu , Yue Hu , Jeffle Xu , Namjae Jeon , Sungjong Seo , Jan Kara , Theodore Ts'o , Andreas Dilger , Jaegeuk Kim , OGAWA Hirofumi , Miklos Szeredi , Bob Peterson , Andreas Gruenbacher , Greg Kroah-Hartman , Tejun Heo , Alexander Viro , Trond Myklebust , Anna Schumaker , Konstantin Komarov , Mark Fasheh , Joel Becker , Joseph Qi , Mike Marshall , Martin Brandenburg , Luis Chamberlain , Kees Cook , Iurii Zaikin , Steve French , Paulo Alcantara , Ronnie Sahlberg , Shyam Prasad N , Tom Talpey , Sergey Senozhatsky , Richard Weinberger , Hans de Goede , Hugh Dickins , Andrew Morton , "Darrick J. Wong" , Dave Chinner , Anthony Iliopoulos , v9fs@lists.linux.dev, linux-kernel@vger.kernel.org, linux-afs@lists.infradead.org, linux-btrfs@vger.kernel.org, ceph-devel@vger.kernel.org, codalist@coda.cs.cmu.edu, ecryptfs@vger.kernel.org, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com, linux-nfs@vger.kernel.org, ntfs3@lists.linux.dev, ocfs2-devel@lists.linux.dev, devel@lists.orangefs.org, linux-cifs@vger.kernel.org, samba-technical@lists.samba.org, linux-mtd@lists.infradead.org, linux-mm@kvack.org, linux-xfs@vger.kernel.org Subject: Re: [PATCH v6 2/7] fs: add infrastructure for multigrain timestamps Message-ID: <20230803-wellpappe-haargenau-f6cb3e3585d8@brauner> References: <20230725-mgctime-v6-0-a794c2b7abca@kernel.org> <20230725-mgctime-v6-2-a794c2b7abca@kernel.org> <20230802193537.vtuuwuwazocjbatv@quack3> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 032EC18000A X-Stat-Signature: 9eyhh1qg9izpp8tzrb78bm7pbeu6ro5t X-Rspam-User: X-HE-Tag: 1691046459-959265 X-HE-Meta: U2FsdGVkX1/99mXJqlDC98nbuLowBmcTEPi7Cyf+kEi/z0YCkifcQceOmwq/G6s1IHf/E6xiXioJ6KZaJxspBRR1WrFKTAL2dmL5V4wi2QZLAynakA8fd6+RyJMTbjcru0nA6JXGOxtWAmN5Z3dLJKSp7cL/COWv8zYiCNRYIeELxEqli5Q7sWCsXMLLbJF+s8tbtroByYgBkMaQZpQt1du8UiDTnsGiq3pVVrvhU+kEDS105oQ97b1hduWN8LzYaRTFIsw0fCsCQ/bm3IBxT4MHO9PcSyczuhokEo490pRdCL/Hl+qOQETGWJct0QCCWcyVSSAjPPB+3DHh4akoVRLt/Lmc03MH/pq2V4/BtxTb3XKyaou41H375Jcp9fiLQn51fEp6Mmne6t5VpTrrGMFXBT0u/5intoUh/e1SPFSy+4BqHEeI0uy3T2F6amRriHN5euXCGJ20do3oh8PtHMUZ3WX75H5lWhGsnQcT7jpxdOSf8NboLhCF+9l7DHloW9UMybKJVeWn0axBDloSSJtr5cGlQM/7iow96QP9xsQUoBlXXk7KIoP/nGxs6+Uozq17weyKWx3LBB+zusJ0mO9VABT9/beTd3zAgOCjwR7gchrcKuKHKPe32Tap7LgOS9A8AhsgwGjj7t+T+4rk8uV/NTf1eECvwwHWcs93ZNqUh5kc4wAvCEYuI2J1SM9eJCXifLfZa/cUxsW1T7f6gNYe9TQ7/T92oJ7aV6SqqaETQ3+7tUHvj/liEsLBQFMPRcwexXBCb7SY6kVY4Lp00VEp24yxG/rTtFAfe4ylmsQrD5GMTToroFZ9alQeDB0U0BNmbJSv+RryeU1Xo5gGStf0SPX5R8MfY4ktKoUAC+X5eDVH7ST2GQZPlmBdQPSTWWTN7e++iF6tY7fqBcUjX4co1mlBu+pNgoEQNDMn5OeFc+YTrKWA7eqqz/Q+wbaGIwvcjakJakmWOzzOunF lj53aYBk xJrhjVUovIMfXqwATFCNkd6uX0PFp99sXI+2mTBCVIBRRUtcMRMHjQ6mN4NDq5JNExeU1XcecwNDIlcT6DBW/2ZutbZchzs6ZG7ycjw9XNkxyIp27gP4t5zZi2lstHSGXkM7X/+iXVX3OpZn6RSEqO79mutxStz06+/3DhsTwXM3JH6ozTNP0uwPWdQ7RiWgZuJh9nFnx0mNsPTvaeJITtt/2NQTeMDdfKRKp404fMq0QNt432OeRzXhnyORd0u2NaQ9d4VJX8DCFxyjdK8UpOBPeNjUFP2NGTVwlmpMrNWQgkFnLjK01tYDPEi0EKBu371JQN5ixPNAP3ztkXcxlHno9JovMIOn1R0A/ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Aug 02, 2023 at 04:54:09PM -0400, Jeff Layton wrote: > On Wed, 2023-08-02 at 21:35 +0200, Jan Kara wrote: > > On Tue 25-07-23 10:58:15, Jeff Layton wrote: > > > The VFS always uses coarse-grained timestamps when updating the ctime > > > and mtime after a change. This has the benefit of allowing filesystems > > > to optimize away a lot metadata updates, down to around 1 per jiffy, > > > even when a file is under heavy writes. > > > > > > Unfortunately, this has always been an issue when we're exporting via > > > NFSv3, which relies on timestamps to validate caches. A lot of changes > > > can happen in a jiffy, so timestamps aren't sufficient to help the > > > client decide to invalidate the cache. Even with NFSv4, a lot of > > > exported filesystems don't properly support a change attribute and are > > > subject to the same problems with timestamp granularity. Other > > > applications have similar issues with timestamps (e.g backup > > > applications). > > > > > > If we were to always use fine-grained timestamps, that would improve the > > > situation, but that becomes rather expensive, as the underlying > > > filesystem would have to log a lot more metadata updates. > > > > > > What we need is a way to only use fine-grained timestamps when they are > > > being actively queried. > > > > > > POSIX generally mandates that when the the mtime changes, the ctime must > > > also change. The kernel always stores normalized ctime values, so only > > > the first 30 bits of the tv_nsec field are ever used. > > > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > > has queried the inode for the mtime or ctime. When this flag is set, > > > on the next mtime or ctime update, the kernel will fetch a fine-grained > > > timestamp instead of the usual coarse-grained one. > > > > > > Filesytems can opt into this behavior by setting the FS_MGTIME flag in > > > the fstype. Filesystems that don't set this flag will continue to use > > > coarse-grained timestamps. > > > > > > Later patches will convert individual filesystems to use the new > > > infrastructure. > > > > > > Signed-off-by: Jeff Layton > > > --- > > > fs/inode.c | 98 ++++++++++++++++++++++++++++++++++++++---------------- > > > fs/stat.c | 41 +++++++++++++++++++++-- > > > include/linux/fs.h | 45 +++++++++++++++++++++++-- > > > 3 files changed, 151 insertions(+), 33 deletions(-) > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > index d4ab92233062..369621e7faf5 100644 > > > --- a/fs/inode.c > > > +++ b/fs/inode.c > > > @@ -1919,6 +1919,21 @@ int inode_update_time(struct inode *inode, struct timespec64 *time, int flags) > > > } > > > EXPORT_SYMBOL(inode_update_time); > > > > > > +/** > > > + * current_coarse_time - Return FS time > > > + * @inode: inode. > > > + * > > > + * Return the current coarse-grained time truncated to the time > > > + * granularity supported by the fs. > > > + */ > > > +static struct timespec64 current_coarse_time(struct inode *inode) > > > +{ > > > + struct timespec64 now; > > > + > > > + ktime_get_coarse_real_ts64(&now); > > > + return timestamp_truncate(now, inode); > > > +} > > > + > > > /** > > > * atime_needs_update - update the access time > > > * @path: the &struct path to update > > > @@ -1952,7 +1967,7 @@ bool atime_needs_update(const struct path *path, struct inode *inode) > > > if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)) > > > return false; > > > > > > - now = current_time(inode); > > > + now = current_coarse_time(inode); > > > > > > if (!relatime_need_update(mnt, inode, now)) > > > return false; > > > @@ -1986,7 +2001,7 @@ void touch_atime(const struct path *path) > > > * We may also fail on filesystems that have the ability to make parts > > > * of the fs read only, e.g. subvolumes in Btrfs. > > > */ > > > - now = current_time(inode); > > > + now = current_coarse_time(inode); > > > inode_update_time(inode, &now, S_ATIME); > > > __mnt_drop_write(mnt); > > > skip_update: > > > > There are also calls in fs/smb/client/file.c:cifs_readpage_worker() and in > > fs/ocfs2/file.c:ocfs2_update_inode_atime() that should probably use > > current_coarse_time() to avoid needless querying of fine grained > > timestamps. But see below... > > > > Technically, they already devolve to current_coarse_time anyway, but > changing them would allow them to skip the fstype flag check, but I like > your idea below better anyway. > > > > @@ -2072,6 +2087,56 @@ int file_remove_privs(struct file *file) > > > } > > > EXPORT_SYMBOL(file_remove_privs); > > > > > > +/** > > > + * current_mgtime - Return FS time (possibly fine-grained) > > > + * @inode: inode. > > > + * > > > + * Return the current time truncated to the time granularity supported by > > > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged > > > + * as having been QUERIED, get a fine-grained timestamp. > > > + */ > > > +static struct timespec64 current_mgtime(struct inode *inode) > > > +{ > > > + struct timespec64 now; > > > + atomic_long_t *pnsec = (atomic_long_t *)&inode->__i_ctime.tv_nsec; > > > + long nsec = atomic_long_read(pnsec); > > > + > > > + if (nsec & I_CTIME_QUERIED) { > > > + ktime_get_real_ts64(&now); > > > + } else { > > > + struct timespec64 ctime; > > > + > > > + ktime_get_coarse_real_ts64(&now); > > > + > > > + /* > > > + * If we've recently fetched a fine-grained timestamp > > > + * then the coarse-grained one may still be earlier than the > > > + * existing one. Just keep the existing ctime if so. > > > + */ > > > + ctime = inode_get_ctime(inode); > > > + if (timespec64_compare(&ctime, &now) > 0) > > > + now = ctime; > > > + } > > > + > > > + return timestamp_truncate(now, inode); > > > +} > > > + > > > +/** > > > + * current_time - Return timestamp suitable for ctime update > > > + * @inode: inode to eventually be updated > > > + * > > > + * Return the current time, which is usually coarse-grained but may be fine > > > + * grained if the filesystem uses multigrain timestamps and the existing > > > + * ctime was queried since the last update. > > > + */ > > > +struct timespec64 current_time(struct inode *inode) > > > +{ > > > + if (is_mgtime(inode)) > > > + return current_mgtime(inode); > > > + return current_coarse_time(inode); > > > +} > > > +EXPORT_SYMBOL(current_time); > > > + > > > > So if you modify current_time() to handle multigrain timestamps the code > > will be still racy. In particular fill_mg_cmtime() can race with > > inode_set_ctime_current() like: > > > > fill_mg_cmtime() inode_set_ctime_current() > > stat->mtime = inode->i_mtime; > > stat->ctime.tv_sec = inode->__i_ctime.tv_sec; > > now = current_time(); > > /* fetches coarse > > * grained timestamp */ > > stat->ctime.tv_nsec = atomic_long_fetch_or(I_CTIME_QUERIED, pnsec) & > > ~I_CTIME_QUERIED; > > inode_set_ctime(inode, now.tv_sec, now.tv_nsec); > > > > and the information about a need for finegrained timestamp update gets > > lost. So what I'd propose is to leave current_time() alone (just always > > reporting coarse grained timestamps) and put all the magic into > > inode_set_ctime_current() only. There we need something like: > > > > struct timespec64 inode_set_ctime_current(struct inode *inode) > > { > > ... variables ... > > > > nsec = READ_ONCE(inode->__i_ctime.tv_nsec); > > if (!(nsec & I_CTIME_QUERIED)) { > > now = current_time(inode); > > > > if (!is_gmtime(inode)) { > > inode_set_ctime_to_ts(inode, now); > > } else { > > /* > > * If we've recently fetched a fine-grained > > * timestamp then the coarse-grained one may still > > * be earlier than the existing one. Just keep the > > * existing ctime if so. > > */ > > ctime = inode_get_ctime(inode); > > if (timespec64_compare(&ctime, &now) > 0) > > now = ctime; > > > > /* > > * Ctime updates are generally protected by inode > > * lock but we could have raced with setting of > > * I_CTIME_QUERIED flag. > > */ > > if (cmpxchg(&inode->__i_ctime.tv_nsec, nsec, > > now.tv_nsec) != nsec) > > goto fine_grained; > > inode->__i_ctime.tv_sec = now.tv_sec; > > } > > return now; > > } > > fine_grained: > > ktime_get_real_ts64(&now); > > inode_set_ctime_to_ts(inode, now); > > > > return now; > > } > > > > Honza > > > > This is a great idea. I'll rework the series along the lines you > suggest. That also answers my earlier question to Christian: > > I'll just resend the whole series (it's not very big anyway), and I'll > include the fill_mg_cmtime prototype change. Ok, sound good!