From: Jan Kara <jack@suse.cz>
To: Jeff Layton <jlayton@kernel.org>
Cc: John Stultz <jstultz@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Stephen Boyd <sboyd@kernel.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Jonathan Corbet <corbet@lwn.net>,
Randy Dunlap <rdunlap@infradead.org>,
Chandan Babu R <chandan.babu@oracle.com>,
"Darrick J. Wong" <djwong@kernel.org>,
Theodore Ts'o <tytso@mit.edu>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
David Sterba <dsterba@suse.com>, Hugh Dickins <hughd@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chuck Lever <chuck.lever@oracle.com>,
Vadim Fedorenko <vadim.fedorenko@linux.dev>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-btrfs@vger.kernel.org, linux-nfs@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v8 02/12] fs: add infrastructure for multigrain timestamps
Date: Tue, 1 Oct 2024 15:20:27 +0200 [thread overview]
Message-ID: <20241001132027.ynzp4sahjek5umbb@quack3> (raw)
In-Reply-To: <20241001-mgtime-v8-2-903343d91bc3@kernel.org>
On Tue 01-10-24 06:58:56, Jeff Layton wrote:
> The VFS has always used coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
>
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide when to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g backup
> applications).
>
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
>
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> as a flag that indicates whether the current timestamps have been
> queried via stat() or the like. When it's set, we allow the kernel to
> use a fine-grained timestamp iff it's necessary to make the ctime show
> a different value.
>
> This solves the problem of being able to distinguish the timestamp
> between updates, but introduces a new problem: it's now possible for a
> file being changed to get a fine-grained timestamp. A file that is
> altered just a bit later can then get a coarse-grained one that appears
> older than the earlier fine-grained time. This violates timestamp
> ordering guarantees.
>
> To remedy this, keep a global monotonic atomic64_t value that acts as a
> timestamp floor. When we go to stamp a file, we first get the latter of
> the current floor value and the current coarse-grained time. If the
> inode ctime hasn't been queried then we just attempt to stamp it with
> that value.
>
> If it has been queried, then first see whether the current coarse time
> is later than the existing ctime. If it is, then we accept that value.
> If it isn't, then we get a fine-grained timestamp.
>
> Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> Others should be unaffected (other than being subject to the same floor
> value as multigrain filesystems).
>
> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Mostly looks good. Some smaller comments below.
> +/**
> + * current_time - Return FS time (possibly fine-grained)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> + * as having been QUERIED, get a fine-grained timestamp, but don't update
> + * the floor.
> + *
> + * For a multigrain inode, this is effectively an estimate of the timestamp
> + * that a file would receive. An actual update must go through
> + * inode_set_ctime_current().
> + */
> +struct timespec64 current_time(struct inode *inode)
> +{
> + struct timespec64 now;
> + u32 cns;
> +
> + ktime_get_coarse_real_ts64_mg(&now);
> +
> + if (!is_mgtime(inode))
> + goto out;
> +
> + /* If nothing has queried it, then coarse time is fine */
> + cns = smp_load_acquire(&inode->i_ctime_nsec);
> + if (cns & I_CTIME_QUERIED) {
> + /*
> + * If there is no apparent change, then get a fine-grained
> + * timestamp.
> + */
> + if (now.tv_nsec == (cns & ~I_CTIME_QUERIED))
> + ktime_get_real_ts64(&now);
> + }
> +out:
> + return timestamp_truncate(now, inode);
> +}
> +EXPORT_SYMBOL(current_time);
> +
> static int inode_needs_update_time(struct inode *inode)
> {
> + struct timespec64 now, ts;
> int sync_it = 0;
> - struct timespec64 now = current_time(inode);
> - struct timespec64 ts;
>
> /* First try to exhaust all avenues to not sync */
> if (IS_NOCMTIME(inode))
> return 0;
>
> + now = current_time(inode);
> +
> ts = inode_get_mtime(inode);
> if (!timespec64_equal(&ts, &now))
> - sync_it = S_MTIME;
> + sync_it |= S_MTIME;
>
> ts = inode_get_ctime(inode);
> if (!timespec64_equal(&ts, &now))
> @@ -2598,6 +2637,15 @@ void inode_nohighmem(struct inode *inode)
> }
> EXPORT_SYMBOL(inode_nohighmem);
>
> +struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
> +{
> + set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
> + inode->i_ctime_sec = ts.tv_sec;
> + inode->i_ctime_nsec = ts.tv_nsec;
> + return ts;
> +}
> +EXPORT_SYMBOL(inode_set_ctime_to_ts);
> +
> /**
> * timestamp_truncate - Truncate timespec to a granularity
> * @t: Timespec
> @@ -2630,36 +2678,75 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
> EXPORT_SYMBOL(timestamp_truncate);
>
> /**
> - * current_time - Return FS time
> - * @inode: inode.
> + * inode_set_ctime_current - set the ctime to current_time
> + * @inode: inode
> *
> - * Return the current time truncated to the time granularity supported by
> - * the fs.
> + * Set the inode's ctime to the current value for the inode. Returns the
> + * current value that was assigned. If this is not a multigrain inode, then we
> + * set it to the later of the coarse time and floor value.
> *
> - * Note that inode and inode->sb cannot be NULL.
> - * Otherwise, the function warns and returns time without truncation.
> + * If it is multigrain, then we first see if the coarse-grained timestamp is
> + * distinct from what we have. If so, then we'll just use that. If we have to
> + * get a fine-grained timestamp, then do so, and try to swap it into the floor.
> + * We accept the new floor value regardless of the outcome of the cmpxchg.
> + * After that, we try to swap the new value into i_ctime_nsec. Again, we take
> + * the resulting ctime, regardless of the outcome of the swap.
This comment seems outdated now. No floor in this function anymore...
> -struct timespec64 current_time(struct inode *inode)
> +struct timespec64 inode_set_ctime_current(struct inode *inode)
> {
> struct timespec64 now;
> + u32 cns, cur;
...
> diff --git a/fs/stat.c b/fs/stat.c
> index 41e598376d7e..381926fb405f 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -26,6 +26,35 @@
> #include "internal.h"
> #include "mount.h"
>
> +/**
> + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> + * @stat: where to store the resulting values
> + * @request_mask: STATX_* values requested
> + * @inode: inode from which to grab the c/mtime
> + *
> + * Given @inode, grab the ctime and mtime out if it and store the result
^^ of
> + * in @stat. When fetching the value, flag it as QUERIED (if not already)
> + * so the next write will record a distinct timestamp.
> + */
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> +{
Given how things worked out in the end, it seems this function doesn't need
to handle mtime at all and we can move mtime handling back to shared generic
code?
> + atomic_t *pcn = (atomic_t *)&inode->i_ctime_nsec;
> +
> + /* If neither time was requested, then don't report them */
> + if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> + stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> + return;
> + }
> +
> + stat->mtime = inode_get_mtime(inode);
> + stat->ctime.tv_sec = inode->i_ctime_sec;
> + stat->ctime.tv_nsec = (u32)atomic_read(pcn);
> + if (!(stat->ctime.tv_nsec & I_CTIME_QUERIED))
> + stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn));
> + stat->ctime.tv_nsec &= ~I_CTIME_QUERIED;
> +}
> +EXPORT_SYMBOL(fill_mg_cmtime);
> +
> /**
> * generic_fillattr - Fill in the basic attributes from the inode struct
> * @idmap: idmap of the mount the inode was found from
> @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
> stat->rdev = inode->i_rdev;
> stat->size = i_size_read(inode);
> stat->atime = inode_get_atime(inode);
> - stat->mtime = inode_get_mtime(inode);
> - stat->ctime = inode_get_ctime(inode);
> +
> + if (is_mgtime(inode)) {
> + fill_mg_cmtime(stat, request_mask, inode);
> + } else {
> + stat->ctime = inode_get_ctime(inode);
> + stat->mtime = inode_get_mtime(inode);
> + }
> +
> stat->blksize = i_blocksize(inode);
> stat->blocks = inode->i_blocks;
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e3c603d01337..23908bad166c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1653,6 +1653,17 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
> return inode_set_mtime_to_ts(inode, ts);
> }
>
> +/*
> + * Multigrain timestamps
> + *
> + * Conditionally use fine-grained ctime and mtime timestamps when there
> + * are users actively observing them via getattr. The primary use-case
> + * for this is NFS clients that use the ctime to distinguish between
> + * different states of the file, and that are often fooled by multiple
> + * operations that occur in the same coarse-grained timer tick.
Again, mtime seems unaffected by mgtime changes now.
> + */
> +#define I_CTIME_QUERIED ((u32)BIT(31))
> +
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2024-10-01 13:20 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-01 10:58 [PATCH v8 00/12] fs: multigrain timestamp redux Jeff Layton
2024-10-01 10:58 ` [PATCH v8 01/12] timekeeping: add interfaces for handling timestamps with a floor value Jeff Layton
2024-10-01 16:17 ` John Stultz
2024-10-02 14:16 ` Thomas Gleixner
2024-10-01 10:58 ` [PATCH v8 02/12] fs: add infrastructure for multigrain timestamps Jeff Layton
2024-10-01 13:20 ` Jan Kara [this message]
2024-10-01 13:34 ` Jeff Layton
2024-10-02 9:14 ` Jan Kara
2024-10-02 12:32 ` Jeff Layton
2024-10-01 10:58 ` [PATCH v8 03/12] fs: have setattr_copy handle multigrain timestamps appropriately Jeff Layton
2024-10-01 10:58 ` [PATCH v8 04/12] fs: handle delegated timestamps in setattr_copy_mgtime Jeff Layton
2024-10-01 13:32 ` Jan Kara
2024-10-01 10:58 ` [PATCH v8 05/12] fs: tracepoints around multigrain timestamp events Jeff Layton
2024-10-01 10:59 ` [PATCH v8 06/12] fs: add percpu counters for significant " Jeff Layton
2024-10-01 10:59 ` [PATCH v8 07/12] timekeeping: add percpu counter for tracking floor swap events Jeff Layton
2024-10-01 10:59 ` [PATCH v8 08/12] Documentation: add a new file documenting multigrain timestamps Jeff Layton
2024-10-01 10:59 ` [PATCH v8 09/12] xfs: switch to " Jeff Layton
2024-10-01 10:59 ` [PATCH v8 10/12] ext4: " Jeff Layton
2024-10-01 10:59 ` [PATCH v8 11/12] btrfs: convert " Jeff Layton
2024-10-01 10:59 ` [PATCH v8 12/12] tmpfs: add support for " Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241001132027.ynzp4sahjek5umbb@quack3 \
--to=jack@suse.cz \
--cc=adilger.kernel@dilger.ca \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=chandan.babu@oracle.com \
--cc=chuck.lever@oracle.com \
--cc=clm@fb.com \
--cc=corbet@lwn.net \
--cc=djwong@kernel.org \
--cc=dsterba@suse.com \
--cc=hughd@google.com \
--cc=jlayton@kernel.org \
--cc=josef@toxicpanda.com \
--cc=jstultz@google.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mhiramat@kernel.org \
--cc=rdunlap@infradead.org \
--cc=rostedt@goodmis.org \
--cc=sboyd@kernel.org \
--cc=tglx@linutronix.de \
--cc=tytso@mit.edu \
--cc=vadim.fedorenko@linux.dev \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox