linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Chuck Lever <chuck.lever@oracle.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: cel@kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, Hugh Dickens <hughd@google.com>,
	yukuai3@huawei.com, yangerkun@huaweicloud.com
Subject: Re: [RFC PATCH 2/2] libfs: Improve behavior when directory offset values wrap
Date: Mon, 18 Nov 2024 15:58:09 -0500	[thread overview]
Message-ID: <ZzuqYeENJJrLMxwM@tissot.1015granger.net> (raw)
In-Reply-To: <c65399390e8d8d3c7985ecf464e63b30c824f685.camel@kernel.org>

On Mon, Nov 18, 2024 at 03:00:56PM -0500, Jeff Layton wrote:
> On Sun, 2024-11-17 at 16:32 -0500, cel@kernel.org wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > The fix in commit 64a7ce76fb90 ("libfs: fix infinite directory reads
> > for offset dir") introduced a fence in offset_iterate_dir() to stop
> > the loop from returning child entries created after the directory
> > was opened. This comparison relies on the strong ordering of
> > DIR_OFFSET_MIN <= largest child offset <= next_offset to terminate
> > the directory iteration.
> > 
> > However, because simple_offset_add() uses mtree_alloc_cyclic() to
> > select each next new directory offset, ctx->next_offset is not
> > always the highest unused offset. Once mtree_alloc_cyclic() allows
> > a new offset value to wrap, ctx->next_offset will be set to a value
> > less than the actual largest child offset.
> > 
> > The result is that readdir(3) no longer shows any entries in the
> > directory because their offsets are above ctx->next_offset, which is
> > now a small value. This situation is persistent, and the directory
> > cannot be removed unless all current children are already known and
> > can be explicitly removed by name first.
> > 
> > In the current Maple tree implementation, there is no practical way
> > that 63-bit offset values can ever wrap, so this issue is cleverly
> > avoided. But the ordering dependency is not documented via comments
> > or code, making the mechanism somewhat brittle. And it makes the
> > continued use of mtree_alloc_cyclic() somewhat confusing.
> > 
> > Further, if commit 64a7ce76fb90 ("libfs: fix infinite directory
> > reads for offset dir") were to be backported to a kernel that still
> > uses xarray to manage simple directory offsets, the directory offset
> > value range is limited to 32-bits, which is small enough to allow a
> > wrap after a few weeks of constant creation of entries in one
> > directory.
> > 
> > Therefore, replace the use of ctx->next_offset for fencing new
> > children from appearing in readdir results.
> > 
> > A jiffies timestamp marks the end of each opendir epoch. Entries
> > created after this timestamp will not be visible to the file
> > descriptor. I chose jiffies so that the dentry->d_time field can be
> > re-used for storing the entry creation time.
> > 
> > The new mechanism has its own corner cases. For instance, I think
> > if jiffies wraps twice while a directory is open, some children
> > might become invisible. On 32-bit systems, the jiffies value wraps
> > every 49 days. Double-wrapping is not a risk on systems with 64-bit
> > jiffies. Unlike with the next_offset-based mechanism, re-opening the
> > directory will make invisible children re-appear.
> > 
> > Reported-by: Yu Kuai <yukuai3@huawei.com>
> > Closes: https://lore.kernel.org/stable/20241111005242.34654-1-cel@kernel.org/T/#m1c448e5bd4aae3632a09468affcfe1d1594c6a59
> > Fixes: 64a7ce76fb90 ("libfs: fix infinite directory reads for offset dir")
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/libfs.c | 36 +++++++++++++++++-------------------
> >  1 file changed, 17 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/libfs.c b/fs/libfs.c
> > index bf67954b525b..862a603fd454 100644
> > --- a/fs/libfs.c
> > +++ b/fs/libfs.c
> > @@ -294,6 +294,7 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> >  		return ret;
> >  
> >  	offset_set(dentry, offset);
> > +	WRITE_ONCE(dentry->d_time, jiffies);
> >  	return 0;
> >  }
> >  
> > @@ -454,9 +455,7 @@ void simple_offset_destroy(struct offset_ctx *octx)
> >  
> >  static int offset_dir_open(struct inode *inode, struct file *file)
> >  {
> > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > -
> > -	file->private_data = (void *)ctx->next_offset;
> > +	file->private_data = (void *)jiffies;
> >  	return 0;
> >  }
> >  
> > @@ -473,9 +472,6 @@ static int offset_dir_open(struct inode *inode, struct file *file)
> >   */
> >  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> >  {
> > -	struct inode *inode = file->f_inode;
> > -	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
> > -
> >  	switch (whence) {
> >  	case SEEK_CUR:
> >  		offset += file->f_pos;
> > @@ -490,7 +486,8 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> >  
> >  	/* In this case, ->private_data is protected by f_pos_lock */
> >  	if (!offset)
> > -		file->private_data = (void *)ctx->next_offset;
> > +		/* Make newer child entries visible */
> > +		file->private_data = (void *)jiffies;
> >  	return vfs_setpos(file, offset, LONG_MAX);
> >  }
> >  
> > @@ -521,7 +518,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
> >  			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
> >  }
> >  
> > -static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
> > +static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx,
> > +			       unsigned long fence)
> >  {
> >  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> >  	struct dentry *dentry;
> > @@ -531,14 +529,15 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> >  		if (!dentry)
> >  			return;
> >  
> > -		if (dentry2offset(dentry) >= last_index) {
> > -			dput(dentry);
> > -			return;
> > -		}
> > -
> > -		if (!offset_dir_emit(ctx, dentry)) {
> > -			dput(dentry);
> > -			return;
> > +		/*
> > +		 * Output only child entries created during or before
> > +		 * the current opendir epoch.
> > +		 */
> > +		if (time_before_eq(dentry->d_time, fence)) {
> > +			if (!offset_dir_emit(ctx, dentry)) {
> > +				dput(dentry);
> > +				return;
> > +			}
> >  		}
> >  
> >  		ctx->pos = dentry2offset(dentry) + 1;
> > @@ -569,15 +568,14 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
> >   */
> >  static int offset_readdir(struct file *file, struct dir_context *ctx)
> >  {
> > +	unsigned long fence = (unsigned long)file->private_data;
> >  	struct dentry *dir = file->f_path.dentry;
> > -	long last_index = (long)file->private_data;
> >  
> >  	lockdep_assert_held(&d_inode(dir)->i_rwsem);
> >  
> >  	if (!dir_emit_dots(file, ctx))
> >  		return 0;
> > -
> > -	offset_iterate_dir(d_inode(dir), ctx, last_index);
> > +	offset_iterate_dir(d_inode(dir), ctx, fence);
> >  	return 0;
> >  }
> >  
> 
> Using timestamps instead of directory ordering does seem less brittle,
> and the choice to use jiffies makes sense given that d_time is also an
> unsigned long.
> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>

Precisely. The goal was to re-use as much code as possible to avoid
perturbing the current size of "struct dentry".

That said, I'm not overjoyed with using jiffies, given it has
similar wrapping issues as ctx->next_offset on 32-bit systems. The
consequences of an offset value wrap are less severe, though, since
that can no longer make children entries disappear permanently.

I've been trying to imagine a solution that does not depend on the
range of an integer value and has solidly deterministic behavior in
the face of multiple child entry creations during one timer tick.

We could partially re-use the legacy cursor/list mechanism.

* When a child entry is created, it is added at the end of the
  parent's d_children list.
* When a child entry is unlinked, it is removed from the parent's
  d_children list.

This includes creation and removal of entries due to a rename.


* When a directory is opened, mark the current end of the d_children
  list with a cursor dentry. New entries would then be added to this
  directory following this cursor dentry in the directory's
  d_children list.
* When a directory is closed, its cursor dentry is removed from the
  d_children list and freed.

Each cursor dentry would need to refer to an opendir instance
(using, say, a pointer to the "struct file" for that open) so that
multiple cursors in the same directory can reside in its d_chilren
list and won't interfere with each other. Re-use the cursor dentry's
d_fsdata field for that.


* offset_readdir gets its starting entry using the mtree/xarray to
  map ctx->pos to a dentry.
* offset_readdir continues iterating by following the .next pointer
  in the current dentry's d_child field.
* offset_readdir returns EOD when it hits the cursor dentry matching
  this opendir instance.


I think all of these operations could be O(1), but it might require
some additional locking.

-- 
Chuck Lever


  reply	other threads:[~2024-11-18 20:58 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-17 21:32 [RFC PATCH 0/2] Improve simple directory offset wrap behavior cel
2024-11-17 21:32 ` [RFC PATCH 1/2] libfs: Return ENOSPC when the directory offset range is exhausted cel
2024-11-18 19:55   ` Jeff Layton
2024-11-17 21:32 ` [RFC PATCH 2/2] libfs: Improve behavior when directory offset values wrap cel
2024-11-18 20:00   ` Jeff Layton
2024-11-18 20:58     ` Chuck Lever [this message]
2024-11-20  8:59       ` Christian Brauner
2024-11-20 15:05         ` Chuck Lever
2024-11-21  8:34           ` Christian Brauner
2024-11-21 14:54             ` Chuck Lever III
2024-11-21 21:18               ` Hugh Dickins
2024-11-21 21:29                 ` Chuck Lever
2024-11-22  8:49                   ` Amir Goldstein
2024-11-22 14:24                     ` Chuck Lever III
2024-12-04 21:05                     ` Chuck Lever
2024-11-22 12:57               ` Christian Brauner
2024-11-24 21:28                 ` Chuck Lever III
2024-11-20  9:01 ` [RFC PATCH 0/2] Improve simple directory offset wrap behavior Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZzuqYeENJJrLMxwM@tissot.1015granger.net \
    --to=chuck.lever@oracle.com \
    --cc=cel@kernel.org \
    --cc=hughd@google.com \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yangerkun@huaweicloud.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox