[RFC PATCH v2 0/5] Improve simple directory offset wrap behavior

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior
@ 2024-11-26 15:54 cel
  2024-11-26 15:54 ` [RFC PATCH v2 1/5] libfs: Return ENOSPC when the directory offset range is exhausted cel
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: cel @ 2024-11-26 15:54 UTC (permalink / raw)
  To: Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, yangerkun, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

This series attempts to narrow some gaps in the current tmpfs
directory offset mechanism that relate to misbehaviors reported by
Yu Kuai <yukuai3@huawei.com> and Yang Erkun <yangerkun@huawei.com>.

This series replaces the v1 mechanism I posted last week. It
reverts offset_readdir() to use the directory's d_children list,
instead of its mtree, for finding the next entry, as readdir did
before v6.6. Directory offset values continue to be allocated via
an mtree.

The purpose of this change is to enable readdir results to continue
to appear after a directory offset/cookie value wrap, while not
regressing generic/736. That should enable this fix to be
backported (manually) to v6.6 to address CVE-2024-46701.

These are still a little unpolished. I expect review to find
opportunities for further code reuse.

These patches pass xfstests except for generic/013, generic/637, and
generic/650. There appears to be a problem with WHITEOUT renames
which I am still looking into.

The series has been pushed to:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=tmpfs-fixes

Chuck Lever (5):
  libfs: Return ENOSPC when the directory offset range is exhausted
  libfs: Check dentry before locking in simple_offset_empty()
  Revert "libfs: fix infinite directory reads for offset dir"
  libfs: Refactor end-of-directory detection for simple_offset
    directories
  libfs: Refactor offset_iterate_dir()

 fs/libfs.c | 145 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 109 insertions(+), 36 deletions(-)

-- 
2.47.0

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 1/5] libfs: Return ENOSPC when the directory offset range is exhausted
  2024-11-26 15:54 [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior cel
@ 2024-11-26 15:54 ` cel
  2024-11-27  3:11   ` yangerkun
  2024-11-26 15:54 ` [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty() cel
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: cel @ 2024-11-26 15:54 UTC (permalink / raw)
  To: Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, yangerkun, Chuck Lever, stable

From: Chuck Lever <chuck.lever@oracle.com>

Testing shows that the EBUSY error return from mtree_alloc_cyclic()
leaks into user space. The ERRORS section of "man creat(2)" says:

>	EBUSY	O_EXCL was specified in flags and pathname refers
>		to a block device that is in use by the system
>		(e.g., it is mounted).

ENOSPC is closer to what applications expect in this situation.

Note that the normal range of simple directory offset values is
2..2^63, so hitting this error is going to be rare to impossible.

Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets")
Cc: <stable@vger.kernel.org> # v6.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 46966fd8bcf9..bf67954b525b 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -288,7 +288,9 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
 
 	ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
 				 LONG_MAX, &octx->next_offset, GFP_KERNEL);
-	if (ret < 0)
+	if (unlikely(ret == -EBUSY))
+		return -ENOSPC;
+	if (unlikely(ret < 0))
 		return ret;
 
 	offset_set(dentry, offset);
-- 
2.47.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 1/5] libfs: Return ENOSPC when the directory offset range is exhausted
  2024-11-26 15:54 ` [RFC PATCH v2 1/5] libfs: Return ENOSPC when the directory offset range is exhausted cel
@ 2024-11-27  3:11   ` yangerkun
  0 siblings, 0 replies; 13+ messages in thread
From: yangerkun @ 2024-11-27  3:11 UTC (permalink / raw)
  To: cel, Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, Chuck Lever, stable

LGTM

Reviewed-by: Yang Erkun <yangerkun@huawei.com>

在 2024/11/26 23:54, cel@kernel.org 写道:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Testing shows that the EBUSY error return from mtree_alloc_cyclic()
> leaks into user space. The ERRORS section of "man creat(2)" says:
> 
>> 	EBUSY	O_EXCL was specified in flags and pathname refers
>> 		to a block device that is in use by the system
>> 		(e.g., it is mounted).
> 
> ENOSPC is closer to what applications expect in this situation.
> 
> Note that the normal range of simple directory offset values is
> 2..2^63, so hitting this error is going to be rare to impossible.
> 
> Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets")
> Cc: <stable@vger.kernel.org> # v6.9+
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   fs/libfs.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 46966fd8bcf9..bf67954b525b 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -288,7 +288,9 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
>   
>   	ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
>   				 LONG_MAX, &octx->next_offset, GFP_KERNEL);
> -	if (ret < 0)
> +	if (unlikely(ret == -EBUSY))
> +		return -ENOSPC;
> +	if (unlikely(ret < 0))
>   		return ret;
>   
>   	offset_set(dentry, offset);



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty()
  2024-11-26 15:54 [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior cel
  2024-11-26 15:54 ` [RFC PATCH v2 1/5] libfs: Return ENOSPC when the directory offset range is exhausted cel
@ 2024-11-26 15:54 ` cel
  2024-11-27  3:09   ` yangerkun
  2024-11-26 15:54 ` [RFC PATCH v2 3/5] Revert "libfs: fix infinite directory reads for offset dir" cel
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: cel @ 2024-11-26 15:54 UTC (permalink / raw)
  To: Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, yangerkun, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Defensive change: Don't try to lock a dentry unless it is positive.
Trying to lock a negative entry will generate a refcount underflow.

The underflow has been seen only while testing.

Fixes: ecba88a3b32d ("libfs: Add simple_offset_empty()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index bf67954b525b..c88ed15437c7 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -347,13 +347,14 @@ int simple_offset_empty(struct dentry *dentry)
 	index = DIR_OFFSET_MIN;
 	octx = inode->i_op->get_offset_ctx(inode);
 	mt_for_each(&octx->mt, child, index, LONG_MAX) {
-		spin_lock(&child->d_lock);
 		if (simple_positive(child)) {
+			spin_lock(&child->d_lock);
+			if (simple_positive(child))
+				ret = 0;
 			spin_unlock(&child->d_lock);
-			ret = 0;
-			break;
+			if (!ret)
+				break;
 		}
-		spin_unlock(&child->d_lock);
 	}
 
 	return ret;
-- 
2.47.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty()
  2024-11-26 15:54 ` [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty() cel
@ 2024-11-27  3:09   ` yangerkun
  2024-11-27 14:36     ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: yangerkun @ 2024-11-27  3:09 UTC (permalink / raw)
  To: cel, Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, Chuck Lever

Thank you very much for your efforts on this issue!

在 2024/11/26 23:54, cel@kernel.org 写道:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Defensive change: Don't try to lock a dentry unless it is positive.
> Trying to lock a negative entry will generate a refcount underflow.

Which member trigger this underflow?

> 
> The underflow has been seen only while testing.
> 
> Fixes: ecba88a3b32d ("libfs: Add simple_offset_empty()")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   fs/libfs.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index bf67954b525b..c88ed15437c7 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -347,13 +347,14 @@ int simple_offset_empty(struct dentry *dentry)
>   	index = DIR_OFFSET_MIN;
>   	octx = inode->i_op->get_offset_ctx(inode);
>   	mt_for_each(&octx->mt, child, index, LONG_MAX) {
> -		spin_lock(&child->d_lock);
>   		if (simple_positive(child)) {
> +			spin_lock(&child->d_lock);
> +			if (simple_positive(child))
> +				ret = 0;
>   			spin_unlock(&child->d_lock);
> -			ret = 0;
> -			break;
> +			if (!ret)
> +				break;
>   		}
> -		spin_unlock(&child->d_lock);
>   	}

Calltrace arrived here means this is a active dir(a dentry with positive 
inode), and nowdays only .rmdir / .rename2 for shmem can reach this 
point. Lock for this dir inode has already been held, maybe this can 
protect child been negative or active? So d_lock here is no need?

>   
>   	return ret;



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty()
  2024-11-27  3:09   ` yangerkun
@ 2024-11-27 14:36     ` Chuck Lever
  2024-11-29  8:17       ` yangerkun
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2024-11-27 14:36 UTC (permalink / raw)
  To: yangerkun
  Cc: cel, Hugh Dickens, Christian Brauner, Al Viro, linux-fsdevel,
	linux-mm, yukuai3

On Wed, Nov 27, 2024 at 11:09:11AM +0800, yangerkun wrote:
> Thank you very much for your efforts on this issue!
> 
> 在 2024/11/26 23:54, cel@kernel.org 写道:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Defensive change: Don't try to lock a dentry unless it is positive.
> > Trying to lock a negative entry will generate a refcount underflow.
> 
> Which member trigger this underflow?

dput() encounters a dentry refcount underflow because a negative
dentry's refcount is already zero.

But perhaps it would be more accurate to say this patch attempts to
avoid triggering the DEBUG_LOCKS_WARN_ON in hlock_class.


> > The underflow has been seen only while testing.
> > 
> > Fixes: ecba88a3b32d ("libfs: Add simple_offset_empty()")
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >   fs/libfs.c | 9 +++++----
> >   1 file changed, 5 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/libfs.c b/fs/libfs.c
> > index bf67954b525b..c88ed15437c7 100644
> > --- a/fs/libfs.c
> > +++ b/fs/libfs.c
> > @@ -347,13 +347,14 @@ int simple_offset_empty(struct dentry *dentry)
> >   	index = DIR_OFFSET_MIN;
> >   	octx = inode->i_op->get_offset_ctx(inode);
> >   	mt_for_each(&octx->mt, child, index, LONG_MAX) {
> > -		spin_lock(&child->d_lock);
> >   		if (simple_positive(child)) {
> > +			spin_lock(&child->d_lock);
> > +			if (simple_positive(child))
> > +				ret = 0;
> >   			spin_unlock(&child->d_lock);
> > -			ret = 0;
> > -			break;
> > +			if (!ret)
> > +				break;
> >   		}
> > -		spin_unlock(&child->d_lock);
> >   	}
> 
> Calltrace arrived here means this is a active dir(a dentry with positive
> inode), and nowdays only .rmdir / .rename2 for shmem can reach this point.
> Lock for this dir inode has already been held, maybe this can protect child
> been negative or active? So d_lock here is no need?

My assumption was that child->d_lock was necessary for an
authoritative determination of whether "child" is positive or
negative. If holding that lock isn't necessary, then I agree,
there's no need to take child->d_lock here at all... There's
clearly nothing else to protect in this code path.


> >   	return ret;

-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty()
  2024-11-27 14:36     ` Chuck Lever
@ 2024-11-29  8:17       ` yangerkun
  2024-11-30 17:03         ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: yangerkun @ 2024-11-29  8:17 UTC (permalink / raw)
  To: Chuck Lever
  Cc: cel, Hugh Dickens, Christian Brauner, Al Viro, linux-fsdevel,
	linux-mm, yukuai3



在 2024/11/27 22:36, Chuck Lever 写道:
> On Wed, Nov 27, 2024 at 11:09:11AM +0800, yangerkun wrote:
>> Thank you very much for your efforts on this issue!
>>
>> 在 2024/11/26 23:54, cel@kernel.org 写道:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> Defensive change: Don't try to lock a dentry unless it is positive.
>>> Trying to lock a negative entry will generate a refcount underflow.
>>
>> Which member trigger this underflow?
> 
> dput() encounters a dentry refcount underflow because a negative
> dentry's refcount is already zero.

OK, so you mean dentry->d_lockref here.

But I cannot see why we can trigger this, if it is, it's actually a bug...

Can you give a more detail why can trigger this?

> 
> But perhaps it would be more accurate to say this patch attempts to
> avoid triggering the DEBUG_LOCKS_WARN_ON in hlock_class.

DEBUG_LOCKS_WARN_ON in hlock_class means this class_idx not in use. I 
prefer this cannot happen. Am I think wrong..

> 
> 
>>> The underflow has been seen only while testing.
>>>
>>> Fixes: ecba88a3b32d ("libfs: Add simple_offset_empty()")
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>>    fs/libfs.c | 9 +++++----
>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/libfs.c b/fs/libfs.c
>>> index bf67954b525b..c88ed15437c7 100644
>>> --- a/fs/libfs.c
>>> +++ b/fs/libfs.c
>>> @@ -347,13 +347,14 @@ int simple_offset_empty(struct dentry *dentry)
>>>    	index = DIR_OFFSET_MIN;
>>>    	octx = inode->i_op->get_offset_ctx(inode);
>>>    	mt_for_each(&octx->mt, child, index, LONG_MAX) {
>>> -		spin_lock(&child->d_lock);
>>>    		if (simple_positive(child)) {
>>> +			spin_lock(&child->d_lock);
>>> +			if (simple_positive(child))
>>> +				ret = 0;
>>>    			spin_unlock(&child->d_lock);
>>> -			ret = 0;
>>> -			break;
>>> +			if (!ret)
>>> +				break;
>>>    		}
>>> -		spin_unlock(&child->d_lock);
>>>    	}
>>
>> Calltrace arrived here means this is a active dir(a dentry with positive
>> inode), and nowdays only .rmdir / .rename2 for shmem can reach this point.
>> Lock for this dir inode has already been held, maybe this can protect child
>> been negative or active? So d_lock here is no need?
> 
> My assumption was that child->d_lock was necessary for an
> authoritative determination of whether "child" is positive or
> negative. If holding that lock isn't necessary, then I agree,
> there's no need to take child->d_lock here at all... There's
> clearly nothing else to protect in this code path.
> 
> 
>>>    	return ret;
> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty()
  2024-11-29  8:17       ` yangerkun
@ 2024-11-30 17:03         ` Chuck Lever
  0 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2024-11-30 17:03 UTC (permalink / raw)
  To: yangerkun
  Cc: cel, Hugh Dickens, Christian Brauner, Al Viro, linux-fsdevel,
	linux-mm, yukuai3

On Fri, Nov 29, 2024 at 04:17:56PM +0800, yangerkun wrote:
> 
> 
> 在 2024/11/27 22:36, Chuck Lever 写道:
> > On Wed, Nov 27, 2024 at 11:09:11AM +0800, yangerkun wrote:
> > > Thank you very much for your efforts on this issue!
> > > 
> > > 在 2024/11/26 23:54, cel@kernel.org 写道:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Defensive change: Don't try to lock a dentry unless it is positive.
> > > > Trying to lock a negative entry will generate a refcount underflow.
> > > 
> > > Which member trigger this underflow?
> > 
> > dput() encounters a dentry refcount underflow because a negative
> > dentry's refcount is already zero.
> 
> OK, so you mean dentry->d_lockref here.
> 
> But I cannot see why we can trigger this, if it is, it's actually a bug...
> 
> Can you give a more detail why can trigger this?

I think it is possible for the mtree lookup in simple_offset_empty()
to return a pointer to a negative dentry, perhaps due to a race.
There is nothing explicit that prevents a dentry from going negative
while it is still stored in the mtree, although that condition would
be quite brief. So, best to harden the "is dentry positive" check
there so that it acts like other dentry locking code in fs/libfs.c.

Perhaps labeling this patch as a "clean up" would be more clear. I
will remove the "Fixes:" tag -- it does not fix a bug in the current
code.


-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 3/5] Revert "libfs: fix infinite directory reads for offset dir"
  2024-11-26 15:54 [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior cel
  2024-11-26 15:54 ` [RFC PATCH v2 1/5] libfs: Return ENOSPC when the directory offset range is exhausted cel
  2024-11-26 15:54 ` [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty() cel
@ 2024-11-26 15:54 ` cel
  2024-11-26 15:54 ` [RFC PATCH v2 4/5] libfs: Refactor end-of-directory detection for simple_offset directories cel
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: cel @ 2024-11-26 15:54 UTC (permalink / raw)
  To: Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, yangerkun, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Using octx->next_offset to determine the newest entries works only
because the offset value range is 63-bits. If an offset were to
wrap, existing entries are no longer visible to readdir because
offset_readdir() stops listing entries once an entry's offset is
larger than octx->next_offset.

Revert this fix for the infinite readdir loop bug to make room for
a better fix.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c | 35 +++++++++++------------------------
 1 file changed, 11 insertions(+), 24 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index c88ed15437c7..e6c46b13fc71 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -453,14 +453,6 @@ void simple_offset_destroy(struct offset_ctx *octx)
 	mtree_destroy(&octx->mt);
 }
 
-static int offset_dir_open(struct inode *inode, struct file *file)
-{
-	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
-
-	file->private_data = (void *)ctx->next_offset;
-	return 0;
-}
-
 /**
  * offset_dir_llseek - Advance the read position of a directory descriptor
  * @file: an open directory whose position is to be updated
@@ -474,9 +466,6 @@ static int offset_dir_open(struct inode *inode, struct file *file)
  */
 static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 {
-	struct inode *inode = file->f_inode;
-	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
-
 	switch (whence) {
 	case SEEK_CUR:
 		offset += file->f_pos;
@@ -490,8 +479,7 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 	}
 
 	/* In this case, ->private_data is protected by f_pos_lock */
-	if (!offset)
-		file->private_data = (void *)ctx->next_offset;
+	file->private_data = NULL;
 	return vfs_setpos(file, offset, LONG_MAX);
 }
 
@@ -522,7 +510,7 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
 }
 
-static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
+static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 {
 	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
 	struct dentry *dentry;
@@ -530,21 +518,17 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
 	while (true) {
 		dentry = offset_find_next(octx, ctx->pos);
 		if (!dentry)
-			return;
-
-		if (dentry2offset(dentry) >= last_index) {
-			dput(dentry);
-			return;
-		}
+			return ERR_PTR(-ENOENT);
 
 		if (!offset_dir_emit(ctx, dentry)) {
 			dput(dentry);
-			return;
+			break;
 		}
 
 		ctx->pos = dentry2offset(dentry) + 1;
 		dput(dentry);
 	}
+	return NULL;
 }
 
 /**
@@ -571,19 +555,22 @@ static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, lon
 static int offset_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dir = file->f_path.dentry;
-	long last_index = (long)file->private_data;
 
 	lockdep_assert_held(&d_inode(dir)->i_rwsem);
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
 
-	offset_iterate_dir(d_inode(dir), ctx, last_index);
+	/* In this case, ->private_data is protected by f_pos_lock */
+	if (ctx->pos == DIR_OFFSET_MIN)
+		file->private_data = NULL;
+	else if (file->private_data == ERR_PTR(-ENOENT))
+		return 0;
+	file->private_data = offset_iterate_dir(d_inode(dir), ctx);
 	return 0;
 }
 
 const struct file_operations simple_offset_dir_operations = {
-	.open		= offset_dir_open,
 	.llseek		= offset_dir_llseek,
 	.iterate_shared	= offset_readdir,
 	.read		= generic_read_dir,
-- 
2.47.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 4/5] libfs: Refactor end-of-directory detection for simple_offset directories
  2024-11-26 15:54 [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior cel
                   ` (2 preceding siblings ...)
  2024-11-26 15:54 ` [RFC PATCH v2 3/5] Revert "libfs: fix infinite directory reads for offset dir" cel
@ 2024-11-26 15:54 ` cel
  2024-11-26 15:54 ` [RFC PATCH v2 5/5] libfs: Refactor offset_iterate_dir() cel
  2024-11-26 16:24 ` [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior Christoph Hellwig
  5 siblings, 0 replies; 13+ messages in thread
From: cel @ 2024-11-26 15:54 UTC (permalink / raw)
  To: Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, yangerkun, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

This mechanism seems have been misunderstood more than once. Make
the code more self-documentary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c | 54 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index e6c46b13fc71..be641a84047a 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -453,6 +453,34 @@ void simple_offset_destroy(struct offset_ctx *octx)
 	mtree_destroy(&octx->mt);
 }
 
+static void offset_set_eod(struct file *file)
+{
+	file->private_data = ERR_PTR(-ENOENT);
+}
+
+static void offset_clear_eod(struct file *file)
+{
+	file->private_data = NULL;
+}
+
+static bool offset_at_eod(struct file *file)
+{
+	return file->private_data == ERR_PTR(-ENOENT);
+}
+
+/**
+ * offset_dir_open - Open a directory descriptor
+ * @inode: directory to be opened
+ * @file: struct file to instantiate
+ *
+ * Returns zero on success, or a negative errno value.
+ */
+static int offset_dir_open(struct inode *inode, struct file *file)
+{
+	offset_clear_eod(file);
+	return 0;
+}
+
 /**
  * offset_dir_llseek - Advance the read position of a directory descriptor
  * @file: an open directory whose position is to be updated
@@ -478,8 +506,8 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 		return -EINVAL;
 	}
 
-	/* In this case, ->private_data is protected by f_pos_lock */
-	file->private_data = NULL;
+	/* ->private_data is protected by f_pos_lock */
+	offset_clear_eod(file);
 	return vfs_setpos(file, offset, LONG_MAX);
 }
 
@@ -510,15 +538,20 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
 }
 
-static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
+static void offset_iterate_dir(struct file *file, struct dir_context *ctx)
 {
+	struct dentry *dir = file->f_path.dentry;
+	struct inode *inode = d_inode(dir);
 	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
 	struct dentry *dentry;
 
 	while (true) {
 		dentry = offset_find_next(octx, ctx->pos);
-		if (!dentry)
-			return ERR_PTR(-ENOENT);
+		if (!dentry) {
+			/* ->private_data is protected by f_pos_lock */
+			offset_set_eod(file);
+			return;
+		}
 
 		if (!offset_dir_emit(ctx, dentry)) {
 			dput(dentry);
@@ -528,7 +561,6 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 		ctx->pos = dentry2offset(dentry) + 1;
 		dput(dentry);
 	}
-	return NULL;
 }
 
 /**
@@ -561,16 +593,14 @@ static int offset_readdir(struct file *file, struct dir_context *ctx)
 	if (!dir_emit_dots(file, ctx))
 		return 0;
 
-	/* In this case, ->private_data is protected by f_pos_lock */
-	if (ctx->pos == DIR_OFFSET_MIN)
-		file->private_data = NULL;
-	else if (file->private_data == ERR_PTR(-ENOENT))
-		return 0;
-	file->private_data = offset_iterate_dir(d_inode(dir), ctx);
+	/* ->private_data is protected by f_pos_lock */
+	if (!offset_at_eod(file))
+		offset_iterate_dir(file, ctx);
 	return 0;
 }
 
 const struct file_operations simple_offset_dir_operations = {
+	.open		= offset_dir_open,
 	.llseek		= offset_dir_llseek,
 	.iterate_shared	= offset_readdir,
 	.read		= generic_read_dir,
-- 
2.47.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 5/5] libfs: Refactor offset_iterate_dir()
  2024-11-26 15:54 [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior cel
                   ` (3 preceding siblings ...)
  2024-11-26 15:54 ` [RFC PATCH v2 4/5] libfs: Refactor end-of-directory detection for simple_offset directories cel
@ 2024-11-26 15:54 ` cel
  2024-11-26 16:24 ` [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior Christoph Hellwig
  5 siblings, 0 replies; 13+ messages in thread
From: cel @ 2024-11-26 15:54 UTC (permalink / raw)
  To: Hugh Dickens, Christian Brauner, Al Viro
  Cc: linux-fsdevel, linux-mm, yukuai3, yangerkun, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

This line in offset_iterate_dir():

		ctx->pos = dentry2offset(dentry) + 1;

assumes that the next child entry has an offset value that is
greater than the current child entry. Since directory offsets are
actually cookies, this heuristic is not always correct.

We have tested the current code with a limited offset range to see
if this is an operational problem. It doesn't seem to be, but doing
a "+ 1" on what is supposed to be an opaque cookie is very likely
wrong and brittle.

Instead of using the mtree to emit entries in the order of their
offset values, use it only to map the initial ctx->pos to a starting
entry. Then use the directory's d_children list, which is already
maintained by the dcache, to find the next child to emit, as the
simple cursor-based implementation still does.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c | 89 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 71 insertions(+), 18 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index be641a84047a..862b4203d389 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -241,9 +241,9 @@ const struct inode_operations simple_dir_inode_operations = {
 };
 EXPORT_SYMBOL(simple_dir_inode_operations);
 
-/* 0 is '.', 1 is '..', so always start with offset 2 or more */
 enum {
-	DIR_OFFSET_MIN	= 2,
+	DIR_OFFSET_FIRST	= 2,	/* seek to the first real entry */
+	DIR_OFFSET_MIN		= 3,	/* lowest real offset value */
 };
 
 static void offset_set(struct dentry *dentry, long offset)
@@ -267,7 +267,7 @@ void simple_offset_init(struct offset_ctx *octx)
 {
 	mt_init_flags(&octx->mt, MT_FLAGS_ALLOC_RANGE);
 	lockdep_set_class(&octx->mt.ma_lock, &simple_offset_lock_class);
-	octx->next_offset = DIR_OFFSET_MIN;
+	octx->next_offset = 0;
 }
 
 /**
@@ -511,10 +511,30 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 	return vfs_setpos(file, offset, LONG_MAX);
 }
 
-static struct dentry *offset_find_next(struct offset_ctx *octx, loff_t offset)
+static noinline_for_stack struct dentry *offset_dir_first(struct file *file)
 {
+	struct dentry *child, *found = NULL, *dir = file->f_path.dentry;
+
+	spin_lock(&dir->d_lock);
+	child = d_first_child(dir);
+	if (child && simple_positive(child)) {
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+		if (simple_positive(child))
+			found = dget_dlock(child);
+		spin_unlock(&child->d_lock);
+	}
+	spin_unlock(&dir->d_lock);
+	return found;
+}
+
+static noinline_for_stack struct dentry *
+offset_dir_lookup(struct file *file, loff_t offset)
+{
+	struct dentry *child, *found = NULL, *dir = file->f_path.dentry;
+	struct inode *inode = d_inode(dir);
+	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
+
 	MA_STATE(mas, &octx->mt, offset, offset);
-	struct dentry *child, *found = NULL;
 
 	rcu_read_lock();
 	child = mas_find(&mas, LONG_MAX);
@@ -538,29 +558,62 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 			  inode->i_ino, fs_umode_to_dtype(inode->i_mode));
 }
 
+/*
+ * This is find_next_child() without the dput() tail. We might
+ * combine offset_dir_next() and find_next_child().
+ */
+static struct dentry *offset_dir_next(struct dentry *dentry)
+{
+	struct dentry *parent = dentry->d_parent;
+	struct dentry *d, *found = NULL;
+
+	spin_lock(&parent->d_lock);
+	d = d_next_sibling(dentry);
+	hlist_for_each_entry_from(d, d_sib) {
+		if (simple_positive(d)) {
+			spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED);
+			if (simple_positive(d))
+				found = dget_dlock(d);
+			spin_unlock(&d->d_lock);
+			if (likely(found))
+				break;
+		}
+	}
+	spin_unlock(&parent->d_lock);
+	return found;
+}
+
 static void offset_iterate_dir(struct file *file, struct dir_context *ctx)
 {
-	struct dentry *dir = file->f_path.dentry;
-	struct inode *inode = d_inode(dir);
-	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
-	struct dentry *dentry;
+	struct dentry *dentry, *next = NULL;
+
+	if (ctx->pos == DIR_OFFSET_FIRST)
+		dentry = offset_dir_first(file);
+	else
+		dentry = offset_dir_lookup(file, ctx->pos);
+	if (!dentry) {
+		/* ->private_data is protected by f_pos_lock */
+		offset_set_eod(file);
+		return;
+	}
 
 	while (true) {
-		dentry = offset_find_next(octx, ctx->pos);
-		if (!dentry) {
-			/* ->private_data is protected by f_pos_lock */
-			offset_set_eod(file);
-			return;
-		}
-
 		if (!offset_dir_emit(ctx, dentry)) {
-			dput(dentry);
+			ctx->pos = dentry2offset(dentry);
+			break;
+		}
+
+		next = offset_dir_next(dentry);
+		if (!next) {
+			offset_set_eod(file);
 			break;
 		}
 
-		ctx->pos = dentry2offset(dentry) + 1;
 		dput(dentry);
+		dentry = next;
 	}
+
+	dput(dentry);
 }
 
 /**
-- 
2.47.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior
  2024-11-26 15:54 [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior cel
                   ` (4 preceding siblings ...)
  2024-11-26 15:54 ` [RFC PATCH v2 5/5] libfs: Refactor offset_iterate_dir() cel
@ 2024-11-26 16:24 ` Christoph Hellwig
  2024-11-26 16:59   ` Chuck Lever III
  5 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2024-11-26 16:24 UTC (permalink / raw)
  To: cel
  Cc: Hugh Dickens, Christian Brauner, Al Viro, linux-fsdevel,
	linux-mm, yukuai3, yangerkun, Chuck Lever

On Tue, Nov 26, 2024 at 10:54:39AM -0500, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> This series attempts to narrow some gaps in the current tmpfs
> directory offset mechanism that relate to misbehaviors reported by
> Yu Kuai <yukuai3@huawei.com> and Yang Erkun <yangerkun@huawei.com>.

Any chance you could write xfstests test cases to exercise these
corner cases?



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior
  2024-11-26 16:24 ` [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior Christoph Hellwig
@ 2024-11-26 16:59   ` Chuck Lever III
  0 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever III @ 2024-11-26 16:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chuck Lever, Hugh Dickens, Christian Brauner, Al Viro,
	Linux FS Devel, linux-mm, yukuai (C),
	yangerkun



> On Nov 26, 2024, at 11:24 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Tue, Nov 26, 2024 at 10:54:39AM -0500, cel@kernel.org wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>> 
>> This series attempts to narrow some gaps in the current tmpfs
>> directory offset mechanism that relate to misbehaviors reported by
>> Yu Kuai <yukuai3@huawei.com> and Yang Erkun <yangerkun@huawei.com>.
> 
> Any chance you could write xfstests test cases to exercise these
> corner cases?

generic/736 exercises the readdir loop after rename.

Kuai's test requires kernel code changes to reduce the
range of directory offset values that tmpfs can assign.
Triggering an offset value wrap with the mtree offset
allocator is not possible in millions of years.

I'll have to think about how that can be made into an
automatible test.

--
Chuck Lever



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-11-30 17:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-26 15:54 [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior cel
2024-11-26 15:54 ` [RFC PATCH v2 1/5] libfs: Return ENOSPC when the directory offset range is exhausted cel
2024-11-27  3:11   ` yangerkun
2024-11-26 15:54 ` [RFC PATCH v2 2/5] libfs: Check dentry before locking in simple_offset_empty() cel
2024-11-27  3:09   ` yangerkun
2024-11-27 14:36     ` Chuck Lever
2024-11-29  8:17       ` yangerkun
2024-11-30 17:03         ` Chuck Lever
2024-11-26 15:54 ` [RFC PATCH v2 3/5] Revert "libfs: fix infinite directory reads for offset dir" cel
2024-11-26 15:54 ` [RFC PATCH v2 4/5] libfs: Refactor end-of-directory detection for simple_offset directories cel
2024-11-26 15:54 ` [RFC PATCH v2 5/5] libfs: Refactor offset_iterate_dir() cel
2024-11-26 16:24 ` [RFC PATCH v2 0/5] Improve simple directory offset wrap behavior Christoph Hellwig
2024-11-26 16:59   ` Chuck Lever III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox