From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B0C9CE77188 for ; Tue, 24 Dec 2024 04:40:20 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CABF96B0082; Mon, 23 Dec 2024 23:40:19 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id C5BF06B0083; Mon, 23 Dec 2024 23:40:19 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B23A76B0085; Mon, 23 Dec 2024 23:40:19 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 930836B0082 for ; Mon, 23 Dec 2024 23:40:19 -0500 (EST) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 1A5B280538 for ; Tue, 24 Dec 2024 04:40:19 +0000 (UTC) X-FDA: 82928599446.13.944D1D4 Received: from dggsgout12.his.huawei.com (dggsgout12.his.huawei.com [45.249.212.56]) by imf05.hostedemail.com (Postfix) with ESMTP id 50395100011 for ; Tue, 24 Dec 2024 04:38:58 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=none; spf=pass (imf05.hostedemail.com: domain of yangerkun@huaweicloud.com designates 45.249.212.56 as permitted sender) smtp.mailfrom=yangerkun@huaweicloud.com; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1735015190; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=mlGAu/vpQHTKIWSkDThvE6r4RZDLAg/2PasN+/StLKM=; b=7Tk6EeAcNAVpwAxwCW70BAiFDlzdOqdtlPvzQtE8ZtKwzTfEXERPQJduVk9vWTCP/F/VP1 Ve566n+fdJkTpOhZEv510d49WY/IKSt/rTn4y4rMQup1NE47pfVK9Ukvkf/YIQEywZxY8k 3SvfhniuDnY5vNznJc8tye67gtKhOoY= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=none; spf=pass (imf05.hostedemail.com: domain of yangerkun@huaweicloud.com designates 45.249.212.56 as permitted sender) smtp.mailfrom=yangerkun@huaweicloud.com; dmarc=none ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1735015190; a=rsa-sha256; cv=none; b=0igBOWt8oR+H3WcXNfOdjbkT/lCNZbtp4Or3Z6dYCLmWcpNnU5tufEUU2KKwiIc2Mo2Ggo +mVUsZJ/tb9tdS8+S1a/mixrFD/sLKvZQ03MoPNEOj6uCMyt3BnwYqEyGZPyokbXTOw5/t HOCZOrpPVn4F2r4tOWrTAJII8QfETGI= Received: from mail.maildlp.com (unknown [172.19.163.235]) by dggsgout12.his.huawei.com (SkyGuard) with ESMTP id 4YHMcd1mS9z4f3jdV for ; Tue, 24 Dec 2024 12:39:45 +0800 (CST) Received: from mail02.huawei.com (unknown [10.116.40.128]) by mail.maildlp.com (Postfix) with ESMTP id E5F2F1A06D7 for ; Tue, 24 Dec 2024 12:40:04 +0800 (CST) Received: from [10.174.177.210] (unknown [10.174.177.210]) by APP4 (Coremail) with SMTP id gCh0CgAnT4MjO2pnHwboFQ--.21175S3; Tue, 24 Dec 2024 12:40:04 +0800 (CST) Message-ID: <3976ba47-76c7-28e1-9f20-6e94e0adbbea@huaweicloud.com> Date: Tue, 24 Dec 2024 12:40:03 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.5.1 Subject: Re: [PATCH v6 5/5] libfs: Use d_children list to iterate simple_offset directories To: Chuck Lever , cel@kernel.org, Hugh Dickins , Christian Brauner , Al Viro Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, yukuai3@huawei.com References: <20241220153314.5237-1-cel@kernel.org> <20241220153314.5237-6-cel@kernel.org> <3ccf8255-dfbb-d019-d156-01edf5242c49@huaweicloud.com> From: yangerkun In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-CM-TRANSID:gCh0CgAnT4MjO2pnHwboFQ--.21175S3 X-Coremail-Antispam: 1UD129KBjvJXoWxtF18CrWruF1DAw1fXFW5KFg_yoWfWF1fpF n5JFW5GrW5Xrn3Gr18XF1DXryFyw17G3WUXr18W3W8J3y7Ar12gF1UWrn09ryUJr48Gr1U JF4UKrnxuF45JrJanT9S1TB71UUUUU7qnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2 9KBjDU0xBIdaVrnRJUUUkKb4IE77IF4wAFF20E14v26r4j6ryUM7CY07I20VC2zVCF04k2 6cxKx2IYs7xG6rWj6s0DM7CIcVAFz4kK6r1j6r18M28lY4IEw2IIxxk0rwA2F7IY1VAKz4 vEj48ve4kI8wA2z4x0Y4vE2Ix0cI8IcVAFwI0_Xr0_Ar1l84ACjcxK6xIIjxv20xvEc7Cj xVAFwI0_Cr0_Gr1UM28EF7xvwVC2z280aVAFwI0_GcCE3s1l84ACjcxK6I8E87Iv6xkF7I 0E14v26rxl6s0DM2AIxVAIcxkEcVAq07x20xvEncxIr21l5I8CrVACY4xI64kE6c02F40E x7xfMcIj6xIIjxv20xvE14v26r1j6r18McIj6I8E87Iv67AKxVWUJVW8JwAm72CE4IkC6x 0Yz7v_Jr0_Gr1lF7xvr2IY64vIr41lc7I2V7IY0VAS07AlzVAYIcxG8wCY1x0262kKe7AK xVWUAVWUtwCF04k20xvY0x0EwIxGrwCFx2IqxVCFs4IE7xkEbVWUJVW8JwC20s026c02F4 0E14v26r1j6r18MI8I3I0E7480Y4vE14v26r106r1rMI8E67AF67kF1VAFwI0_Jw0_GFyl IxkGc2Ij64vIr41lIxAIcVC0I7IYx2IY67AKxVWUJVWUCwCI42IY6xIIjxv20xvEc7CjxV AFwI0_Jr0_Gr1lIxAIcVCF04k26cxKx2IYs7xG6r1j6r1xMIIF0xvEx4A2jsIE14v26r1j 6r4UMIIF0xvEx4A2jsIEc7CjxVAFwI0_Jr0_GrUvcSsGvfC2KfnxnUUI43ZEXa7IU1veHD UUUUU== X-CM-SenderInfo: 51dqwvhunx0q5kxd4v5lfo033gof0z/ X-Rspamd-Server: rspam05 X-Stat-Signature: 4rego549mjhq8gcogms8hyjbj5oadttt X-Rspamd-Queue-Id: 50395100011 X-Rspam-User: X-HE-Tag: 1735015138-870969 X-HE-Meta: U2FsdGVkX18UTw8/8nsWgLob0x7i3+T/GVeNABuY2HNfq8s2BYct8r1K+dTE6wMgNBQbHDrprwtgMTq4rZ5qsAfA+YkIDrE7SMFWrVFCfGezV5o6OIgmFpZd2cV7MvviGZnSa4oqq1XIQ+ZMdYYch00Fhqu6ILd6IW5h9zlTxF2yKrR+jDhCdZd13vB/+9LJW22Pt+DvtpYWh1lZAQfNDv3amZ82GEstWIJDrgQCEXPjgBxkWTfdnaM5OLBSNPsayPT40sMAHLv7E4yjCwrkFs/H3mH5CqHyaPwhyS84OdjvOE7t3mkAq9uW1ZWfhCZrcor1q1m90PLlIg2nXJy7ZiMjV41QuFbXjgepSLUOvUd84HzKYIop7LvJCRdnSfWDY9JkHKW4FiVc2fJBse0hffjf9jeLpy8ZrG6IS9/3vE7N7awVKB4RP4pNqjdVAYbom0HKBh1kGvzMXek+Cqw6hoPkiHNVG+H9rhXTHQdoSy7qIM6Tl9CNsnDEZJgj4Hgn+/HnsKMIihErhh9G25awAxB7kKROVQBX8WX/w/cnAW7v54K+gQeiya04TZBN/H/iiejRtFJ4Krl01yLxenYFU+XFEaDtuh+IeOjYzJZZPeG97/WRODGhuVYSl3ZkNk66/3ozkyaJ25YAP/M8gbWqlMxuOSEi5wheK/qA25mosEs0+r4yZhFHolV4K/ZuzAMN8hQYdsSV4VJLzwTpx1YPpJx/RwqA9/0Ln/FJ8Q83k7e9HQr7AguB0f1iKR9x8/YK13+2I6LxdW98FRK+nTfImdwIdu6a+nr1rkdyHyVJPDq0qq4aAt42rSbtTXUbFaRT+90pmdsSo1GQmbPR2IflWFSDXks3+ijS0e3kHjaGxuJ2QyAAFuCR4ZBE0dJxa1MThSqY2Q4SXAPMrZ8RcsRwbQdSH9tvh1ELzlTWonRzuGQwYC6nbKka/zteS1V6t0o7tbxUjRCdH0fGeoqtHOJ 6wBaXkNE FHcpFnwYmDdQWTWLrAEVeCN1elPzzlzXFcZLzvfA+NqSpkS+ca2fmc12ZfbLi5flUBvf4atrlTNR5pUaxl7oAP5ug7YyhwNfQXsl6yD/YAvt5QnqzmBICt6HigWBjHE0KZ05aas+/KeW0JCe8Gm2BeyO9tKHTN9d6jlhAfCIxLNZNxEcxBLBg8cUDJSXulGN3FuLvXo15QbHrVWsOf5ihEdWLaynDB4dSS3XiVSvNV4zRjTE= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: 在 2024/12/23 22:44, Chuck Lever 写道: > On 12/23/24 9:21 AM, yangerkun wrote: >> >> >> 在 2024/12/20 23:33, cel@kernel.org 写道: >>> From: Chuck Lever >>> >>> The mtree mechanism has been effective at creating directory offsets >>> that are stable over multiple opendir instances. However, it has not >>> been able to handle the subtleties of renames that are concurrent >>> with readdir. >>> >>> Instead of using the mtree to emit entries in the order of their >>> offset values, use it only to map incoming ctx->pos to a starting >>> entry. Then use the directory's d_children list, which is already >>> maintained properly by the dcache, to find the next child to emit. >>> >>> One of the sneaky things about this is that when the mtree-allocated >>> offset value wraps (which is very rare), looking up ctx->pos++ is >>> not going to find the next entry; it will return NULL. Instead, by >>> following the d_children list, the offset values can appear in any >>> order but all of the entries in the directory will be visited >>> eventually. >>> >>> Note also that the readdir() is guaranteed to reach the tail of this >>> list. Entries are added only at the head of d_children, and readdir >>> walks from its current position in that list towards its tail. >>> >>> Signed-off-by: Chuck Lever >>> --- >>>   fs/libfs.c | 84 +++++++++++++++++++++++++++++++++++++----------------- >>>   1 file changed, 58 insertions(+), 26 deletions(-) >>> >>> diff --git a/fs/libfs.c b/fs/libfs.c >>> index 5c56783c03a5..f7ead02062ad 100644 >>> --- a/fs/libfs.c >>> +++ b/fs/libfs.c >>> @@ -247,12 +247,13 @@ EXPORT_SYMBOL(simple_dir_inode_operations); >>>   /* simple_offset_add() allocation range */ >>>   enum { >>> -    DIR_OFFSET_MIN        = 2, >>> +    DIR_OFFSET_MIN        = 3, >>>       DIR_OFFSET_MAX        = LONG_MAX - 1, >>>   }; >>>   /* simple_offset_add() never assigns these to a dentry */ >>>   enum { >>> +    DIR_OFFSET_FIRST    = 2,        /* Find first real entry */ >>>       DIR_OFFSET_EOD        = LONG_MAX,    /* Marks EOD */ >>>   }; >>> @@ -458,51 +459,82 @@ static loff_t offset_dir_llseek(struct file >>> *file, loff_t offset, int whence) >>>       return vfs_setpos(file, offset, LONG_MAX); >>>   } >>> -static struct dentry *offset_find_next(struct offset_ctx *octx, >>> loff_t offset) >>> +static struct dentry *find_positive_dentry(struct dentry *parent, >>> +                       struct dentry *dentry, >>> +                       bool next) >>>   { >>> -    MA_STATE(mas, &octx->mt, offset, offset); >>> +    struct dentry *found = NULL; >>> + >>> +    spin_lock(&parent->d_lock); >>> +    if (next) >>> +        dentry = d_next_sibling(dentry); >>> +    else if (!dentry) >>> +        dentry = d_first_child(parent); >>> +    hlist_for_each_entry_from(dentry, d_sib) { >>> +        if (!simple_positive(dentry)) >>> +            continue; >>> +        spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); >>> +        if (simple_positive(dentry)) >>> +            found = dget_dlock(dentry); >>> +        spin_unlock(&dentry->d_lock); >>> +        if (likely(found)) >>> +            break; >>> +    } >>> +    spin_unlock(&parent->d_lock); >>> +    return found; >>> +} >>> + >>> +static noinline_for_stack struct dentry * >>> +offset_dir_lookup(struct dentry *parent, loff_t offset) >>> +{ >>> +    struct inode *inode = d_inode(parent); >>> +    struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode); >>>       struct dentry *child, *found = NULL; >>> -    rcu_read_lock(); >>> -    child = mas_find(&mas, DIR_OFFSET_MAX); >>> -    if (!child) >>> -        goto out; >>> -    spin_lock(&child->d_lock); >>> -    if (simple_positive(child)) >>> -        found = dget_dlock(child); >>> -    spin_unlock(&child->d_lock); >>> -out: >>> -    rcu_read_unlock(); >>> +    MA_STATE(mas, &octx->mt, offset, offset); >>> + >>> +    if (offset == DIR_OFFSET_FIRST) >>> +        found = find_positive_dentry(parent, NULL, false); >>> +    else { >>> +        rcu_read_lock(); >>> +        child = mas_find(&mas, DIR_OFFSET_MAX); >> >> Can this child be NULL? > > Yes, this mas_find() call can return NULL. find_positive_dentry() should > then return NULL. Kind of subtle. > > >> Like we delete some file after first readdir, maybe we should break >> here, or we may rescan all dentry and return them to userspace again? > > You mean to deal with the case where the "next" entry has an offset > that is lower than @offset? mas_find() will return the entry in the > tree that is "at or after" mas->index. > > I'm not sure either "break" or returning repeats is safe. But, now that > you point it out, this function probably does need additional logic to > deal with the offset wrap case. > > But since this logic already exists here, IMO it is reasonable to leave > that to be addressed by a subsequent patch. So far there aren't any > regression test failures that warn of a user-visible problem the way it > is now. Sorry for the confusing, the case I am talking is something like below: mkdir /tmp/dir && cd /tmp/dir touch file1 # offset is 3 touch file2 # offset is 4 touch file3 # offset is 5 touch file4 # offset is 6 touch file5 # offset is 7 first readdir and get file5 file4 file3 file2 #ctx->pos is 3, which means we will get file1 for second readdir unlink file1 # can not get entry for ctx->pos == 3 second readdir # offset_dir_lookup will use mas_find but return NULL, and we will get file5 file4 file3 file2 again? And for the offset wrap case, I prefer it's safe with your patch if we won't unlink file between two readdir. The second readdir will use an active ctx->pos which means there is a active dentry attach to this ctx->pos. find_positive_dentry will stop once we meet the last child. I am not sure if I understand correctly, if not, please point out! Thanks! > > >>> +        found = find_positive_dentry(parent, child, false); >>> +        rcu_read_unlock(); >>> +    } >>>       return found; >>>   } >>>   static bool offset_dir_emit(struct dir_context *ctx, struct dentry >>> *dentry) >>>   { >>>       struct inode *inode = d_inode(dentry); >>> -    long offset = dentry2offset(dentry); >>> -    return ctx->actor(ctx, dentry->d_name.name, dentry->d_name.len, >>> offset, >>> -              inode->i_ino, fs_umode_to_dtype(inode->i_mode)); >>> +    return dir_emit(ctx, dentry->d_name.name, dentry->d_name.len, >>> +            inode->i_ino, fs_umode_to_dtype(inode->i_mode)); >>>   } >>> -static void offset_iterate_dir(struct inode *inode, struct >>> dir_context *ctx) >>> +static void offset_iterate_dir(struct file *file, struct dir_context >>> *ctx) >>>   { >>> -    struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode); >>> +    struct dentry *dir = file->f_path.dentry; >>>       struct dentry *dentry; >>> +    dentry = offset_dir_lookup(dir, ctx->pos); >>> +    if (!dentry) >>> +        goto out_eod; >>>       while (true) { >>> -        dentry = offset_find_next(octx, ctx->pos); >>> -        if (!dentry) >>> -            goto out_eod; >>> +        struct dentry *next; >>> -        if (!offset_dir_emit(ctx, dentry)) { >>> -            dput(dentry); >>> +        ctx->pos = dentry2offset(dentry); >>> +        if (!offset_dir_emit(ctx, dentry)) >>>               break; >>> -        } >>> -        ctx->pos = dentry2offset(dentry) + 1; >>> +        next = find_positive_dentry(dir, dentry, true); >>>           dput(dentry); >>> + >>> +        if (!next) >>> +            goto out_eod; >>> +        dentry = next; >>>       } >>> +    dput(dentry); >>>       return; >>>   out_eod: >>> @@ -541,7 +573,7 @@ static int offset_readdir(struct file *file, >>> struct dir_context *ctx) >>>       if (!dir_emit_dots(file, ctx)) >>>           return 0; >>>       if (ctx->pos != DIR_OFFSET_EOD) >>> -        offset_iterate_dir(d_inode(dir), ctx); >>> +        offset_iterate_dir(file, ctx); >>>       return 0; >>>   } >> > >