* [PATCH v3 0/3] symlink length caching
@ 2024-11-20 11:20 Mateusz Guzik
2024-11-20 11:20 ` [PATCH v3 1/3] vfs: support caching symlink lengths in inodes Mateusz Guzik
` (4 more replies)
0 siblings, 5 replies; 11+ messages in thread
From: Mateusz Guzik @ 2024-11-20 11:20 UTC (permalink / raw)
To: brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, hughd, linux-ext4,
tytso, linux-mm, Mateusz Guzik
quote:
When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
speed up when issuing readlink on /initrd.img on ext4.
The size is stored in a union with i_devices, which is never looked at
unless the inode is for a device.
Benchmark code at the bottom.
ext4 and tmpfs are patched, other filesystems can also get there with
some more work.
Arguably the current get_link API should be patched to let the fs return
the size, but that's not a churn I'm interested into diving in.
On my v1 Jan remarked 1.5% is not a particularly high win questioning
whether doing this makes sense. I noted the value is only this small
because of other slowdowns.
To elaborate here are highlights while benching on Sapphire Rapids:
1. putname using atomics (over 3.5% on my profile)
sounds like Al has plans to do something here(?), I'm not touching it if
it can be helped. the atomic definitely does not have to be there in the
common case.
2. kmem_cache_alloc_noprof/kmem_cache_free (over 7% combined)
They are both dog slow due to cmpxchg16b. A patchset was posted which
adds a different allocation/free fast path which should whack majority
of the problem, see: https://lore.kernel.org/linux-mm/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz/
If this lands I'll definitely try to make the pathname allocs use it,
should drop about 5-6 percentage points on this sucker.
3. __legitimize_mnt issues a full fence (again over 3%)
As far as avoiding the fence is concerned waiting on rcu grace period on
unmount should do the trick. However, I found there is a bunch
complexity there to sort out before doing this will be feasible (notably
there are multiple mounts freed in one go, this needs to be batched).
There may be other issues which I missed and which make this not worth
it, but the fence is definitely avoidable in principle and I would be
surprised if there was no way to sensibly get there. No ETA, for now I'm
just pointing this out.
There is also the idea to speculatively elide lockref, but when I tried
that last time I ran into significant LSM-related trouble.
All that aside there is also quite a bit of branching and func calling
which does not need to be there (example: make vfsuid/vfsgid, could be
combined into one routine etc.).
Ultimately there is single-threaded perf left on the table in various
spots.
v3:
- use a union instead of a dedicated field, used up with i_devices
v2:
- add a dedicated field, flag and a helper instead of using i_size
https://lore.kernel.org/linux-fsdevel/20241119094555.660666-1-mjguzik@gmail.com/
v1 can be found here:
https://lore.kernel.org/linux-fsdevel/20241118085357.494178-1-mjguzik@gmail.com/
benchmark:
plug into will-it-scale into tests/readlink1.c:
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>
char *testcase_description = "readlink /initrd.img";
void testcase(unsigned long long *iterations, unsigned long nr)
{
char *tmplink = "/initrd.img";
char buf[1024];
while (1) {
int error = readlink(tmplink, buf, sizeof(buf));
assert(error > 0);
(*iterations)++;
}
}
Mateusz Guzik (3):
vfs: support caching symlink lengths in inodes
ext4: use inode_set_cached_link()
tmpfs: use inode_set_cached_link()
fs/ext4/inode.c | 3 ++-
fs/ext4/namei.c | 4 +++-
fs/namei.c | 34 +++++++++++++++++++---------------
fs/proc/namespaces.c | 2 +-
include/linux/fs.h | 15 +++++++++++++--
mm/shmem.c | 6 ++++--
security/apparmor/apparmorfs.c | 2 +-
7 files changed, 43 insertions(+), 23 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v3 1/3] vfs: support caching symlink lengths in inodes
2024-11-20 11:20 [PATCH v3 0/3] symlink length caching Mateusz Guzik
@ 2024-11-20 11:20 ` Mateusz Guzik
2024-11-21 10:12 ` Christian Brauner
2024-11-20 11:20 ` [PATCH v3 2/3] ext4: use inode_set_cached_link() Mateusz Guzik
` (3 subsequent siblings)
4 siblings, 1 reply; 11+ messages in thread
From: Mateusz Guzik @ 2024-11-20 11:20 UTC (permalink / raw)
To: brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, hughd, linux-ext4,
tytso, linux-mm, Mateusz Guzik
When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
speed up when issuing readlink on /initrd.img on ext4.
Filesystems opt in by calling inode_set_cached_link() when creating an
inode.
The size is stored in a new union utilizing the same space as i_devices,
thus avoiding growing the struct or taking up any more space.
Churn-wise the current readlink_copy() helper is patched to accept the
size instead of calculating it.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
fs/namei.c | 34 +++++++++++++++++++---------------
fs/proc/namespaces.c | 2 +-
include/linux/fs.h | 15 +++++++++++++--
security/apparmor/apparmorfs.c | 2 +-
4 files changed, 34 insertions(+), 19 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 9d30c7aa9aa6..e56c29a22d26 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5272,19 +5272,16 @@ SYSCALL_DEFINE2(rename, const char __user *, oldname, const char __user *, newna
getname(newname), 0);
}
-int readlink_copy(char __user *buffer, int buflen, const char *link)
+int readlink_copy(char __user *buffer, int buflen, const char *link, int linklen)
{
- int len = PTR_ERR(link);
- if (IS_ERR(link))
- goto out;
+ int copylen;
- len = strlen(link);
- if (len > (unsigned) buflen)
- len = buflen;
- if (copy_to_user(buffer, link, len))
- len = -EFAULT;
-out:
- return len;
+ copylen = linklen;
+ if (unlikely(copylen > (unsigned) buflen))
+ copylen = buflen;
+ if (copy_to_user(buffer, link, copylen))
+ copylen = -EFAULT;
+ return copylen;
}
/**
@@ -5304,6 +5301,9 @@ int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen)
const char *link;
int res;
+ if (inode->i_opflags & IOP_CACHED_LINK)
+ return readlink_copy(buffer, buflen, inode->i_link, inode->i_linklen);
+
if (unlikely(!(inode->i_opflags & IOP_DEFAULT_READLINK))) {
if (unlikely(inode->i_op->readlink))
return inode->i_op->readlink(dentry, buffer, buflen);
@@ -5322,7 +5322,7 @@ int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen)
if (IS_ERR(link))
return PTR_ERR(link);
}
- res = readlink_copy(buffer, buflen, link);
+ res = readlink_copy(buffer, buflen, link, strlen(link));
do_delayed_call(&done);
return res;
}
@@ -5391,10 +5391,14 @@ EXPORT_SYMBOL(page_put_link);
int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
{
+ const char *link;
+ int res;
+
DEFINE_DELAYED_CALL(done);
- int res = readlink_copy(buffer, buflen,
- page_get_link(dentry, d_inode(dentry),
- &done));
+ link = page_get_link(dentry, d_inode(dentry), &done);
+ res = PTR_ERR(link);
+ if (!IS_ERR(link))
+ res = readlink_copy(buffer, buflen, link, strlen(link));
do_delayed_call(&done);
return res;
}
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8e159fc78c0a..c610224faf10 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -83,7 +83,7 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
res = ns_get_name(name, sizeof(name), task, ns_ops);
if (res >= 0)
- res = readlink_copy(buffer, buflen, name);
+ res = readlink_copy(buffer, buflen, name, strlen(name));
}
put_task_struct(task);
return res;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e29433c5ecc..2cc98de5af43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -626,6 +626,7 @@ is_uncached_acl(struct posix_acl *acl)
#define IOP_XATTR 0x0008
#define IOP_DEFAULT_READLINK 0x0010
#define IOP_MGTIME 0x0020
+#define IOP_CACHED_LINK 0x0040
/*
* Keep mostly read-only and often accessed (especially for
@@ -723,7 +724,10 @@ struct inode {
};
struct file_lock_context *i_flctx;
struct address_space i_data;
- struct list_head i_devices;
+ union {
+ struct list_head i_devices;
+ int i_linklen;
+ };
union {
struct pipe_inode_info *i_pipe;
struct cdev *i_cdev;
@@ -749,6 +753,13 @@ struct inode {
void *i_private; /* fs or device private pointer */
} __randomize_layout;
+static inline void inode_set_cached_link(struct inode *inode, char *link, int linklen)
+{
+ inode->i_link = link;
+ inode->i_linklen = linklen;
+ inode->i_opflags |= IOP_CACHED_LINK;
+}
+
/*
* Get bit address from inode->i_state to use with wait_var_event()
* infrastructre.
@@ -3351,7 +3362,7 @@ extern const struct file_operations generic_ro_fops;
#define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))
-extern int readlink_copy(char __user *, int, const char *);
+extern int readlink_copy(char __user *, int, const char *, int);
extern int page_readlink(struct dentry *, char __user *, int);
extern const char *page_get_link(struct dentry *, struct inode *,
struct delayed_call *);
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 01b923d97a44..60959cfba672 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -2611,7 +2611,7 @@ static int policy_readlink(struct dentry *dentry, char __user *buffer,
res = snprintf(name, sizeof(name), "%s:[%lu]", AAFS_NAME,
d_inode(dentry)->i_ino);
if (res > 0 && res < sizeof(name))
- res = readlink_copy(buffer, buflen, name);
+ res = readlink_copy(buffer, buflen, name, strlen(name));
else
res = -ENOENT;
--
2.43.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v3 2/3] ext4: use inode_set_cached_link()
2024-11-20 11:20 [PATCH v3 0/3] symlink length caching Mateusz Guzik
2024-11-20 11:20 ` [PATCH v3 1/3] vfs: support caching symlink lengths in inodes Mateusz Guzik
@ 2024-11-20 11:20 ` Mateusz Guzik
2024-11-21 11:58 ` Jan Kara
2024-11-20 11:20 ` [PATCH v3 3/3] tmpfs: " Mateusz Guzik
` (2 subsequent siblings)
4 siblings, 1 reply; 11+ messages in thread
From: Mateusz Guzik @ 2024-11-20 11:20 UTC (permalink / raw)
To: brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, hughd, linux-ext4,
tytso, linux-mm, Mateusz Guzik
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
fs/ext4/inode.c | 3 ++-
fs/ext4/namei.c | 4 +++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 89aade6f45f6..7c54ae5fcbd4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5006,10 +5006,11 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (IS_ENCRYPTED(inode)) {
inode->i_op = &ext4_encrypted_symlink_inode_operations;
} else if (ext4_inode_is_fast_symlink(inode)) {
- inode->i_link = (char *)ei->i_data;
inode->i_op = &ext4_fast_symlink_inode_operations;
nd_terminate_link(ei->i_data, inode->i_size,
sizeof(ei->i_data) - 1);
+ inode_set_cached_link(inode, (char *)ei->i_data,
+ inode->i_size);
} else {
inode->i_op = &ext4_symlink_inode_operations;
}
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index bcf2737078b8..536d56d15072 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3418,7 +3418,6 @@ static int ext4_symlink(struct mnt_idmap *idmap, struct inode *dir,
inode->i_op = &ext4_symlink_inode_operations;
} else {
inode->i_op = &ext4_fast_symlink_inode_operations;
- inode->i_link = (char *)&EXT4_I(inode)->i_data;
}
}
@@ -3434,6 +3433,9 @@ static int ext4_symlink(struct mnt_idmap *idmap, struct inode *dir,
disk_link.len);
inode->i_size = disk_link.len - 1;
EXT4_I(inode)->i_disksize = inode->i_size;
+ if (!IS_ENCRYPTED(inode))
+ inode_set_cached_link(inode, (char *)&EXT4_I(inode)->i_data,
+ inode->i_size);
}
err = ext4_add_nondir(handle, dentry, &inode);
if (handle)
--
2.43.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v3 3/3] tmpfs: use inode_set_cached_link()
2024-11-20 11:20 [PATCH v3 0/3] symlink length caching Mateusz Guzik
2024-11-20 11:20 ` [PATCH v3 1/3] vfs: support caching symlink lengths in inodes Mateusz Guzik
2024-11-20 11:20 ` [PATCH v3 2/3] ext4: use inode_set_cached_link() Mateusz Guzik
@ 2024-11-20 11:20 ` Mateusz Guzik
2024-11-21 11:59 ` Jan Kara
2024-11-21 12:34 ` [PATCH v3 0/3] symlink length caching Christian Brauner
2024-11-21 14:16 ` Jeff Layton
4 siblings, 1 reply; 11+ messages in thread
From: Mateusz Guzik @ 2024-11-20 11:20 UTC (permalink / raw)
To: brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, hughd, linux-ext4,
tytso, linux-mm, Mateusz Guzik
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
mm/shmem.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index c7881e16f4be..135f38eb2ff1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3868,6 +3868,7 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
int len;
struct inode *inode;
struct folio *folio;
+ char *link;
len = strlen(symname) + 1;
if (len > PAGE_SIZE)
@@ -3889,12 +3890,13 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
inode->i_size = len-1;
if (len <= SHORT_SYMLINK_LEN) {
- inode->i_link = kmemdup(symname, len, GFP_KERNEL);
- if (!inode->i_link) {
+ link = kmemdup(symname, len, GFP_KERNEL);
+ if (!link) {
error = -ENOMEM;
goto out_remove_offset;
}
inode->i_op = &shmem_short_symlink_operations;
+ inode_set_cached_link(inode, link, len - 1);
} else {
inode_nohighmem(inode);
inode->i_mapping->a_ops = &shmem_aops;
--
2.43.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 1/3] vfs: support caching symlink lengths in inodes
2024-11-20 11:20 ` [PATCH v3 1/3] vfs: support caching symlink lengths in inodes Mateusz Guzik
@ 2024-11-21 10:12 ` Christian Brauner
2024-11-21 13:56 ` Mateusz Guzik
2024-11-22 1:56 ` Dave Chinner
0 siblings, 2 replies; 11+ messages in thread
From: Christian Brauner @ 2024-11-21 10:12 UTC (permalink / raw)
To: Mateusz Guzik
Cc: viro, jack, linux-kernel, linux-fsdevel, hughd, linux-ext4,
tytso, linux-mm
On Wed, Nov 20, 2024 at 12:20:34PM +0100, Mateusz Guzik wrote:
> When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
> speed up when issuing readlink on /initrd.img on ext4.
>
> Filesystems opt in by calling inode_set_cached_link() when creating an
> inode.
>
> The size is stored in a new union utilizing the same space as i_devices,
> thus avoiding growing the struct or taking up any more space.
>
> Churn-wise the current readlink_copy() helper is patched to accept the
> size instead of calculating it.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
> fs/namei.c | 34 +++++++++++++++++++---------------
> fs/proc/namespaces.c | 2 +-
> include/linux/fs.h | 15 +++++++++++++--
> security/apparmor/apparmorfs.c | 2 +-
> 4 files changed, 34 insertions(+), 19 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 9d30c7aa9aa6..e56c29a22d26 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -5272,19 +5272,16 @@ SYSCALL_DEFINE2(rename, const char __user *, oldname, const char __user *, newna
> getname(newname), 0);
> }
>
> -int readlink_copy(char __user *buffer, int buflen, const char *link)
> +int readlink_copy(char __user *buffer, int buflen, const char *link, int linklen)
> {
> - int len = PTR_ERR(link);
> - if (IS_ERR(link))
> - goto out;
> + int copylen;
>
> - len = strlen(link);
> - if (len > (unsigned) buflen)
> - len = buflen;
> - if (copy_to_user(buffer, link, len))
> - len = -EFAULT;
> -out:
> - return len;
> + copylen = linklen;
> + if (unlikely(copylen > (unsigned) buflen))
> + copylen = buflen;
> + if (copy_to_user(buffer, link, copylen))
> + copylen = -EFAULT;
> + return copylen;
> }
>
> /**
> @@ -5304,6 +5301,9 @@ int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen)
> const char *link;
> int res;
>
> + if (inode->i_opflags & IOP_CACHED_LINK)
> + return readlink_copy(buffer, buflen, inode->i_link, inode->i_linklen);
> +
> if (unlikely(!(inode->i_opflags & IOP_DEFAULT_READLINK))) {
> if (unlikely(inode->i_op->readlink))
> return inode->i_op->readlink(dentry, buffer, buflen);
> @@ -5322,7 +5322,7 @@ int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen)
> if (IS_ERR(link))
> return PTR_ERR(link);
> }
> - res = readlink_copy(buffer, buflen, link);
> + res = readlink_copy(buffer, buflen, link, strlen(link));
> do_delayed_call(&done);
> return res;
> }
> @@ -5391,10 +5391,14 @@ EXPORT_SYMBOL(page_put_link);
>
> int page_readlink(struct dentry *dentry, char __user *buffer, int buflen)
> {
> + const char *link;
> + int res;
> +
> DEFINE_DELAYED_CALL(done);
> - int res = readlink_copy(buffer, buflen,
> - page_get_link(dentry, d_inode(dentry),
> - &done));
> + link = page_get_link(dentry, d_inode(dentry), &done);
> + res = PTR_ERR(link);
> + if (!IS_ERR(link))
> + res = readlink_copy(buffer, buflen, link, strlen(link));
> do_delayed_call(&done);
> return res;
> }
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8e159fc78c0a..c610224faf10 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -83,7 +83,7 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
> if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
> res = ns_get_name(name, sizeof(name), task, ns_ops);
> if (res >= 0)
> - res = readlink_copy(buffer, buflen, name);
> + res = readlink_copy(buffer, buflen, name, strlen(name));
> }
> put_task_struct(task);
> return res;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7e29433c5ecc..2cc98de5af43 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -626,6 +626,7 @@ is_uncached_acl(struct posix_acl *acl)
> #define IOP_XATTR 0x0008
> #define IOP_DEFAULT_READLINK 0x0010
> #define IOP_MGTIME 0x0020
> +#define IOP_CACHED_LINK 0x0040
>
> /*
> * Keep mostly read-only and often accessed (especially for
> @@ -723,7 +724,10 @@ struct inode {
> };
> struct file_lock_context *i_flctx;
> struct address_space i_data;
> - struct list_head i_devices;
> + union {
> + struct list_head i_devices;
> + int i_linklen;
> + };
I think that i_devices should be moved into the union as it's really
only used with i_cdev but it's not that easily done because list_head
needs to be initialized. I roughly envisioned something like:
union {
struct {
struct cdev *i_cdev;
struct list_head i_devices;
};
struct {
char *i_link;
unsigned int i_link_len;
};
struct pipe_inode_info *i_pipe;
unsigned i_dir_seq;
};
But it's not important enough imho.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 2/3] ext4: use inode_set_cached_link()
2024-11-20 11:20 ` [PATCH v3 2/3] ext4: use inode_set_cached_link() Mateusz Guzik
@ 2024-11-21 11:58 ` Jan Kara
0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2024-11-21 11:58 UTC (permalink / raw)
To: Mateusz Guzik
Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, hughd,
linux-ext4, tytso, linux-mm
On Wed 20-11-24 12:20:35, Mateusz Guzik wrote:
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/inode.c | 3 ++-
> fs/ext4/namei.c | 4 +++-
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 89aade6f45f6..7c54ae5fcbd4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5006,10 +5006,11 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
> if (IS_ENCRYPTED(inode)) {
> inode->i_op = &ext4_encrypted_symlink_inode_operations;
> } else if (ext4_inode_is_fast_symlink(inode)) {
> - inode->i_link = (char *)ei->i_data;
> inode->i_op = &ext4_fast_symlink_inode_operations;
> nd_terminate_link(ei->i_data, inode->i_size,
> sizeof(ei->i_data) - 1);
> + inode_set_cached_link(inode, (char *)ei->i_data,
> + inode->i_size);
> } else {
> inode->i_op = &ext4_symlink_inode_operations;
> }
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index bcf2737078b8..536d56d15072 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -3418,7 +3418,6 @@ static int ext4_symlink(struct mnt_idmap *idmap, struct inode *dir,
> inode->i_op = &ext4_symlink_inode_operations;
> } else {
> inode->i_op = &ext4_fast_symlink_inode_operations;
> - inode->i_link = (char *)&EXT4_I(inode)->i_data;
> }
> }
>
> @@ -3434,6 +3433,9 @@ static int ext4_symlink(struct mnt_idmap *idmap, struct inode *dir,
> disk_link.len);
> inode->i_size = disk_link.len - 1;
> EXT4_I(inode)->i_disksize = inode->i_size;
> + if (!IS_ENCRYPTED(inode))
> + inode_set_cached_link(inode, (char *)&EXT4_I(inode)->i_data,
> + inode->i_size);
> }
> err = ext4_add_nondir(handle, dentry, &inode);
> if (handle)
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 3/3] tmpfs: use inode_set_cached_link()
2024-11-20 11:20 ` [PATCH v3 3/3] tmpfs: " Mateusz Guzik
@ 2024-11-21 11:59 ` Jan Kara
0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2024-11-21 11:59 UTC (permalink / raw)
To: Mateusz Guzik
Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, hughd,
linux-ext4, tytso, linux-mm
On Wed 20-11-24 12:20:36, Mateusz Guzik wrote:
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/shmem.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c7881e16f4be..135f38eb2ff1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3868,6 +3868,7 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
> int len;
> struct inode *inode;
> struct folio *folio;
> + char *link;
>
> len = strlen(symname) + 1;
> if (len > PAGE_SIZE)
> @@ -3889,12 +3890,13 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir,
>
> inode->i_size = len-1;
> if (len <= SHORT_SYMLINK_LEN) {
> - inode->i_link = kmemdup(symname, len, GFP_KERNEL);
> - if (!inode->i_link) {
> + link = kmemdup(symname, len, GFP_KERNEL);
> + if (!link) {
> error = -ENOMEM;
> goto out_remove_offset;
> }
> inode->i_op = &shmem_short_symlink_operations;
> + inode_set_cached_link(inode, link, len - 1);
> } else {
> inode_nohighmem(inode);
> inode->i_mapping->a_ops = &shmem_aops;
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 0/3] symlink length caching
2024-11-20 11:20 [PATCH v3 0/3] symlink length caching Mateusz Guzik
` (2 preceding siblings ...)
2024-11-20 11:20 ` [PATCH v3 3/3] tmpfs: " Mateusz Guzik
@ 2024-11-21 12:34 ` Christian Brauner
2024-11-21 14:16 ` Jeff Layton
4 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2024-11-21 12:34 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Christian Brauner, viro, jack, linux-kernel, linux-fsdevel,
hughd, linux-ext4, tytso, linux-mm
On Wed, 20 Nov 2024 12:20:33 +0100, Mateusz Guzik wrote:
> quote:
> When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
> speed up when issuing readlink on /initrd.img on ext4.
>
> The size is stored in a union with i_devices, which is never looked at
> unless the inode is for a device.
>
> [...]
Applied to the vfs-6.14.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.14.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.14.misc
[1/3] vfs: support caching symlink lengths in inodes
https://git.kernel.org/vfs/vfs/c/5cbb3c7e0051
[2/3] ext4: use inode_set_cached_link()
https://git.kernel.org/vfs/vfs/c/740456f67017
[3/3] tmpfs: use inode_set_cached_link()
https://git.kernel.org/vfs/vfs/c/30071e02c163
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 1/3] vfs: support caching symlink lengths in inodes
2024-11-21 10:12 ` Christian Brauner
@ 2024-11-21 13:56 ` Mateusz Guzik
2024-11-22 1:56 ` Dave Chinner
1 sibling, 0 replies; 11+ messages in thread
From: Mateusz Guzik @ 2024-11-21 13:56 UTC (permalink / raw)
To: Christian Brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, hughd, linux-ext4,
tytso, linux-mm
On Thu, Nov 21, 2024 at 11:12 AM Christian Brauner <brauner@kernel.org> wrote:
> I think that i_devices should be moved into the union as it's really
> only used with i_cdev but it's not that easily done because list_head
> needs to be initialized. I roughly envisioned something like:
>
> union {
> struct {
> struct cdev *i_cdev;
> struct list_head i_devices;
> };
> struct {
> char *i_link;
> unsigned int i_link_len;
> };
> struct pipe_inode_info *i_pipe;
> unsigned i_dir_seq;
> };
>
> But it's not important enough imho.
I thought about doing that, but decided not to. I mentioned some time
ago that the layout of struct inode is false-sharing friendly and the
kernel is not in shape where this can be sensibly fixed yet and it may
or may not affect what to do with the above.
On the stock kernel the issues are:
- a spurious lockref acquire/release -- I posted a patch for it, Al
did not like it and wrote his own, does not look like it landed though
- apparmor -- everything serializes on label ref management (this *is*
used by ubuntu for example, but also the lkp machinery). Other people
posted patchsets to get rid of the problem, but they ran into their
own snags. I have a wip with of my own.
Anyhow, with these eliminated it will be possible to evaluate what
happens with inode rearrengements. Until then I think any messing with
the layout is best avoided.
thanks for taking the patchset
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 0/3] symlink length caching
2024-11-20 11:20 [PATCH v3 0/3] symlink length caching Mateusz Guzik
` (3 preceding siblings ...)
2024-11-21 12:34 ` [PATCH v3 0/3] symlink length caching Christian Brauner
@ 2024-11-21 14:16 ` Jeff Layton
4 siblings, 0 replies; 11+ messages in thread
From: Jeff Layton @ 2024-11-21 14:16 UTC (permalink / raw)
To: Mateusz Guzik, brauner
Cc: viro, jack, linux-kernel, linux-fsdevel, hughd, linux-ext4,
tytso, linux-mm
On Wed, 2024-11-20 at 12:20 +0100, Mateusz Guzik wrote:
> quote:
> When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
> speed up when issuing readlink on /initrd.img on ext4.
>
> The size is stored in a union with i_devices, which is never looked at
> unless the inode is for a device.
>
> Benchmark code at the bottom.
>
> ext4 and tmpfs are patched, other filesystems can also get there with
> some more work.
>
> Arguably the current get_link API should be patched to let the fs return
> the size, but that's not a churn I'm interested into diving in.
>
> On my v1 Jan remarked 1.5% is not a particularly high win questioning
> whether doing this makes sense. I noted the value is only this small
> because of other slowdowns.
>
> To elaborate here are highlights while benching on Sapphire Rapids:
> 1. putname using atomics (over 3.5% on my profile)
>
> sounds like Al has plans to do something here(?), I'm not touching it if
> it can be helped. the atomic definitely does not have to be there in the
> common case.
>
> 2. kmem_cache_alloc_noprof/kmem_cache_free (over 7% combined)
>
> They are both dog slow due to cmpxchg16b. A patchset was posted which
> adds a different allocation/free fast path which should whack majority
> of the problem, see: https://lore.kernel.org/linux-mm/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz/
>
> If this lands I'll definitely try to make the pathname allocs use it,
> should drop about 5-6 percentage points on this sucker.
>
> 3. __legitimize_mnt issues a full fence (again over 3%)
>
> As far as avoiding the fence is concerned waiting on rcu grace period on
> unmount should do the trick. However, I found there is a bunch
> complexity there to sort out before doing this will be feasible (notably
> there are multiple mounts freed in one go, this needs to be batched).
> There may be other issues which I missed and which make this not worth
> it, but the fence is definitely avoidable in principle and I would be
> surprised if there was no way to sensibly get there. No ETA, for now I'm
> just pointing this out.
>
> There is also the idea to speculatively elide lockref, but when I tried
> that last time I ran into significant LSM-related trouble.
>
> All that aside there is also quite a bit of branching and func calling
> which does not need to be there (example: make vfsuid/vfsgid, could be
> combined into one routine etc.).
>
> Ultimately there is single-threaded perf left on the table in various
> spots.
>
> v3:
> - use a union instead of a dedicated field, used up with i_devices
>
> v2:
> - add a dedicated field, flag and a helper instead of using i_size
> https://lore.kernel.org/linux-fsdevel/20241119094555.660666-1-mjguzik@gmail.com/
>
> v1 can be found here:
> https://lore.kernel.org/linux-fsdevel/20241118085357.494178-1-mjguzik@gmail.com/
>
> benchmark:
> plug into will-it-scale into tests/readlink1.c:
>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <assert.h>
> #include <string.h>
>
> char *testcase_description = "readlink /initrd.img";
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
> char *tmplink = "/initrd.img";
> char buf[1024];
>
> while (1) {
> int error = readlink(tmplink, buf, sizeof(buf));
> assert(error > 0);
>
> (*iterations)++;
> }
> }
>
> Mateusz Guzik (3):
> vfs: support caching symlink lengths in inodes
> ext4: use inode_set_cached_link()
> tmpfs: use inode_set_cached_link()
>
> fs/ext4/inode.c | 3 ++-
> fs/ext4/namei.c | 4 +++-
> fs/namei.c | 34 +++++++++++++++++++---------------
> fs/proc/namespaces.c | 2 +-
> include/linux/fs.h | 15 +++++++++++++--
> mm/shmem.c | 6 ++++--
> security/apparmor/apparmorfs.c | 2 +-
> 7 files changed, 43 insertions(+), 23 deletions(-)
>
Nice work, Mateusz!
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 1/3] vfs: support caching symlink lengths in inodes
2024-11-21 10:12 ` Christian Brauner
2024-11-21 13:56 ` Mateusz Guzik
@ 2024-11-22 1:56 ` Dave Chinner
1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2024-11-22 1:56 UTC (permalink / raw)
To: Christian Brauner
Cc: Mateusz Guzik, viro, jack, linux-kernel, linux-fsdevel, hughd,
linux-ext4, tytso, linux-mm
On Thu, Nov 21, 2024 at 11:12:52AM +0100, Christian Brauner wrote:
> I think that i_devices should be moved into the union as it's really
> only used with i_cdev but it's not that easily done because list_head
> needs to be initialized.
I'm planning on using i_devices with block devices, too, so the
block device list doesn't need to use i_sb_list anymore (similar to
how i_devices is used by the char dev infrastructure. See the patch
below...
> I roughly envisioned something like:
>
> union {
> struct {
> struct cdev *i_cdev;
> struct list_head i_devices;
> };
> struct {
> char *i_link;
> unsigned int i_link_len;
> };
> struct pipe_inode_info *i_pipe;
> unsigned i_dir_seq;
> };
>
I'd probably have to undo any unioning/association with i_cdev to
use i_devices with block devs...
-Dave
--
Dave Chinner
david@fromorbit.com
bdev: stop using sb->s_inodes
From: Dave Chinner <dchinner@redhat.com>
Iteration of block device inodes is done via the
blockdev_superblock->s_inodes list. We want to remove this list and
the inode i_sb_list list heads, so we need some other way for block
devices to be iterated.
Take a leaf from the chardev code and use the inode->i_devices list
head to link all the block device inodes together and replace the
s_inodes list with a bdev private global list.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
block/bdev.c | 56 +++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 39 insertions(+), 17 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 33f9c4605e3a..d733507f584a 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -317,6 +317,8 @@ EXPORT_SYMBOL(bdev_thaw);
static __cacheline_aligned_in_smp DEFINE_MUTEX(bdev_lock);
static struct kmem_cache *bdev_cachep __ro_after_init;
+static LIST_HEAD(bdev_inodes);
+static DEFINE_SPINLOCK(bdev_inodes_lock);
static struct inode *bdev_alloc_inode(struct super_block *sb)
{
@@ -362,6 +364,10 @@ static void init_once(void *data)
static void bdev_evict_inode(struct inode *inode)
{
+ spin_lock(&bdev_inodes_lock);
+ list_del_init(&inode->i_devices);
+ spin_unlock(&bdev_inodes_lock);
+
truncate_inode_pages_final(&inode->i_data);
invalidate_inode_buffers(inode); /* is it needed here? */
clear_inode(inode);
@@ -412,19 +418,35 @@ void __init bdev_cache_init(void)
blockdev_superblock = blockdev_mnt->mnt_sb; /* For writeback */
}
-struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
+static struct inode *bdev_new_inode(void)
{
- struct block_device *bdev;
struct inode *inode;
- inode = new_inode(blockdev_superblock);
+ inode = new_inode_pseudo(blockdev_superblock);
if (!inode)
return NULL;
+
+ spin_lock(&bdev_inodes_lock);
+ list_add(&inode->i_devices, &bdev_inodes);
+ spin_unlock(&bdev_inodes_lock);
+
inode->i_mode = S_IFBLK;
inode->i_rdev = 0;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+ return inode;
+}
+
+struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
+{
+ struct block_device *bdev;
+ struct inode *inode;
+
+ inode = bdev_new_inode();
+ if (!inode)
+ return NULL;
+
bdev = I_BDEV(inode);
mutex_init(&bdev->bd_fsfreeze_mutex);
spin_lock_init(&bdev->bd_size_lock);
@@ -477,10 +499,10 @@ long nr_blockdev_pages(void)
struct inode *inode;
long ret = 0;
- spin_lock(&blockdev_superblock->s_inode_list_lock);
- list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list)
+ spin_lock(&bdev_inodes_lock);
+ list_for_each_entry(inode, &bdev_inodes, i_devices)
ret += inode->i_mapping->nrpages;
- spin_unlock(&blockdev_superblock->s_inode_list_lock);
+ spin_unlock(&bdev_inodes_lock);
return ret;
}
@@ -1218,8 +1240,8 @@ void sync_bdevs(bool wait)
{
struct inode *inode, *old_inode = NULL;
- spin_lock(&blockdev_superblock->s_inode_list_lock);
- list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+ spin_lock(&bdev_inodes_lock);
+ list_for_each_entry(inode, &bdev_inodes, i_devices) {
struct address_space *mapping = inode->i_mapping;
struct block_device *bdev;
@@ -1231,14 +1253,14 @@ void sync_bdevs(bool wait)
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&blockdev_superblock->s_inode_list_lock);
+ spin_unlock(&bdev_inodes_lock);
+
/*
- * We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the
- * s_inode_list_lock We cannot iput the inode now as we can
- * be holding the last reference and we cannot iput it under
- * s_inode_list_lock. So we keep the reference and iput it
- * later.
+ * We hold a reference to 'inode' so it won't get removed from
+ * bdev inodes list while we drop the lock. We need to hold the
+ * reference until we have a reference on the next inode on the
+ * list, so we can't drop it until the next time we let go of
+ * the bdev_inodes_lock.
*/
iput(old_inode);
old_inode = inode;
@@ -1260,9 +1282,9 @@ void sync_bdevs(bool wait)
}
mutex_unlock(&bdev->bd_disk->open_mutex);
- spin_lock(&blockdev_superblock->s_inode_list_lock);
+ spin_lock(&bdev_inodes_lock);
}
- spin_unlock(&blockdev_superblock->s_inode_list_lock);
+ spin_unlock(&bdev_inodes_lock);
iput(old_inode);
}
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-11-22 1:56 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-20 11:20 [PATCH v3 0/3] symlink length caching Mateusz Guzik
2024-11-20 11:20 ` [PATCH v3 1/3] vfs: support caching symlink lengths in inodes Mateusz Guzik
2024-11-21 10:12 ` Christian Brauner
2024-11-21 13:56 ` Mateusz Guzik
2024-11-22 1:56 ` Dave Chinner
2024-11-20 11:20 ` [PATCH v3 2/3] ext4: use inode_set_cached_link() Mateusz Guzik
2024-11-21 11:58 ` Jan Kara
2024-11-20 11:20 ` [PATCH v3 3/3] tmpfs: " Mateusz Guzik
2024-11-21 11:59 ` Jan Kara
2024-11-21 12:34 ` [PATCH v3 0/3] symlink length caching Christian Brauner
2024-11-21 14:16 ` Jeff Layton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox