On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> >
> > Would your problem go away toggling this every -- say -- 8 cpus?
>
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
>
> So I don't think we need to do something like this unless a regression
> is observed.
>

To my reading it reduces the max time spent with irqs disabled, which of
course it does -- after all it toggles them for every CPU.

Per my other e-mail in the thread, the irq + lock trips are still not
cheap, at least on Sapphire Rapids.

In my testing outlined below I see an 11% increase in total execution
time with the irq + lock trip taken for every CPU in a 24-way vm.

So I stand by instead doing this every n CPUs, call it 8 or whatever. A
rough sketch of what I mean follows the patch at the end of this mail.

How to repro:

I employed a poor-man's profiler like so:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'

This patch or not, execution time varies wildly even while the box is
idle.

The above runs for a minute, collecting 23 samples (you may get "lucky"
and get one extra; in that case remove it for comparison).

A sysctl was added to toggle the new behavior vs the old one. Patch at
the end. "enabled" (1) means the new behavior, "disabled" (0) means the
old one.

Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled: 903610
enabled:  1006833 (+11.4%)

Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU

I attached the stats I got for reference.
I patched v6.14 with the following:

diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname	= "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1		= SYSCTL_LONG_ZERO,
 		.extra2		= SYSCTL_LONG_MAX,
 	},
+	{
+		.procname	= "magic_tunable",
+		.data		= &magic_tunable,
+		.maxlen		= sizeof(magic_tunable),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= SYSCTL_LONG_ZERO,
+		.extra2		= SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname	= "nr_open",
 		.data		= &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+extern unsigned long magic_tunable;
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (READ_ONCE(magic_tunable)) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
 			__cgroup_rstat_lock(cgrp, cpu);
+		} else {
+			if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+				__cgroup_rstat_unlock(cgrp, cpu);
+				if (!cond_resched())
+					cpu_relax();
+				__cgroup_rstat_lock(cgrp, cpu);
+			}
 		}
 	}
 }
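
For illustration, the every-n-CPUs variant could look roughly like the
sketch below. This is untested and only meant to show the shape of it:
RSTAT_RELOCK_BATCH and since_relock are made-up names, 8 is an arbitrary
batch size, and the per-cpu flush body is elided.

/*
 * Untested sketch: pay the unlock + irq trip at most once every
 * RSTAT_RELOCK_BATCH CPUs instead of once per CPU.
 */
#define RSTAT_RELOCK_BATCH	8

static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
	int cpu, since_relock = 0;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		/* ... per-cpu flush as in the current code ... */

		/*
		 * Bound irq-off time without relocking for every CPU:
		 * only drop the lock once per RSTAT_RELOCK_BATCH CPUs.
		 */
		if (++since_relock < RSTAT_RELOCK_BATCH)
			continue;
		since_relock = 0;

		__cgroup_rstat_unlock(cgrp, cpu);
		if (!cond_resched())
			cpu_relax();
		__cgroup_rstat_lock(cgrp, cpu);
	}
}

Whether to also keep the need_resched()/spin_needbreak() check for the
in-between CPUs is a separate knob.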